About web scraping...



In this module, we will talk about a very popular data collection method -- especially when it comes to data science side projects -- and that's web scraping. As I mentioned in the intro video, I have already added web scraping tutorial articles to the public data36 blog, but in this module I want to show you another example of web scraping in video format, as well.

Before we start with that, I want to talk about web scraping in general. Similarly to APIs, I feel that web scraping is a bit over-mystified right now -- but in fact, once you get your Python script to work, you'll see that it takes neither much code nor much effort.

So here are 6 things to consider before you get into web scraping:

#1 Web scraping and APIs

APIs and web scraping are quite similar to each other, especially if you are using Python: you even use the same requests library and its functions. The general logic behind the two is practically the same, too: you want to query a structured data source that is available at a public URL. In the case of APIs, the data comes back as JSON -- while in the case of websites, it comes back as HTML.
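
To make the JSON-versus-HTML difference concrete, here is a minimal sketch. The two payloads are hypothetical stand-ins for what a server might return, and I parse them with Python's standard library so the example runs offline. In a real script the text would come from a requests call instead.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads standing in for what a server might return.
# In a real script the text would come from something like
# requests.get(url).text (not run here, so the example works offline).
API_RESPONSE = '{"city": "Budapest", "temperature_c": 21.5}'
WEB_PAGE = "<html><body><h1>Budapest</h1><p class='temp'>21.5</p></body></html>"


class TextCollector(HTMLParser):
    """Collects the text found inside each tag, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self._open_tags = []
        self.text_by_tag = {}

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        if self._open_tags and data.strip():
            self.text_by_tag.setdefault(self._open_tags[-1], []).append(data.strip())


# API case: the JSON text maps directly onto a Python dictionary.
api_data = json.loads(API_RESPONSE)

# Web scraping case: the same values have to be dug out of HTML tags.
collector = TextCollector()
collector.feed(WEB_PAGE)
page_data = {
    "city": collector.text_by_tag["h1"][0],
    "temperature_c": float(collector.text_by_tag["p"][0]),
}

print(api_data == page_data)  # prints True
```

Both routes end up with the same dictionary. The API route just gets there in one line, while the HTML route needs to know which tags hold which values.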

Actually, before you get started with web scraping, I really recommend that you go through the API module first, because everything you learn there will make your web scraping skills better, too.

#2 Is there an API for it?

Speaking of APIs!

Sometimes web scraping can get tricky and come with headaches... you can avoid that if you find an API for the data you want to get. Take the OpenWeather API as an example. You could scrape weather data from various pages and try to extract it with a lot of HTML tricks. If you do so, you can only hope that the website owners won't change the website structure and break your scripts... Or you can use the OpenWeather API, which gives you a well-structured, reliable way to gather weather data with a few lines of code. The rule of thumb is this: when there is an API for the thing you need, don't use web scraping.
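
Here is a rough sketch of what "a few lines of code" could look like. Treat the details as assumptions: the endpoint URL, the parameter names and the shape of the response are written from memory of the OpenWeather "current weather" API and should be checked against the official documentation. The API key is a placeholder, and the sample response is a hypothetical, heavily trimmed one. No network call is made.

```python
import json
from urllib.parse import urlencode

# Assumption: this is the "current weather" endpoint; verify it against
# the OpenWeather documentation before relying on it.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Return the request URL for one city (no network call is made)."""
    query = urlencode({"q": city, "appid": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"


def extract_temperature(response_text):
    """Pull the temperature out of a JSON response body."""
    payload = json.loads(response_text)
    return payload["main"]["temp"]


# Hypothetical, heavily trimmed response body used for illustration only.
SAMPLE_RESPONSE = '{"name": "London", "main": {"temp": 14.2}}'

url = build_weather_url("London", "YOUR_API_KEY")
print(url)
print(extract_temperature(SAMPLE_RESPONSE))
```

The point is that the temperature is one dictionary lookup away, and that lookup doesn't depend on how the website happens to be laid out.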

#3 Is it blocked? Is it legal?

Before you scrape a website, you have to check whether it is legal to scrape that website. Generally speaking, web scraping is not regulated and there are not many precedents. So if you scrape a website for a hobby project that you don't monetize, there is not too much legal risk involved.

With the disclaimer that I'm not a legal professional and this is not legal advice, I'd say that one shouldn't worry about scraping websites from a legal standpoint. Just in case, I attached a more detailed article about the legal aspects of web scraping to this video, as well.

But I want to mention two things:

First, websites that don't want to let you scrape their content usually block web scraping scripts. If you see that your web scraping scripts are blocked, my recommendation is to leave that site alone and go for another website to scrape.

The other thing is that most websites have a Terms of Use page. You might want to check that page, and if it explicitly says that they don't want you to scrape their website, again, just leave that website alone and scrape something else.
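
For the first point, one simple way to notice that a script is being blocked is to look at the HTTP status code of the response. This is only a sketch: the set of status codes below is my assumption about what commonly signals blocking (403 Forbidden, 429 Too Many Requests), and the function name is made up for this example. Real sites may signal blocking in other ways, too.

```python
# Assumption: these status codes are treated as "the site does not want
# this script here". Real sites may signal blocking in other ways too.
BLOCKING_STATUS_CODES = {403, 429}


def should_leave_site_alone(status_code):
    """Return True when the response status suggests the scraper is blocked."""
    return status_code in BLOCKING_STATUS_CODES


# In a real script the number would come from requests.get(url).status_code.
for code in (200, 403, 429):
    print(code, should_leave_site_alone(code))
```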

#4 Is it simple html?

Most websites serve plain HTML -- sort of the gold standard of the internet.

But some websites make it more complex with JavaScript-based interactive content and other goodies. In theory, you can scrape everything, but until you become a black-belt web scraping guru, I recommend that you go for simple HTML websites. Interactive, dynamic websites usually come with a lot of added complexity when you want to scrape them, so just skip these and search for simpler HTML sites.
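
To show why the JavaScript-based pages are harder, here is a small sketch with two hypothetical pages. The first one ships its data in the HTML itself. The second one ships an empty container that a script would fill in later in the browser. The page contents and the helper names are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical pages. The first ships its data in the HTML itself; the
# second ships an empty container that a script fills in later.
STATIC_PAGE = "<html><body><div id='prices'><p>42 USD</p></div></body></html>"
DYNAMIC_PAGE = (
    "<html><body><div id='prices'></div>"
    "<script src='app.js'></script></body></html>"
)


class VisibleTextParser(HTMLParser):
    """Collects text that appears outside of <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of text chunks found in the raw HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    return parser.chunks


print(visible_text(STATIC_PAGE))   # the data is already in the HTML
print(visible_text(DYNAMIC_PAGE))  # nothing to scrape without running JavaScript
```

If the second case is what you get back for a page, that's a good sign it's one of the "skip it for now" websites.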

#5 Will the website change?

If you want to automate your web scraping project and, for some reason, you want to query the web page every day or every week in the future, you should consider that it will come with some maintenance work. The most common issue is that websites change every now and then -- and when the website's structure changes, your web scraping script will probably fail. That means that you'll have to update it. You can't really prepare for this, with one exception: you can pick big, traditional websites that rarely change (or you can use APIs instead, as I mentioned earlier). For example, I picked Wikipedia as the demo project for this module because I expect that its structure will remain the same for at least the next 10 years. I hope I'll be right. You should think similarly when you build your project on top of your web scraping scripts.
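
You can't prevent a website from changing, but you can at least make your script fail loudly when it does. Here is a minimal sketch of that idea. The exception, the parser and the two sample pages are hypothetical names and data that I made up for this example.

```python
from html.parser import HTMLParser


class PageStructureChanged(Exception):
    """Raised when the element the scraper relies on is no longer there."""


class TitleParser(HTMLParser):
    """Grabs the text of the first <h1> tag."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()


def extract_title(html):
    """Return the page title, or raise if the expected tag is missing."""
    parser = TitleParser()
    parser.feed(html)
    if not parser.title:
        raise PageStructureChanged("no <h1> found; the page layout may have changed")
    return parser.title


# Hypothetical before/after versions of the same page.
OLD_LAYOUT = "<html><body><h1>Web scraping</h1></body></html>"
NEW_LAYOUT = "<html><body><h2>Web scraping</h2></body></html>"

print(extract_title(OLD_LAYOUT))
try:
    extract_title(NEW_LAYOUT)
except PageStructureChanged as error:
    print("Time to update the script:", error)
```

A clear error like this is much easier to deal with than a script that silently keeps collecting empty values for weeks.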

#6 Web scraping projects almost always need some creative problem solving.

Well, the scraping itself is usually a standard process. You'll see. But once you scrape the HTML content itself, you'll almost always see some sort of tricky thing in it. An unusual web page structure, a dynamic parameter in the web page URLs, a missing data point, an inconsistent site structure… And as usual in data science, you'll have to do some sort of creative problem solving. There's no best practice for how to do that, as it differs from case to case -- well, that's the beauty of working with real-life data, right? I just wanted to mention this so it doesn't hit you unexpectedly!
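
Just to give you a feeling for two of these, here is a small sketch of a dynamic parameter in the URLs and of a missing data point. Everything in it is hypothetical: the base URL, the "page" parameter, the field names and the sample rows are made up for illustration, and your real case will almost certainly look different.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter name, used only for illustration.
BASE_URL = "https://example.com/articles"


def page_urls(first_page, last_page):
    """Build one URL per page by varying the dynamic 'page' parameter."""
    return [
        f"{BASE_URL}?{urlencode({'page': number})}"
        for number in range(first_page, last_page + 1)
    ]


def clean_record(raw):
    """Normalize one scraped record; missing fields become None."""
    views = raw.get("views")
    return {
        "title": (raw.get("title") or "").strip() or None,
        "views": int(views.replace(",", "")) if views else None,
    }


# Hypothetical scraped rows: the second one lacks a view count.
scraped = [
    {"title": " First post ", "views": "1,204"},
    {"title": "Second post"},
]

print(page_urls(1, 3))
print([clean_record(row) for row in scraped])
```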

Okay, that's enough for a short introduction to web scraping -- in the next lectures, I'll link all the tutorials you need and I'll show you a demo web scraping project that I've done with Wikipedia.
