Chapter 8 Web Scraping
Most webpages are designed for humans to look at and read. Sometimes, however, we do not want to read the pages ourselves but to collect data from them instead. This is called web scraping. The challenge of web scraping is getting data out of pages that were never designed for this purpose.
8.1 Before you begin
Web scraping means extracting data from the “web”. However, the web is not just an anonymous internet “out there” but a conglomerate of servers and sites, built and maintained by individuals, businesses, and governments. Extracting data from it inevitably means using the resources and knowledge someone else has put into those websites, so we have to be careful from both a legal and an ethical perspective.
From the ethical side, you should try to minimize the problems you cause to the websites you are scraping. This involves the following steps:
- Limit the number of queries to the necessary minimum. For instance, when developing your code, download the webpage once and use the cached version for developing and debugging. Do not download more before you actually need more for further development. Do the full scrape only after the code has been tested well enough, and store the final results in a local file (see the sketch after this list for one way to combine caching with a polite request rate).
- Limit the frequency of queries to something the server can easily handle. For a small non-profit, consider sending only a handful of requests per minute, while a huge business like Google can easily handle thousands of requests per second (but it may recognize that you are scraping and block you).
- Consult the robots.txt file and understand what is and is not allowed. Do not download pages that the file does not allow you to scrape.
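A minimal sketch of the first two points, assuming the requests library is available; the URL, the cache file name, and the get_page helper below are made-up examples, not part of any particular scraping project.
import os
import time
import requests

def get_page(url, cache, delay=5):
    # Reuse the locally cached copy while developing and debugging,
    # so the server is contacted only once.
    if os.path.exists(cache):
        with open(cache, encoding="utf-8") as f:
            return f.read()
    time.sleep(delay)            # keep the request rate low when looping over pages
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(cache, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

# Hypothetical usage: only the first call actually hits the server
html = get_page("https://www.example.com", "example-cache.html")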
robots.txt is a text file with simple commands for web crawlers, describing what the robots should and should not do. The file can be obtained by appending robots.txt to the base URL of the website. For instance, the robots.txt for the web address https://ischool.uw.edu/events is at https://ischool.uw.edu/robots.txt, as the base URL is https://ischool.uw.edu. A robots.txt file may look like:
User-agent: *
Allow: /core/*.css$
Disallow: /drawer/
This means that all crawlers (user agent *) are allowed to read all files in /core/ whose names end with .css, e.g. https://ischool.uw.edu/core/main.css, but they are not allowed to read anything from /drawer/, e.g. https://ischool.uw.edu/drawer/schedule.html. There are various simple introductions to robots.txt, see for instance moz.com.
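Python’s standard library includes urllib.robotparser, which can read a robots.txt file and answer whether a given URL may be fetched. Below is a small sketch based on the example rules above; the user agent "*" stands for a generic crawler.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://ischool.uw.edu/robots.txt")
rp.read()  # download and parse the robots.txt file

# According to the example rules above, the first URL should be allowed
# and the second disallowed; the live file may of course contain more rules.
print(rp.can_fetch("*", "https://ischool.uw.edu/core/main.css"))
print(rp.can_fetch("*", "https://ischool.uw.edu/drawer/schedule.html"))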
A related issue is legality: you should only scrape websites and services where it is legal to do so. In recent years it has become more and more common for sites to explicitly ban scraping. For instance, allrecipes.com states in its Terms of Service that:
(e) you shall not use any manual or automated software, devices or
other processes (including but not limited to spiders, robots,
scrapers, crawlers, avatars, data mining tools or the like) to
"scrape" or download data from the Services ...
Some websites permit downloading for “personal non-commercial use”. GitHub states in its Acceptable Use Policies that
You may scrape the website for the following reasons:
* Researchers may scrape public, non-personal information from the
Service for research purposes, only if any publications resulting
from that research are open access.
* Archivists may scrape the Service for public data for archival purposes.
You may not scrape the Service for spamming purposes, including for
the purposes of selling User Personal Information (as defined in the
GitHub Privacy Statement), such as to recruiters, headhunters, and job
boards.
All use of data gathered through scraping must comply with the GitHub
Privacy Statement.
There is also a plethora of websites that do not mention downloading, robots, or scraping at all. Scraping such pages is a legal gray area. Other websites are mainly concerned with what happens to the scraped data. Feasting at Home states:
You may NOT republish my recipe(s). I want Feasting at Home to
remain the exclusive source for my recipes. Do not republish on
your blog, printed materials, website, email newsletter or even on
social media- always link back to the recipe.
Again, my recipes are copyrighted material and may not be
published elsewhere.
While the legal issues may feel like a nuisance to a technology enthusiast, web scraping touches genuine questions about property rights, privacy, and free-riding. After all, many website creators have put real effort and non-trivial resources into building and maintaining their sites. They may make the data available to browsers (not scrapers!) to support their business plan. Scrapers do not help with that plan, and in some cases may even forward the data to a competitor.
In certain cases scraping also raises questions of privacy. Scraping even somewhat personal data (say, public social media profiles) for a large number of people and connecting it with other resources may be a privacy violation. If you do this for research purposes, you should store the scraped data in a secure location and not attempt to identify the real people in the data!
In conclusion, before starting your first scraping project, you should answer these questions:
- Is it ethical to download and use the data for the purpose I have in mind?
- What can I do to minimize burden to the service providers?
- Is it legal?
- How should I store and use data?
8.2 HTML Basics
This section introduces the basics of HTML. If you are already familiar with HTML, you can safely skip forward to the Beautiful Soup section.
HTML (hypertext markup language) is a way to write text using tags and attributes that mark the text structure. HTML is the standard language of the internet: by far most webpages are written in HTML, and hence one needs to understand some of it in order to process the web. Below we briefly discuss the most important structural elements from the web scraping point of view.
8.2.2 Overall structure
A valid html document contains the doctype declaration, followed by the <html> tag (see the example below). Everything that is important from the scraping perspective is embedded in the html-element. The html-element in turn contains two elements: head and body. Head contains various header information, including the page title (note: this is not the title that is displayed on the page), stylesheets, and other general information. Body contains all the text and other visual elements that the browser actually displays. So a minimalistic html-file might look like:
<!DOCTYPE html>
<html>
<head>
<title>Mad Monk Ji Gong</title>
</head>
<body>
<h1>Ji Visits Buddha</h1>
<p>Ji Gong went to visit Buddha</p>
</body>
</html>
This tiny file demonstrates all the structure we have discussed so far:
- The first declaration in the file is DOCTYPE entry.
- All the content is embedded in the html-element.
- The html-element contains head and body elements.
- head includes the element title, the title of the webpage. This is not what you see on the page, but the browser may show it in the window title bar, and may use it as the file name when you download and save the page.
- The body-element contains two elements: h1, the top-level title that is actually rendered on screen (typically big and bold), and p, a paragraph of text.
Because HTML elements are nested inside each other, we can depict an HTML page as a tree. The html element is the trunk that branches into two, head and body; and body in turn branches into other elements. Thinking about the page as a tree is extremely useful when designing code to navigate it.
8.3 Beautiful Soup
This section discusses the basics of web scraping using the Beautiful Soup (BS) library, the most popular web scraping library in Python. In essence it is an HTML parser with added functionality to search for tags, classes, and attributes, and to move up and down the HTML tree. These notes contain only a very basic introduction; the complete documentation is available at crummy.com.
Normally one scrapes the web by downloading the pages, using for instance the requests library, and thereafter parsing them with BS, as the latter is devoted to parsing in-memory html pages.
8.3.1 Example html file
In the following we demonstrate the usage of the library on an example page:
<!DOCTYPE html>
<html>
<head>
<title>Mad Monk Ji Gong</title>
</head>
<body>
<h1>Li Visits Buddha</h1>
<p>This happened
during <strong>Guang Liang</strong> of the
<a href="https://en.wikipedia.org/wiki/Song_dynasty">Song Dynasty</a>...</p>
The patchwork robe made for <h2>Li Begs for a Son</h2>
<p class="quote">When I wa strolling in the street,
<footnote>They lived in Linan</footnote> almost
everyone was calling me<span class="nickname">Virtuous Li</span> ...</p>
</body>
</html>
This file includes a number of common html tags and attributes and a suitably rich structure for explaining the basics of Beautiful Soup. If you want to follow the examples, then you may copy the file from here and save it as “scrape-example.html”, or download it directly from the repo.
8.3.2 Loading Beautiful Soup and opening the data
Beautiful Soup is located in the bs4 module, so normally one loads it as
from bs4 import BeautifulSoup
The first task after importing the module is to load the html page. When scraping data from the web, we can do it using the requests module:
import requests
htmlPage = requests.get("https://www.example.com").text
However, for now we work with a local file so we have to load the file instead:
= open("../files/scrape-example.html").read() htmlPage
This loads the example file into the string variable htmlPage. If we print the string, we will see the original html text. Next, we parse it with BS:
soup = BeautifulSoup(htmlPage, "html.parser")
type(soup)
## <class 'bs4.BeautifulSoup'>
This results in a parsed object of class bs4.BeautifulSoup. The "html.parser" argument tells BS to parse the object as html (not as xml or something else). If we leave the parser unspecified in the call, Beautiful Soup can usually figure it out, but this may occasionally create issues, so it is better to always specify the desired parser.
We are done with parsing! What is left is to navigate the resulting parsed structure and extract the elements of interest.
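As a small taste of such navigation, the sketch below pulls a few elements out of the example page parsed above. It uses only standard Beautiful Soup methods (attribute access, find, find_all) and is meant as an illustration, not a complete treatment.
print(soup.title.text)                  # the page title from the head: "Mad Monk Ji Gong"
print(soup.find("h1").text)             # the first (and only) h1 heading
quote = soup.find("p", class_="quote")  # the paragraph with class "quote"
print(quote.get_text())                 # its text with all inner tags stripped
link = soup.find("a")                   # the first link on the page ...
print(link["href"])                     # ... and its href attribute
for p in soup.find_all("p"):            # loop over all paragraphs
    print(p.get_text()[:40])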
8.4 Finding elements on webpage
All the examples above were done for a very simple webpage, but modern websites are often much more complex. How can one locate the elements of interest in the source?
For simple webpages one can just look at the page source in the browser, or open the html document in a text editor. The source can then easily be translated into a parse-tree navigation algorithm. But modern pages are often very complicated and hard to understand by just consulting the source.
For complex pages, the easiest approach is to use the browser’s developer tools. Modern browsers, such as Firefox and Chrome, contain web developer tools. These can be invoked with Ctrl-Shift-I in both Firefox and Chromium on Linux and Windows, and Cmd-Option-I (⌘ - ⌥ - I) on Mac. Besides their use for web development, these tools are an excellent means to locate elements of interest on the page.
A particularly useful tool is the element picker (present in both Firefox and Chromium), which highlights the element in the html source when you point at it on the rendered webpage. In the figure at right one can see that the menu elements are contained in “a” tags with class “gutter_item”. If we are interested in scraping the menu, we may try to locate such elements in the parsed html tree.
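For instance, assuming the page from the figure has been downloaded and parsed into a soup object, the menu links could be collected roughly as follows; the tag name "a" and the class "gutter_item" come from the element picker above and will differ on other websites.
menu_links = soup.find_all("a", class_="gutter_item")  # all menu entries
for link in menu_links:
    # print the visible label and the target of each menu link
    print(link.get_text(strip=True), link.get("href"))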
However, this approach only works if the browser and the scraper actually download the same page. If the browser shows you a javascript-enabled version targeted at your powerful new browser, while the scraper gets a different version meant for non-javascript crawlers, one has to dive deeper into the developer tools and fool the website into sending the crawler version to the browser too.
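A simple way to check what the scraper actually receives is to save the downloaded html to a file and open that file in the browser. The sketch below does this; the file name and the User-Agent string are arbitrary examples, and setting the User-Agent is optional but lets the server know who is crawling it.
import requests

headers = {"User-Agent": "my-little-scraper/0.1"}  # example identification string
response = requests.get("https://www.example.com", headers=headers)

# Save the page as the scraper sees it; open this file in the browser
# and compare it with what the browser shows when visiting the page itself.
with open("scraper-view.html", "w", encoding="utf-8") as f:
    f.write(response.text)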
Finally, one can always just download the page, parse its html, and walk through the various elements in search of the data in question. But this is often tedious, even for small pages.