Chapter 10 Web Scraping
Most webpages are designed for humans to look at and read. Sometimes, however, we do not want to look and read but to collect the data from the pages instead. This is called web scraping. The challenge with web scraping is getting the data out of pages that were not designed for this purpose.
10.1 Before you begin
Web scraping means extracting data from the “web”. However, the web is not just an anonymous internet “out there” but a conglomerate of servers and sites, built and maintained by individuals, businesses, and governments. Extracting data from it inevitably means using the resources and knowledge that someone else has put into these websites. So we have to be careful from both a legal and an ethical perspective.
From the ethical side, you should try to minimize the problems you cause to the websites you are scraping. This involves the following steps:
limit the number of queries to the necessary minimum. For instance, when developing your code, download the webpage once and use the cached version for developing and debugging. Do not download more before you actually need more for further development. Do the full scrape only after the code has been tested well enough, and store the final results in a local file (see the sketch after this list).
limit the frequency of queries to something the server can easily handle. For a small non-profit, consider sending only a handful of requests per minute; a huge business like Google can easily handle thousands of requests per second (but it may recognize that you are scraping and block you).
consult the robots.txt file and understand what is allowed and what is not. Do not download pages that the file does not allow you to scrape.
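A minimal sketch of the first two points, using only base R (the url and file names here are made up for illustration):
# download the page only once and re-use the cached copy afterwards
url <- "https://www.example.com/data-page.html"   # hypothetical page
cache <- "data-page.html"                         # local copy of the page
if (!file.exists(cache)) {
  Sys.sleep(5)              # if downloading several pages, pause between requests
  download.file(url, cache)
}
# from here on, develop and debug using the local file 'cache' only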
robots.txt is a text file with simple commands for web crawlers, describing what the robots should and should not do. The file can be obtained by appending robots.txt to the base url of the website. For instance, the robots.txt for the web address https://ischool.uw.edu/events is at https://ischool.uw.edu/robots.txt, as the base url is https://ischool.uw.edu. A robots.txt file may look like:
User-agent: *
Allow: /core/*.css$
Disallow: /drawer/
This means that all crawlers (user agent *) are allowed to read all files under /core/ whose names end with .css (the $ marks the end of the address), e.g. https://ischool.uw.edu/core/main.css. But they are not allowed to read anything from /drawer/, e.g. https://ischool.uw.edu/drawer/schedule.html. There are various simple introductions to robots.txt, see for instance moz.com.
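In R you can quickly inspect a site's robots.txt, for instance with the base readLines function; the dedicated robotstxt package can also answer allow/disallow questions, although we do not use it in this chapter. A small sketch:
# print the crawler rules that the iSchool website publishes
readLines("https://ischool.uw.edu/robots.txt")
# alternatively, with the robotstxt package:
# robotstxt::paths_allowed("/events", domain = "ischool.uw.edu")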
A related issue is legality. You should only scrape websites and services where it is legal to do so, and in recent years it has become more and more common for sites to explicitly ban scraping in their terms of service. For instance, allrecipes.com states in its Terms of Service that:
(e) you shall not use any manual or automated software, devices or
other processes (including but not limited to spiders, robots,
scrapers, crawlers, avatars, data mining tools or the like) to
"scrape" or download data from the Services ...
Some websites permit downloading for “personal non-commercial use”. GitHub states in its Acceptable Use Policies that
You may scrape the website for the following reasons:
* Researchers may scrape public, non-personal information from the
Service for research purposes, only if any publications resulting
from that research are open access.
* Archivists may scrape the Service for public data for archival purposes.
You may not scrape the Service for spamming purposes, including for
the purposes of selling User Personal Information (as defined in the
GitHub Privacy Statement), such as to recruiters, headhunters, and job
boards.
All use of data gathered through scraping must comply with the GitHub
Privacy Statement.
There is also a plethora of websites that do not mention downloading, robots, or scraping at all. Scraping such pages is a legal gray area. Other websites are concerned with what happens to the scraped data. Feasting at Home states:
You may NOT republish my recipe(s). I want Feasting at Home to
remain the exclusive source for my recipes. Do not republish on
your blog, printed materials, website, email newsletter or even on
social media- always link back to the recipe.
Again, my recipes are copyrighted material and may not be
published elsewhere.
While the legal issues may feel like a nuisance for a technology enthusiast, web scraping touches genuine questions about property rights, privacy, and free-riding. After all, many website creators have put real effort and non-trivial resources into building and maintaining their websites. They may make the data available to browsers (not scrapers!) to support their business plan. Scrapers do not help with that business plan, and in some cases may even forward the data to a competitor.
In certain cases scraping also raises questions of privacy. Scraping even somewhat personal data (say, public social media profiles) for a large number of people and connecting it with other resources may be a privacy violation. If you do this for research purposes, you should store the scraped data in a secure location and not attempt to identify the real people in the data!
In conclusion, before starting your first scraping project, you should answer these questions:
- Is it ethical to download and use the data for the purpose I have in mind?
- What can I do to minimize the burden on the service providers?
- Is it legal?
- How should I store and use the data?
10.2 HTML Basics
This section introduces the basics of HTML. If you are familiar with HTML, you can safely skip forward to the rvest section.
HTML (HyperText Markup Language) is a way to write text using tags and attributes that mark the text structure. HTML is the standard language of the web: by far most webpages are written in HTML, and hence one needs to understand some HTML in order to be able to process the web. Below we briefly discuss the most important structural elements from the web scraping point of view.
10.2.2 Overall structure
A valid html document contains the doctype declaration, followed by the <html> tag (see the example below). Everything that is important from the scraping perspective is embedded in the html element. The html element in turn contains two elements: head and body. Head contains various header information, including the page title, stylesheets, and other general information. Body contains all the text and other visual elements that the browser actually displays. So a minimalistic html file might look like:
<!DOCTYPE html>
<html>
<head>
<title>Mad Monk Ji Gong</title>
</head>
<body>
<h1>Ji Visits Buddha</h1>
<p>Ji Gong went to visit Buddha</p>
</body>
</html>
This tiny file demonstrates all the structure we have discussed so far:
- The first declaration in the file is the DOCTYPE entry.
- All the content is embedded in the html-element.
- The html-element contains head and body elements.
- head includes the element title, the title of the webpage. This is not what you see on the page, but the browser may show it in the window or tab title bar, and it may use it as the file name when you download and save the page.
- The standard structure ends with body. It is the most important part of the page: almost all of the page content is here. How exactly the body is set up differs between webpages; there is no standard structure. In this example the body element contains two elements: h1, the top title that is actually rendered on screen (typically big and bold), and a paragraph of text.
Because HTML elements are nested inside each other, we can depict an HTML page as a tree. The html element is the trunk that branches into two, head and body; and body in turn branches into other elements. Thinking about the page as a tree is an extremely useful habit when designing code to navigate it.
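For the minimalistic file above, the tree looks like this:
html
├── head
│   └── title
└── body
    ├── h1
    └── p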
10.2.4 HTML tables
A lot of data on HTML pages is in the form of tables, marked with the tag table. Web scrapers may also have special functionality to automatically convert HTML tables to data frames. Basic tables are fairly simple: the table contains a thead element with the column names and a tbody element with the values. Each row in the table is a tr element, each header cell (column name) is a th element, and each value cell is a td element. Here is a simple table for a course grading scheme:
<table>
<thead>
<tr>
<th>Task</th>
<th>How many</th>
<th>pt each</th>
</tr>
</thead>
<tbody>
<tr>
<td>Assignments</td>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td>Labs</td>
<td>8</td>
<td>1</td>
</tr>
</tbody>
</table>
This table contains three columns and three rows, one of which is the header. In rendered form it looks something like this (but that depends on the exact layout formatting):
| Task | How many | pt each |
|---|---|---|
| Assignments | 8 | 9 |
| Labs | 8 | 1 |
The default formatting is fairly unimpressive; more advanced webpages typically add custom formatting using CSS styles.
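As a preview of the rvest tools introduced below, here is a sketch of how such a table can be turned into a data frame (read_html and html_table are discussed in the following sections):
library(rvest)
# parse the table markup directly from a string and convert it to a data frame
grades <- read_html("<table>
  <thead><tr> <th>Task</th> <th>How many</th> <th>pt each</th> </tr></thead>
  <tbody><tr> <td>Assignments</td> <td>8</td> <td>9</td> </tr>
         <tr> <td>Labs</td> <td>8</td> <td>1</td> </tr></tbody>
</table>")
grades %>% html_element("table") %>% html_table()
# returns a small data frame (tibble) with columns Task, 'How many', 'pt each'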
This basic introduction is enough to help you understand webpages from the scraping perspective. There are many good sources for further information; you may consider W3Schools for HTML and CSS tutorials, and W3C for the detailed HTML specifications.
10.3 Web scraping in R and the rvest package
rvest is probably the most popular package for web scraping in R. It is a powerful and easy-to-use package, but unfortunately its documentation is rather sparse. In essence it is an HTML parser with added functionality to search for tags, classes, and attributes, and to move up and down in the HTML tree. Normally one scrapes the web by downloading pages from the internet (one can use the read_html function) and thereafter parsing them with rvest. The most complex step is to navigate the parsed HTML structure to locate the correct elements.
The rvest library follows the common tidyverse style and works well with magrittr pipes. We load it with
library(rvest)
rvest relies heavily on another package, xml2, and although the main functionality is accessible without explicitly loading the xml2 package, we get some additional tools when we load it as well. See removing elements below.
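For instance, xml2 offers xml_remove(), which deletes nodes from the parsed tree. A tiny sketch (page is the parsed document created in the next section):
library(xml2)
# drop all <footnote> elements from the parsed page before extracting its text
page %>% html_elements("footnote") %>% xml_remove()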
10.3.1 Example html file
In this section we demonstrate the usage of the library on the following HTML file:
<!DOCTYPE html>
<html>
<head>
<title>Mad Monk Ji Gong</title>
</head>
<body>
<h1>Li Visits Buddha</h1>
<p>This happened
during <a href="https://en.wikipedia.org/wiki/Song_dynasty">Song Dynasty</a>.
The patchwork robe made for <strong>Guang Liang</strong>...</p>
<h2>Li Begs for a Son</h2>
<p class="quote">When I wa strolling in the street,
<footnote>They lived in Linan</footnote> almost
everyone was calling me<span class="nickname">Virtuous Li</span> ...</p>
<h2>Dynasties</h2>
<div style="align:center;">
<table>
<thead>
<tr>
<th>Dynasty</th>
<th>years</th>
</tr>
</thead>
<tbody>
<tr>
<td>Song</td>
<td>960--1279</td>
</tr>
<tr>
<td>Yuan</td>
<td>1271--1368</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
This file includes a number of common html tags and attributes and a suitably rich structure for explaining the basics of web scraping. If you want to follow the examples, then you may copy the file from here and save it as “scrape-example.html”, or download it directly from the repo.
10.3.2 First part: download the webpage
A more serious web scraping project typically begins with finding the correct page and understanding its structure, as the structure may not be obvious for more complex websites. However, we leave this task for later, see Finding elements on webpage below. Instead, we immediately get our hands dirty and download the page.
Normally we do this using the read_html function as
page <- read_html("https://www.example.com")
But this time we extract data from a local file, so we use the file name, not a webpage url. read_html can read various inputs, including webpages and files; the input is assumed to be a webpage if it begins with “http://” or a similar protocol marker. (Note: read_html is not part of rvest but of the xml2 library, which is automatically loaded when you load rvest.)
page <- read_html("../files/scrape-example.html")
This loads the example file into the variable page. Now we are done with both downloading and parsing (or rather loading and parsing, as we just loaded a local file) and we do not need the internet any more. read_html also strips the page of its metadata and only returns the html part. The result can be printed, but it is not designed for humans to read:
page
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n\t<h1>Li Visits Buddha</h1>\n\t<p>This happened\n\t during <a hr ...
The resulting variable page is of class xml_node. It is structured as a tree, and all our following tasks are about navigating this tree and extracting the right information from it.
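To give a first flavor of this navigation, here is a small sketch that pulls a few pieces out of the example page using css selectors (these functions are discussed in more detail below):
page %>% html_element("h1") %>% html_text2()      # "Li Visits Buddha"
page %>% html_elements("h2") %>% html_text2()     # both second-level titles
page %>% html_element("a") %>% html_attr("href")  # the wikipedia link
page %>% html_element("table") %>% html_table()   # the dynasties table as a data frame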
10.4 Finding elements on webpage
All the examples above were done using a very simple webpage, but modern websites are often much more complex. How can one locate the elements of interest in the source?
For simple webpages one can just look at the page source in the browser, or open the html document in a text editor. If you understand the source, you can translate it into a way to navigate and parse the html tree. But modern pages are often very complicated and hard to understand by just consulting the source.
For complex pages, the easiest approach is to use the web browser's developer tools. Modern browsers, such as Firefox and Chrome, contain web developer tools. These can be invoked with Ctrl-Shift-I in both Firefox and Chromium (Linux and Windows) or Cmd-Option-I (⌘-⌥-I) on Mac6. These tools are an excellent means to locate elements of interest on the page.
A particularly useful tool is the element picker (available in both Firefox and Chromium) that lets you point at html elements in the browser window and highlights the corresponding lines in the html source. In the figure at right one can see that the menu links are contained in an a element with class “gutter_item”. If we are interested in scraping the menu, we may try to locate such elements in the parsed html tree.
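If so, one might collect the menu text along these lines (a sketch; the url is a placeholder for whatever page the figure shows, and gutter_item is just the class name seen there):
menu_page <- read_html("https://www.example.com")
# text of all links whose class is 'gutter_item'
menu_page %>% html_elements("a.gutter_item") %>% html_text2()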
However, this approach only works if the browser and the scraper actually download the same page. If the browser shows you the javascript-enabled version targeted at your powerful new browser, but the scraper gets a different version meant for non-javascript crawlers, one has to dive deeper into the developer tools and fool the website into sending the crawler version to the browser as well.
Finally, one can always just download the page, parse its html, and walk through the various elements in search of the data in question. But this is often tedious even for small pages.
It works on Safari as well, but you have to explicitly allow developer tools in the preferences.