
The ProPublica Nerd Blog

Upton: A Web Scraping Framework


Today we're announcing a new open-source project that aims to make web scraping simpler. It's named Upton, after labor journalist Upton Sinclair, because the project started as part of our intern investigation.

Upton is a web-scraping framework packaged as a RubyGem. It abstracts away some of the common parts of web scraping so that developers can concentrate on the unique parts of their project.

Web scraping is a common task in data journalism and news-application development. In every scraping project there are often at least two kinds of web pages we're interested in: One contains a bunch of links to other pages that contain data, and the other contains the data itself. Think about the relationship between a search result page on a job board and a job listing page itself, or the ProPublica homepage and ProPublica stories, or a page listing payments to doctors and a page describing a payment to a single doctor. Let's call these "index" pages and "instance" pages respectively. Upton understands this paradigm out of the box.
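The index/instance pattern can be boiled down to a few lines of plain Ruby. The sketch below is not Upton's API, just the underlying idea: fetch the index page, pull out the instance links, then fetch each instance page. The `fetch` lambda stands in for a real HTTP request, and a regex stands in for a proper HTML parser.

```ruby
# A minimal sketch of the index/instance pattern (not Upton's API).
# `fetch` is a stand-in for an HTTP request.
def scrape_index_then_instances(index_url, fetch)
  index_html = fetch.call(index_url)
  # A real scraper would use an HTML parser and a CSS selector here;
  # a regex over href attributes keeps the sketch dependency-free.
  instance_urls = index_html.scan(/href="([^"]+)"/).flatten
  instance_urls.map { |url| fetch.call(url) }
end

# Stub "web" for illustration:
pages = {
  "http://example.com/"  => '<a href="http://example.com/1">story</a>',
  "http://example.com/1" => "instance data"
}
scrape_index_then_instances("http://example.com/", ->(url) { pages[url] })
#=> ["instance data"]
```

Upton wraps this loop up for you, along with the niceties described below.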

Upton also helps you be a good citizen: it avoids hitting the site you're scraping with redundant requests for pages you've already downloaded. By default, Upton stashes a copy of every page it has seen. If you have to re-run your script repeatedly -- for instance, if your code crashes and you need to restart, or you're troubleshooting a bug and running the script as a test -- already-seen pages are loaded from Upton's file cache. Requests are also spaced 30 seconds apart by default.
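The stashing idea itself is simple. Here's a rough sketch of the behavior described above, written in plain Ruby -- this is an assumption about the general technique, not Upton's actual implementation: cache each fetched page on disk, keyed by a hash of its URL, and only call the fetcher on a cache miss.

```ruby
require "digest"
require "fileutils"

# Sketch of page stashing (the general technique, not Upton's internals):
# cache each fetched page on disk, keyed by a hash of its URL, and only
# call the fetcher (i.e., hit the network) on a cache miss.
def cached_fetch(url, cache_dir: "stashes", pause: 30, &fetch)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::MD5.hexdigest(url))
  return File.read(path) if File.exist?(path)

  body = fetch.call(url)
  File.write(path, body)
  sleep(pause) # space out real requests to be polite
  body
end
```

Re-running a script that uses a cache like this costs the remote site nothing for pages it has already served.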

It's super easy to scrape a site with Upton. For the most basic cases, all you need is a list of instance-page URLs (at least one) and a CSS selector or XPath expression targeting either a list of items or an HTML table.

u = Upton::Scraper.new([""])
u.scrape(&Upton::Utils.table("//table[2]", :xpath))
#=> [["Jeremy", "$8.00"], ["John Doe", "$15.00"]]

Upton can also scrape an index page for the list of instance pages, like this:

u = Upton::Scraper.new("", "a#story", :css)
u.scrape(&Upton::Utils.list("#comments li a.commenter-name", :css))
#=> ["Jeremy", "John Doe", "Scott Klein", "Jeremy"]

If you want to scrape directly into a CSV, Upton has a convenience method for that too. Just use the scrape_to_csv method instead of scrape and pass it the output filename.

u = Upton::Scraper.new("", "a#story", :css)
u.scrape_to_csv("output.csv", &Upton::Utils.list("#comments a.commenter-name", :css))
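Conceptually, the CSV step boils down to writing each scraped row out as a CSV line. Here's a sketch of that step in plain Ruby using the standard csv library -- this is the general idea, not Upton's internals:

```ruby
require "csv"

# Sketch of the scrape-to-CSV step (the general idea, not Upton's
# internals): write each scraped row as a CSV line. Scalar results
# become one-column rows.
def write_rows_to_csv(filename, rows)
  CSV.open(filename, "w") do |csv|
    rows.each { |row| csv << Array(row) }
  end
end
```

Calling write_rows_to_csv("output.csv", [["Jeremy", "$8.00"], ["John Doe", "$15.00"]]) would produce a two-row, two-column CSV file.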

Here's a real-life example: let's scrape the bylines from articles in the "INSIDE NYTIMES.COM" section of the New York Times front page, directly into a CSV.

u = Upton::Scraper.new("http://www.nytimes.com/", "ul.headlinesOnly a", :css)
u.scrape_to_csv("output.csv", &Upton::Utils.list("h6.byline", :css))

Unfortunately, not all websites are this simple.

Rather than using the included list and table Procs, you can write your own block to parse instance pages however you wish. In this example, a custom block finds the title and size of each instance page and returns them in a hash.

u = Upton::Scraper.new("", "a#story", :css)
u.scrape do |article_html, article_url|
  output = {}
  page = Nokogiri::HTML(article_html)
  article_title = page.css("h1#title").text
  article_size = page.css("div#article").text.size
  output[article_title] = article_size
  output
end
#=> {"Explainer" => 1000, "Feature" => 5000}

If the list of instance URLs is coming from an API, you can subclass Upton::Scraper and override its get_index method, after which the scrape method will work as expected.

class MyScraper < Upton::Scraper
  def get_index
    # Return an array of instance-page URLs, e.g. fetched from an API.
  end
end

MyScraper.new.scrape(&Upton::Utils.list("a#important", :css))

Or, if you have to log in to get past a paywall, you might override get_instance in a similar way. Upton includes other methods to deal with paginated indexes or instances.
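For a sense of what handling a paginated index involves, here's one common approach sketched in plain Ruby -- a hedged illustration of the general technique, not Upton's actual pagination API (check its documentation for that): request page 1, 2, 3, and so on until a page yields no more links, collecting instance URLs as you go.

```ruby
# One common way to walk a paginated index (a sketch of the general
# technique, not Upton's API): bump a page parameter until a page
# yields no more links, collecting instance URLs along the way.
def paginated_index_urls(base_url, fetch)
  urls = []
  page = 1
  loop do
    html = fetch.call("#{base_url}?page=#{page}") || ""
    links = html.scan(/href="([^"]+)"/).flatten
    break if links.empty?
    urls.concat(links)
    page += 1
  end
  urls
end
```

The same shape works for paginated instances: loop until the site signals there's nothing left.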

To start using Upton, just add it to your Gemfile. More detailed instructions are available in the documentation.

I hope you find that Upton makes scraping just a little easier.
