Scraping for Journalism: A Guide for Collecting Data
Photo by Dan Nguyen/ProPublica
Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data.
Most of the techniques are within the ability of the moderately experienced programmer. The most difficult-to-scrape site was actually a previous Adobe Flash incarnation of Eli Lilly’s disclosure site. Lilly has since released their data in PDF format.
These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you’re already an experienced programmer, you might learn about a new library or tool you haven’t tried yet.
If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out about what it takes to gather data by scraping web sites -- so you know what you’re asking for if you end up hiring someone to do the technical work for you.
The tools
With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.
Google Refine (formerly known as Freebase Gridworks) – A sophisticated application that makes data cleaning a snap.
Firebug – A Firefox plug-in that adds a host of useful development tools, including the tracking of parameters and files received from web sites you plan to scrape.
Ruby – The programming language we use the most at ProPublica.
Nokogiri – An essential Ruby library for scraping web pages.
Tesseract – Google’s optical character recognition (OCR) tool, useful for turning scanned text into “real,” interpretable text.
Adobe Acrobat – Can (sometimes) convert PDFs to well-structured HTML.
The guides assume some familiarity with your operating system's command line (Windows, Mac).
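To give a flavor of the cleanup that a tool like Google Refine automates, here is a minimal Ruby sketch that groups name variants differing only in case, punctuation, or word order. It is an illustration only, not Refine's own code: the `fingerprint` and `cluster` helpers and the sample names are hypothetical and use nothing beyond Ruby's standard features.

```ruby
# Reduce a string to a normalized key so that near-duplicates collide.
# Hypothetical helper for illustration; real data usually needs more rules.
def fingerprint(value)
  value.strip.downcase
       .gsub(/[^a-z0-9\s]/, '') # drop punctuation
       .split                   # break into words on whitespace
       .uniq.sort               # ignore repeated words and word order
       .join(' ')
end

# Return only the groups that contain more than one distinct spelling.
def cluster(values)
  values.group_by { |v| fingerprint(v) }
        .values
        .select { |group| group.uniq.size > 1 }
end

# Made-up sample data, not taken from any real disclosure list.
names = ['Smith, John', 'John Smith', 'JOHN SMITH ', 'Jane Doe', 'Doe, Jane M.']

cluster(names).each { |group| puts group.inspect }
```

Each printed group is a set of spellings that probably refer to the same person and deserve a manual look before merging. Note that 'Doe, Jane M.' is not grouped with 'Jane Doe', because the middle initial changes the key.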
A Guide to the Guides
- Introduction: The Coder's Cause – Public records gathering as a programming challenge.
- Using Google Refine to Clean Messy Data – Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.
- Reading Data from Flash Sites – Use Firefox's Firebug plugin to discover and capture raw data sent to your browser.
- Parsing PDFs – Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.
- Scraping HTML – Write Ruby code to traverse a website and copy the data you need.
- Getting Text Out of an Image-only PDF – Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).

17 comments
Gavin
Jan. 3, 2011, 5:53 p.m.
Don’t forget scraperwiki.org :)
Josh
Jan. 3, 2011, 11:57 p.m.
Commercial software like Kapow is a good choice for non-programmers, if they have the scratch.
Earl F Glynn
Jan. 5, 2011, 8:57 p.m.
I find the R language (http://www.r-project.org/) useful for data scraping using the XML and RCurl packages, and it’s the best for data analysis, statistics, and chart creation. R is free but some programming experience is likely needed.
dominic
Jan. 6, 2011, 1:32 a.m.
Do you have any of this code on a github or other open source project?
Dan Nguyen
Jan. 6, 2011, 5:04 a.m.
Hi Dominic, each lesson has the relevant code in its entirety (usually near the end of the tutorial).
Kim Rees
Jan. 18, 2011, 1:08 a.m.
No mention of Needlebase? It’s the best tool out there for scraping data.
Kim Rees
Jan. 18, 2011, 1:09 a.m.
Oh, and for scraping PDFs, you definitely need Able2Extract.
Juan
Jan. 19, 2011, 6:11 a.m.
For scraping complex logging-in websites and such, I recommend Fiddler for Windows (a tool from Microsoft engineers that integrates with IE or Firefox), far better than ‘Tamper data’ for Firefox (as an alternative too).
Dan Nguyen
ProPublica
Jan. 20, 2011, 8:42 p.m.
Thanks for all the suggestions, folks. I hope to write a follow-up using some of the more ready-out-of-the-box scraping solutions… I understand not everyone has the time/need to put together their own, though learning some of the coding concepts will always be helpful.
Jayakrishnan G
Jan. 21, 2011, 7:11 a.m.
You can use Zoho Sheet’s “Link External Data” feature to scrape data from external sources such as public web pages, RSS feeds and .csv data and subsequently analyze them in a spreadsheet.
Steve Figler
Jan. 28, 2011, 10:21 p.m.
It does my heart good to know you folks are out there doing what you do to bring transparency to the opaque world of modern politics and the dark, voracious lords of corporate demonhood.
They are so far ahead of our increasingly ineffectual leaders that I had begun to give up hope. Please keep up all the scraping and prying and revealing. Perhaps your mantra should be “Sunshine is the best disinfectant.”
Wayne Hynd
Feb. 20, 2011, 2 a.m.
Dan, it would be really great if you follow-up with the ready-out-of-the-box scraping solutions. I’m sure there are many of us who have useful contributions to make and good ideas, but cannot spend a year or two learning programming…and then really be just newbie amateurs! Much better to leave this to the pros and use our strengths elsewhere. Thanks a lot for bringing this technology to our attention!
Nicholas
May 6, 2011, 4:38 p.m.
A good book on scraping with focus on social media is “Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites” by Matthew A. Russell. It is focused on the programming language Python and requires a basic, but nowhere near intermediate, proficiency with the language. I own it myself and it’s a great reference.
It should complement this guide well.
Find it at such places as http://amzn.com/1449388345. (Not an affiliate link.)
Julia Hammond
May 15, 2011, 4:34 p.m.
If anyone likes scraping challenges, check out the release from Allergan.
It’s Flash but the structure is not there except ephemerally in cache. They used SOAP calls to a web service to get the search results. It can be scraped but with a lot of effort.
We tend to use Perl a lot, and then some of the tools Dan is talking about. It is a messy business for sure.
Bill
July 4, 2011, 6:37 p.m.
For Linux and Mac OS X users, a command line tool, ps2ascii, does a pretty good job of converting PDFs to ASCII text. I’ve had less luck converting Postscript files (in spite of the tool’s name).
It performs less well with tables and, obviously, not at all with embedded figures and images. I use it for earth science journal articles. Price is right: free.
Jon
March 15, 2012, 6:58 p.m.
http://www.crummy.com/software/BeautifulSoup/
Python V2 and V3 versions.
I’ve used V3 for a few years, and am moving up to V4 which runs on Python 3.
Very easy to use, includes documentation, examples.
Anna
Feb. 3, 8:14 a.m.
Or you can drag over a webpage’s table text, Copy it, go to your Google Drive page (aka docs.google.com), Create a new spreadsheet, then Paste into it.