Scraping for Journalism: A Guide for Collecting Data

A series of programming and technical guides on how we collected data for Dollars for Docs.

December 30, 2010, 12:23 pm

Our Dollars for Docs news application lets readers search pharmaceutical company payments to doctors. We’ve written a series of how-to guides explaining how we collected the data.

Most of the techniques are within the ability of a moderately experienced programmer. The most difficult site to scrape was actually an earlier Adobe Flash incarnation of Eli Lilly’s disclosure site. Lilly has since released its data in PDF format.

These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you’re already an experienced programmer, you might learn about a new library or tool you haven’t tried yet.

If you are a complete novice with no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites, so you know what you’re asking for if you end up hiring someone to do the technical work for you.

The tools

With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.

Google Refine (formerly known as Freebase Gridworks) – A sophisticated application that makes data cleaning a snap.

Firebug – A Firefox plug-in that adds a host of useful development tools, including the tracking of parameters and files received from web sites you plan to scrape.

Ruby – The programming language we use the most at ProPublica.

Nokogiri – An essential Ruby library for scraping web pages.

Tesseract – Google’s optical character recognition (OCR) tool, useful for turning scanned text into “real,” interpretable text.

Adobe Acrobat – Can (sometimes) convert PDFs to well-structured HTML.

The guides assume some familiarity with your operating system’s command line (Windows or Mac).

A Guide to the Guides

Introduction: The Coder’s Cause – Public records gathering as a programming challenge.

Using Google Refine to Clean Messy Data – Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.
Reading Data from Flash Sites – Use Firefox’s Firebug plugin to discover and capture raw data sent to your browser.
Parsing PDFs – Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.
Scraping HTML – Write Ruby code to traverse a website and copy the data you need.
Getting Text Out of an Image-only PDF – Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).
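As an illustration of the "some Ruby scripting" step in the PDF guide, the sketch below turns column-aligned text into CSV using only Ruby's standard library. It assumes a PDF-to-text utility has already produced text in which columns are separated by two or more spaces. The sample text and the `text_to_rows` helper are hypothetical, and real converter output is often messier than this.

```ruby
require 'csv'

# Hypothetical text, laid out the way a PDF-to-text tool might emit a table.
# The names and amounts are invented for this example.
SAMPLE_TEXT = <<~TEXT
  Name              City           Amount
  Doe, Jane         Springfield    1200
  Roe, John         Shelbyville    850
TEXT

# Split each non-blank line into fields wherever two or more spaces occur.
# Single spaces inside a field (such as "Doe, Jane") are left alone.
def text_to_rows(text)
  text.each_line
      .map(&:strip)
      .reject(&:empty?)
      .map { |line| line.split(/\s{2,}/) }
end

rows = text_to_rows(SAMPLE_TEXT)

# CSV.generate quotes any field that contains a comma.
csv_string = CSV.generate do |csv|
  rows.each { |row| csv << row }
end

puts csv_string
```

Splitting on runs of whitespace is a simplification. It breaks when a cell is empty or when two columns sit only one space apart, so the result should be checked against the original document.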

Dan Nguyen

Series: Dollars for Doctors: How Industry Money Reaches Physicians
