ProPublica

Journalism in the Public Interest


The ProPublica Nerd Blog

The Coder’s Cause in “Dollars for Docs”


Photo by Dan Nguyen/ProPublica


Our investigation of the financial ties between drug companies and doctors, Dollars for Docs, was sparked by a computational challenge. Several drug companies had been ordered to disclose who they paid to speak and consult on their behalf. But they made the records hard to analyze, seemingly making the data “impossible to download.”

We wanted to change that.

When Pfizer posted an online database of its doctor payments earlier this year, I wrote a tutorial for aspiring journalist-programmers on writing software to collect data from sites like Pfizer's, which was difficult to use for anything beyond looking up individual names.

Pfizer’s doctor payments, which the company was ordered to disclose, are an example of public, but not yet fully useful, information.

My ProPublica colleagues Charles Ornstein and Tracy Weber proposed that we collect the payments from the other companies that had made similar disclosures to see what we could learn from the aggregated data.

Pfizer's disclosure website

As far as we know, there had not yet been a freely available database of all of the disclosed doctor payments. The federal government will create one in 2013, one of the lesser-known mandates of the health care reform bill passed this year. But until then, the companies haven’t made it easy to comb through their disclosures.

The headline on an April 12 New York Times article sums it up: “Data on Fees to Doctors Is Called Hard to Parse.”

Eli Lilly’s initial decision to issue its report as an Adobe Flash presentation was particularly aggravating from a transparency perspective:

Carole Puls, a Lilly spokeswoman, said the company purposely made its report impossible to download "to protect the integrity of the data." Lilly was concerned someone could change numbers and create a false report outside the company’s Web site, Ms. Puls said.

Lilly had trumpeted the fact that it was both the first major pharmaceutical company to support the federal payments disclosure law and the first set to release a registry of its physician payments online, though the latter was a condition of its settlement of a $1.4 billion off-label marketing lawsuit.

Lilly's disclosure website

It didn’t seem right for companies to tout their commitment to transparency while making their records cumbersome to all but the most cursory of examinations. So we decided to create a truly transparent database, both to aid our investigation and so that readers could see what payments, if any, had been made to their doctors.

Scrape and Share

There is no data on the Internet that is actually impossible to download. PharmaShine, a company that has built a business around collecting and selling access to the doctor payments data, told the Times that it had manually retyped Lilly’s records.

We decided to write some code to automatically copy Lilly’s online data, which probably wasn’t much faster than hiring a manual transcription service like Amazon’s Mechanical Turk. But it was an interesting challenge, whose solution we could reuse and share with everyone else who comes across difficult-to-parse documents.

Lilly ended up releasing its list as a downloadable PDF, so we focused our efforts on examining the combined payments list and cross-checking it with federal and state records.

We shared the data with nearly a dozen newsrooms, who partnered with us to examine the companies' policies and standards for the doctors they hired to promote their products. Our latest story focused on medical schools unaware of faculty members violating their conflict-of-interest policies, a testament to how even esteemed academic institutions were stymied by scattered and inaccessible payments data.

Building a database-backed site was one of the main goals of this project, both so that readers could freely search it, and to start a conversation about what might go into the federal database scheduled for 2013. Since our database's launch in October, readers have searched for payments to their doctors more than a million times. Dozens of news outlets have used our data to do investigations of their own.

We also want to share how we gathered the data from a variety of formats, because the methods apply to many other kinds of public records plagued by digital hurdles and inscrutable formats.

If you’re interested in using the Dollars for Docs data, you can contact us here. We hope that these guides are useful for our fellow journalists and researchers as they tackle other sources of data.

The Dollars for Docs Data Guides

Introduction: The Coder's Cause – Public records gathering as a programming challenge.

  1. Using Google Refine to Clean Messy Data – Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.
  2. Reading Data from Flash Sites – Use Firefox's Firebug plugin to discover and capture raw data sent to your browser.
  3. Parsing PDFs – Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.
  4. Scraping HTML – Write Ruby code to traverse a website and copy the data you need.
  5. Getting Text Out of an Image-only PDF – Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).
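To give a flavor of the scraping pattern the HTML guide describes, here is a minimal Ruby sketch: extract rows from a payment table and write them to a CSV file. The HTML snippet and column layout are hypothetical, and for brevity it uses regular expressions on an inline string; a real scraper would fetch each page over HTTP and use a proper parser such as Nokogiri.

```ruby
require 'csv'

# Hypothetical fragment of a company's payment-disclosure page.
# A real scraper would fetch this with net/http or open-uri.
html = <<~HTML
  <table id="payments">
    <tr><td>Jane Smith</td><td>Speaking</td><td>$2,000</td></tr>
    <tr><td>John Doe</td><td>Consulting</td><td>$1,500</td></tr>
  </table>
HTML

# Pull out each table row, then each cell within the row.
rows = html.scan(%r{<tr>(.+?)</tr>}m).map do |(row)|
  row.scan(%r{<td>(.*?)</td>}).flatten
end

# Write the structured records to a spreadsheet-friendly CSV.
CSV.open('payments.csv', 'w') do |csv|
  csv << %w[doctor category amount]
  rows.each { |r| csv << r }
end
```

The point of the exercise is the shape of the pipeline, not the parsing trick: once disclosure pages are reduced to rows and columns, they can be combined across companies and searched like any other database.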
