In the previous guide, we described several methods for turning PDFs into data usable in spreadsheets. However, those methods only handle PDFs that have actual text embedded within them. When a PDF contains just images of text, as scanned documents do, the problem isn't just how to convert it into neat tabular data, but how to extract any text, period.
When the Consumer Product Safety Commission provided data in October, the agency said it had received fewer than 3,500 reports of tainted drywall. ProPublica and the Sarasota Herald-Tribune compiled a list of addresses from county property appraiser data and records in consolidated lawsuits filed in New Orleans federal court and found nearly twice that number: around 6,900 homes.
Wednesday the Federal Reserve released data on more than 21,000 loans and other deals it made through a dozen emergency programs created during the financial crisis. We’ve combined the Fed’s three programs that loaned directly to banks and other financial firms with the goal of getting them to start lending again.
As part of ProPublica’s “Dollars for Docs” series and interactive news application, we've created a small widget that you can embed on your web site. It will let your readers look up whether their health care providers are taking money from the drug companies in our database. The widget shows the amount of money paid to each practitioner in our database, which company made the payment, and in some cases, what the companies said they were paying for: speaking fees, consulting, etc. The widget also lists what drugs each company sells so readers can check their own prescriptions.
On Oct. 8, we published an investigation examining how a judicial opinion in a pivotal lawsuit brought by a Guantanamo detainee vanished, only to be replaced weeks later by an entirely different opinion. At the center of our reporting are two documents representing separate versions of that same opinion: the original opinion written by Judge Henry H. Kennedy, and a second opinion quietly put in the original's place more than a month later.
Why are there two opinions? As reporter Dafna Linzer explains, redactions that were supposed to be made in the original opinion never were. Once government security officials, who are responsible for reviewing and redacting classified information from sensitive cases, discovered the error, the decision was quickly removed from the court file. In Judge Kennedy’s courtroom four days later, the Justice Department refused to have the opinion redacted and re-released. With the detainee, Uthman Abdul Rahim Mohammed Uthman, slated for indefinite detention, the stakes were high. Officials did not want to risk that those who had seen the original opinion would know exactly what the government had meant to keep classified.
On Wednesday, we launched an interactive news application to help readers understand the cross-owned nature of Collateralized Debt Obligations (CDOs) in 2006-2007. This cross-ownership helped inflate the bubble, and ultimately made the financial crisis worse.
We received a list of cross-owned CDOs as a result of a study ProPublica commissioned from Thetica, a consulting company in New York. It consisted of a list of CDOs, the banks that sponsored them, the CDO managers who managed them, and, for each CDO, an enumerated list of the other CDOs in which it had both sold and bought a stake. Reporters Jake Bernstein and Jesse Eisinger had already used the data in their story, Banks’ Self-Dealing Super-Charged Financial Crisis.
See which CDOs exchanged pieces with other CDOs through our interactive feature that reveals the incestuous nature of Wall Street’s CDO business.
Since the day we launched, ProPublica has encouraged people to republish our stories for free. We even license our stories under Creative Commons (CC). However, in the past we've had trouble knowing precisely which stories had been republished where, and we had no way of knowing how many people were reading our stories on sites that republished them under our CC license.
Shortly after the redesign of our site, we started working on a system that would help us solve this problem. When we found out that Jeremy Ashkenas, a developer at DocumentCloud, was working on a similar problem, we joined forces, and finished work on a lightweight stats tracker, which we are open sourcing today.
World, meet Pixel Ping.
Today we are introducing our Nerd Blog, a place to talk about what programmer-journalists at ProPublica are working on, announce newly launched news applications, and hear from technically minded readers as well as our fellow nerdy journalists. We’re going to be writing about each of our projects as we release them, and flagging open source tools we’ve found useful.
So what the heck is a “news application”? It’s an interactive web page that uses software instead of words and pictures to do journalism.
In the last two years of the boom, CDOs created by one bank commonly purchased slices of other CDOs created by the same bank.
Follow the damage claims from the Gulf Oil Spill paid by BP.
Our frequently updated database tracks every dollar and every bailout recipient. Check out our scorecard to see where the spending stands.
Using results from a questionnaire we did with American Public Media’s Public Insight Network, we examined how the proposed health care reforms will actually affect people facing common health care coverage situations.
How did Magnetar’s deals in subprime mortgage securities compare to the overall market’s?
Compare the Senate version of the 2010 Health Care overhaul bill with the final bill.
How easy does your state make it to investigate licensed nurses online?
ProPublica hosts newsroom developers -- or developers who want to see what it's like to work in news -- for 3-5 day job shadowing residencies called the ProPublica Pair Programming Project, or P5.
Use ProPublica's data -- cleaned, categorized and often created from multiple sources -- in your reporting and research.