
The ProPublica Nerd Blog

Chapter 5: Getting Text Out of an Image-Only PDF


In the previous guide, we described several methods for turning PDFs into data usable in spreadsheets. However, those only handle PDFs that have actual text embedded within them. When a PDF contains just images of text, as scanned documents do, the problem isn't just how to convert it into neat tabular data, but how to extract any text, period.

In this tutorial, we'll explain how to write a program to extract the data into tabular format. Here's an overview of the basic steps:

1. Determine the positions of the lines that divide the rows and columns on a page.

2. Break the image apart along those lines to create (hundreds of) individual image files, one for each cell.

3. Perform optical character recognition on each cell to translate the image into a textfile.

4. Reassemble these (hundreds of) text files in the same order that you divided the main image, creating a (text) spreadsheet of the data.

A caveat: the code examples provided here are specific to the layout of the FlashPaper version of Eli Lilly's doctor payment disclosures, in particular the black outlines of its table cells.

Background

Eli Lilly became the first major drug company to post its physician payments online in July 2009. However, Lilly was criticized in an April 2010 New York Times article for using a proprietary and discontinued format -- Adobe's "FlashPaper" -- which made the data virtually impossible to download or copy. In fact, PharmaShine, a company that maintains a commercial database of physician payments, said it had to manually retype the entire list.

Eli Lilly disagrees with the Times story's characterization that it had "purposely made its report impossible to download." In an e-mail to ProPublica, Lilly spokesman J. Scott MacGregor called the characterization "misleading, as it was never our intention to make it difficult for people to access information," and said that preserving the integrity of the data was the reason for not initially making it downloadable.

In any event, Lilly now provides its data as a PDF with copyable text. However, when we began work on Dollars for Docs, Lilly was only providing data in the FlashPaper format, so we came up with a way to download the file and programmatically extract tabular data from it.

Software to Get

To follow along you'll need Ruby with the RMagick gem (which requires ImageMagick), the Tesseract OCR engine, and a way to convert the PDF into TIFF images: either Adobe Acrobat Pro or my colleague Jeff Larson's pdf-splitter utility.

Downloading the FlashPaper Document

Visit the Lilly faculty registry and observe your browser's traffic through Firebug (see my previous tutorial). You should see a file called data.swf with a size of more than 1MB.

The Safari browser's Activity window also will reveal the files downloaded to your browser; data.swf is what we want.

Download the file and open it in your web browser. There's a printer icon in the top-right. Click on it and choose PDF as the output. You'll end up with a PDF weighing over 200MB. When you open it up, you'll see that it appears to be a normal table of text. But you won't be able to highlight-and-copy (which is also the case for secure PDFs), and saving as text will create an empty file.

So let's convert it to the TIFF image format, the only format Tesseract can read. You can do this either with Adobe Acrobat Pro's Export function, or my colleague Jeff Larson's pdf-splitter:

pdf-splitter data.pdf '%d.tif' 2500

This will result in .tif files for every page in the original PDF.

Step 1: Reading Lines With RMagick

RMagick is a library for Ruby that lets us perform a variety of graphics operations in our program, such as changing an image's colors. In fact, that's the first thing we want to do, because Tesseract works best with black-and-white images. This is done using RMagick's quantize method. We can also reduce the gray by boosting the image's contrast.

We had limited success doing this programmatically and so just used a Photoshop batch operation to get the desired black-and-white contrast. For the purposes of this tutorial, you can use this sample black-and-white excerpt: table-to-ocr.tif

	require 'rubygems'
	require 'rmagick'

	img = Magick::Image::read('1.tif').first

	# Your mileage may vary with the following method; we ended up just using a Photoshop batch job to produce a 
	# black-and-white version
	img = img.sigmoidal_contrast_channel(5,40,true).quantize(16, Magick::GRAYColorspace).posterize(5)

	# write out the result just to see what it looks like
	img.write('bw-1.tif'){|f| f.depth = 8}
	# the block sets the saved image to a depth of 8-bits

Zooming in reveals that the black lines are actually composed of gray pixels. Detecting the lines in the .tif file simply involves finding the rows and columns in which virtually all of the pixels are non-white. This can be done using RMagick's get_pixels method on every horizontal and vertical line; get_pixels returns an array of pixels within the boundaries we specify.

RMagick's Pixel class has red, blue, and green attributes. Examining the red, blue, and green values of a white pixel should give you 65535 for each; a black pixel will return 0. Gray pixels fall anywhere in between.
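
To make the test concrete, here's a minimal sketch. The dark? helper is our own shorthand, using the 63000 cutoff we settle on below, and it assumes img is the black-and-white image loaded earlier:

	# Our own helper: treat anything darker than near-white as part of a line.
	def dark?(pixel)
	  pixel.red < 63000
	end

	sample = img.get_pixels(0, 0, 1, 1).first   # the top-left pixel
	puts(dark?(sample) ? "part of a line" : "background")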

So first we crop the image to remove the white space surrounding the table (using bounding_box). Then, for each horizontal line of pixels, we test a sample of it (the first 10 percent of the row) and record its position if nearly all of those pixels (95 percent or more) have color values darker than a near-white threshold (63000 seems to be enough tolerance):

box = img.bounding_box
img.crop!(box.x, box.y, box.width, box.height)

line_rows = (0..img.rows-1).inject([]){ |arr, line_index|
	test_length = (img.columns*0.10).floor
	arr << line_index if img.get_pixels(0, line_index, (test_length), 1).select{|pixel| 
		pixel.red < 63000 }.length >= test_length*0.95 
	arr
}

# same for the column lines
# line_cols = ...

# Some lines are more than one pixel wide. The following code compresses line_cols (and, run the same way, line_rows) into [start, end] pairs:

# (do the same with line_rows to build the rows array)
columns = line_cols[1..-1].inject( [[ line_cols[0], line_cols[0] ]] ){ |arr, line|
  if line == arr.last[1] + 1
    arr.last[1] = line
  else
    arr << [line, line]
  end
  arr
}


# Which returns this array of columns, each column defined by a two-element array with the start and end, width-wise, of each line. Example:
# => [[0, 1], [211, 212], [420, 421], [630, 631], [747, 747], [900, 901], [942, 942], [1115, 1116], [1156, 1157], [1301, 1301], [1341, 1342], [1500, 1500]]
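
As a quick sanity check (our addition, assuming you've also built the analogous rows array), a table bounded by N vertical lines should contain N - 1 columns of cells:

	# Our own sanity check: N boundary lines enclose N - 1 cells in that direction.
	puts "#{columns.length} vertical lines -> #{columns.length - 1} columns of cells"
	puts "#{rows.length} horizontal lines -> #{rows.length - 1} rows of cells"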
    

Step 2: Breaking It Into Pieces

Now that we have two arrays defining rows and columns, we iterate through each one and call RMagick's constitute method, which builds a new image from the pixel data and dimensions we give it. We then write each image to a file named column_numberxrow_number.tif:

Dir.mkdir('cell-files') if !File.exists?('cell-files')

rows[0..-2].each_with_index do |row, i|
  columns[0..-2].each_with_index do |column, j|
    # each cell starts just inside the current dividing lines and extends to the next ones
    x,y = column[1], row[1]
    w,h = columns[j+1][0]-x, rows[i+1][0]-y
    Magick::Image.constitute(w, h, "RGB", img.get_pixels(x,y,w,h).map{ |pixel| 
       [pixel.red, pixel.green, pixel.blue]}.flatten).write("cell-files/#{j}x#{i}.tif"){|out| out.depth=8}
  end
end

You should end up with a directory called cell-files with nearly 500 TIFF files in it.
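
A quick way to confirm the split worked (our addition) is to count the cell images:

	# Count the cell images written out; for a full Lilly page this should be
	# close to 500.
	puts Dir.glob('cell-files/*.tif').length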

Step 3: Tesseract Each Image

Tesseract is a free optical character recognition program, first developed by HP and now maintained as open-source software by Google. Its operation is simple: point it to an image file, and it produces a text file with what it interprets as text from that image. So we can simply run Tesseract on each TIFF as it is created. Add this line to the above code, right after the constitute call:

`tesseract cell-files/#{j}x#{i}.tif cell-files/#{j}x#{i}`

Now you should have nearly 500 text files in cell-files.

Step 4: All Together Now

In the previous code, we're essentially stepping through the image column by column, line by line. While we're in this loop, we might as well record each of the text files' content into one master text file. If we add a delimiting character each time, we end up with tabular data.
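
The delimiter idea is simply this (a toy illustration with placeholder values):

	# Three placeholder cell values joined with tabs become one spreadsheet row.
	cells = ["first cell", "second cell", "third cell"]
	puts cells.join("\t")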

So, in the previous block of code, open a text file called "1-table.txt" and after the tesseract call, write the contents of that tiny text file into "1-table.txt." The combined code for Steps 2-4 is:

	require 'rubygems'

	require 'rmagick'

	img = Magick::Image::read('1.tif').first

	box = img.bounding_box
	img.crop!(box.x, box.y, box.width, box.height)

	line_rows = (0..img.rows-1).inject([]){ |arr, line_index|
		test_length = (img.columns*0.10).floor
		arr << line_index if img.get_pixels(0, line_index, (test_length), 1).select{|pixel| 
			pixel.red < 63000 }.length >= test_length*0.95 
		arr
	}


	# same for the columns

	line_cols = (0..img.columns-1).inject([]){ |arr, line_index|
		test_length = (img.rows*0.10).floor
		arr << line_index if img.get_pixels(line_index, 0, 1, (test_length)).select{|pixel| 
			pixel.red < 63000 }.length >= test_length*0.95 
		arr
	}


	columns = line_cols[1..-1].inject( [[ line_cols[0], line_cols[0] ]] ){ |arr, line|
	  if line == arr.last[1] + 1
	    arr.last[1] = line
	  else
	    arr << [line, line]
	  end
	  arr
	}


	rows = line_rows[1..-1].inject( [[ line_rows[0], line_rows[0] ]] ){ |arr, line|
	  if line == arr.last[1] + 1
	    arr.last[1] = line
	  else
	    arr << [line, line]
	  end
	  arr
	}


	Dir.mkdir('cell-files') if !File.exists?('cell-files')

	output_file = File.open('1-table.txt', 'w')

	rows[0..-2].each_with_index do |row, i|
	  text_row = []
	  columns[0..-2].each_with_index do |column, j|
	    x,y = column[1], row[1]
	    w,h = columns[j+1][0]-x, rows[i+1][0]-y
	    Magick::Image.constitute(w, h, "RGB", img.get_pixels(x,y,w,h).map{ |pixel| 
	        [pixel.red, pixel.green, pixel.blue]}.flatten).write("cell-files/#{j}x#{i}.tif"){|out| out.depth=8}
	    `tesseract cell-files/#{j}x#{i}.tif cell-files/#{j}x#{i}`

	    # open the file Tesseract just created and store the text into an array
	    text_row << File.readlines("cell-files/#{j}x#{i}.txt").map{|line| line.strip}.join(" ")
	  end

	  # join the array with tab characters and then output it as a line
	  output_file.puts( text_row.join("\t") )
	end
	output_file.close
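
Before moving on, it's worth a quick check (our addition, not part of the original script) that every row came out with the same number of tab-separated cells:

	# Count the cells in each line of the finished table; every row should
	# report the same number.
	cell_counts = File.readlines('1-table.txt').map{ |line| line.chomp.split("\t", -1).length }
	puts "#{cell_counts.length} rows; cells per row: #{cell_counts.uniq.inspect}"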

Cleanup Time

If that seemed too easy, it was. Open the resulting text file in a spreadsheet:

Tesseract's imperfect translation of the images we sent it.

Tesseract isn't perfect, and on the first pass it may mistranslate many characters, especially ones that look similar to one another, such as 'O' and '0' (zero). You can train the Tesseract engine, though it's tedious and involves giving it a character-by-character correction of a sample text.

Even then it won't be perfect. Your next step will be to determine the best way to clean this data. You could start by validating each column against what you know it should contain (such as a currency format). Or you could design a Mechanical Turk task in which you send an individual text file and TIFF for each cell and ask workers to perform a simple verification. You could even write your own Rails application to display the images and text-values side by side, so that your co-workers can collaboratively do the verification (this is what we did until Lilly released their data as PDFs).
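
For example, a first pass at the currency-format idea might look like the sketch below. This is our own illustration, not part of the original workflow; the column index (5 here) and the dollar-amount pattern are assumptions you'd adjust to the actual table.

	# Our own sketch of the column-validation idea: flag any row whose payment
	# cell doesn't look like a dollar amount. The column index and pattern are
	# illustrative assumptions.
	suspect_rows = []
	File.readlines('1-table.txt').each_with_index do |line, row_index|
	  payment = line.chomp.split("\t")[5].to_s
	  suspect_rows << row_index unless payment =~ /\A\$?[\d,]+(\.\d{2})?\z/
	end
	puts "#{suspect_rows.length} rows need a manual check"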

Again, the code here is specific to Lilly's format and may not be as successful on a scanned document where, for example, the lines aren't as easy to determine.

We hope to craft a more generalized version of this guide in the near future. We'll call the project "Tableract" for now.

The Dollars for Docs Data Guides

Introduction: The Coder's Cause – Public records gathering as a programming challenge.

  1. Using Google Refine to Clean Messy Data – Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.
  2. Reading Data from Flash Sites – Use Firefox's Firebug plugin to discover and capture raw data sent to your browser.
  3. Parsing PDFs – Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.
  4. Scraping HTML – Write Ruby code to traverse a website and copy the data you need.
  5. Getting Text Out of an Image-only PDF – Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).

Nice guide. All of what you describe can be done with GNU/Linux (e.g., Debian, Ubuntu, Fedora, etc.), oftentimes using the same or similar tools. You might want to add Unpaper, which is a nice command line tool for cleaning up images of scanned text before running Tesseract.

Another possible approach is to use Google Docs. You can upload either plain image files or images contained in PDF files and Google Docs will run OCR on them and return a text document.

Andrew Evans

Sep. 18, 3:15 p.m.

In splitting apart the pdf, Harris Corner recognition is great as well. Open CV can perform this. Box corners usually are easy to segment by finding the corners, and following the ‘line’ from them or finding basic stats. Clustering works well with the stats.

Good article. Thanks

Kurt Pfeifle

Sep. 29, 4:37 p.m.

You wrote about your usage of ‘PDF-Splitter’:

> My colleague Jeff Larson’s command-line utility to split a PDF into
> the TIFF image format (requires Mac OS X Snow Leopard). This can
> be done (though not as easily) with Adobe Acrobat Pro.

I’m wondering, really wondering: your organisation seems to be extremely tech-savvy in doing what it does. I applaud all your efforts!

However, you do not seem to ever have heard of *Ghostscript*?! Ghostscript does not only run on ‘Mac OS X Snow Leopard’, but on all OSX versions. And on all Windows versions. And on all Linux versions. And on all Unix versions….

It is easy to convert any PDF into a TIFF (G4, or colored TIFFs if you want). For the case of TIFF G4, just run:

  gs -o myfile-page-%03d.tif -sDEVICE=tiffg4 myfile.pdf

Each page of the PDF will be a separate TIFF, named `myfile-page-001.tif`, `myfile-page-002.tif`, etc.

Should you want a multi-page TIFF instead, just use as output name `myfile.tif` ...

Kurt Pfeifle

Sep. 29, 4:45 p.m.

I forgot: Ghostscript by default uses a resolution of 72 ppi for all image output formats (though for TIFF G4 the default may be 204x196 ppi—not sure without checking…).

For post-processing the TIFF images in Tesseract, 200 ppi or 300 ppi are probably a better parameter.

This can be taken care of on the commandline by adding

    -r300

to its parameter set.
