This post was co-published with Source.
As part of the orientation week for the 2014 class of Knight-Mozilla OpenNews Fellows, fellow nerd-cuber Mike Tigas and I led a hackathon at Mozilla’s headquarters in San Francisco the goal of which was to build a news application in two days. This is the story of that hackathon and the app we created, told mostly from the perspective of the Fellows who participated.
The six Fellows, who will be embedded for a year in newsrooms around the world, come from a variety of backgrounds. Most had some familiarity with web development. Few had worked in news before. They had also never previously worked together. After the two-day event, the eight of us had finished the rough draft of a demonstration news app based on a data set on tire safety ratings from the National Highway Traffic Safety Administration. Using the app, readers can look up their car’s tires to see how their temperature, traction and treadwear ratings compare to other brands.
While the Fellows had never built a news app, they quickly picked up on the process. By the end of the first day of working as a group, we had almost finished cleaning the data set and importing it into a SQLite database and were ready to delve into user interface work. By the end of the second day, the app was fully searchable and was looking great.
So, how do you make a news app, and how did we make a simple one in two days?
On the surface, you build a news apps a lot like you build other kinds of web apps. They’re usually built around a relational database, using a framework like Django or Rails. They allow users to drive some kind of interaction (and sometimes input stuff that stays resident in the app) and they’re usually accessed on a variety of browsers and platforms.
One key difference: While most web apps are created to be containers for data, news apps are both the container and the data. The developer who makes the app is usually deeply involved in analyzing and preparing the data, and every app is closely tied to a particular data set.
At ProPublica we consciously design news applications to let readers tell themselves stories. Our tools include words and pictures and also interaction design and information architecture. News apps help users (really readers) find their own stories inside big national trends. They can help spur real-world impact by creating a more personal connection with the material than even the most perfectly chosen anecdote in a narrative story can.
We went through each of these four steps to build Tire Tracker.
- Understanding and acquiring data
- Cleaning and bulletproofing data
- Importing data into a database
- Designing and building the public-facing app
Understanding and Acquiring Data
As I said, most news apps enter the world with some or all of the data already in place — typically data that journalists have cleaned and analyzed. In fact, most of the time spent making a news app is usually dedicated to hand-crafting the data set so that it’s cleaned and highly accurate. This is not unlike the process of reporting a story — writing a story typically takes a lot less time than reporting it does.
There are many ways for journalists to acquire and prepare data sets for a news app. One of the easiest ways to obtain data is to find a dataset in a public repository such as data.gov, or get it through an API such as the New York Times Campaign Finance API. In some cases, the data arrives clean, complete and ready to be used, and you’ll be making a news app in no time.
However, acquiring data is not usually that easy. If the data you want is not in a public repository, you can request it from a government agency or company, sometimes through a Freedom of Information request. If the data is available as web pages but not as download able, structured data, you can try scraping it. If it’s a PDF you can try using a tool like Tabula to transform it into a format good for pulling into your database.
Most data sets, in our experience, come with quality problems. Really big, complex data sets often come with really big, complex problems. Because our hackathon was only two days and we didn’t want to spend those two days cleaning data, we picked a data set we knew to be relatively small and clean: NHTSA’s Tire Quality Dataset.
We found a version of the data from 2009 on the data.gov portal, but wanted to work with more recent data, so we called NHTSA, which provided us with an updated version as a CSV file. NHTSA also has tire complaint data, which we downloaded from their website.
Fellow Gabriela Rodriguez worked on researching and obtaining data for the app:
One of the things I worked on was researching possible data sets related to tires and their ratings. It would have been interesting to have data on tire prices and relate them to these ratings, but we couldn't find anything online that was free to use. We also looked into cars — their prices and the tires that came installed on them. Nothing. It’s only logical that this data would be published somewhere, but we couldn’t find it. Collecting some of this data ourselves would have taken more than a day, so we worked with what we had: Tires, ratings, complaints and recalls.
Because this was a demonstration app built in a two day hackathon and not a full-fledged news app, we relied on publicly available information about the data set and a few brief conversations with NHTSA and a tire expert to understand the data, and didn’t do as much reporting as we might normally have done. NHTSA’s data set was small, clean and easy to work with and understand. It got complex when we tried to join the data on other NHTSA datasets like complaints and recalls. The agency did not have unique IDs for each tire, so we wrote algorithms in an attempt to join them by brand name and tire model name.
Cleaning and Bulletproofing
Before you import your data, you need to clean and bulletproof it.
Even if at first glance your data looks good, you never know what problems lurk within until you examine it more closely. There are lots of reasons data can be dirty — maybe whoever assembled it ran up against the limits of some software program, which simply chopped off rows, or maybe they were “helped” by autocorrect, or maybe the software they used to send you the data eliminated leading zeros from ZIP Codes it thought were supposed to be integers. Maybe some values were left blank when they should be “meaningful nulls.”
Effectively cleaning data requires a solid understanding of what each column means. Sometimes this means reading the material the agency publishes about the data. Often this means calling somebody at the agency and asking a barrage of very nerdy questions.
Fellows Harlo Homes and Ben Chartoff worked on cleaning data.
Chartoff worked on making the grading data more usable, especially the complicated “size” column:
I spent most of my time in the bowels of the data, cleaning and parsing the "tire size" field. It turns out a tire's size can be broken down into component parts — diameter, load capacity, cross sectional width, etc. This means that we could break a single size down into eight separate columns, each representing a different value. That's great — it leads to more specificity in the data — but the tire size field in our source data needed a lot of cleaning. While many of the entries in the size field were clean and complete (something like "P225/50R16 91S" is a clean single tire size), many were incomplete or irregular. A size field might just list "16", for example, or "P225 50-60". After spending a while with the data, and on a few tire websites, I was able to parse out what these entries meant. The "16" refers to a 16 inch diameter, with the rest of the fields unknown. It's included in the tire size at the end of the first string, e.g. P225/50R16. P225 50-60, on the other hand refers to three different sets of tires: P225/50?, P225/55?, and P225/60? where the ?'s represent unknown fields. I ended up writing a series of regular expressions to parse sizes in different formats, breaking each entry down into anywhere from one to eight component parts which were each stored separately in the final database.
Holmes worked on joining tire rating data to complaint data, which was tricky because there was no common key between them. She implemented algorithms to find similarities between brand names and tire model names to guess which complaints went with which tires:
I wrote a simple script that fixed these inconsistencies by evaluating the fuzzy (string) distance between the ideal labels in our first data set and the messier labels in our incident report sets. In my initial implementation, I was able to associate the messy labels with their neater counterparts with almost 90 percent accuracy. (I didn't do any real, official benchmarking. It was a hackathon — who has the time?!)
This initial success proved that using fuzzy distances to standardize entity labels was the best way to go. However, certain specific qualities about our data set complicated the algorithm a bit. For example, some manufacturers have multiple lines of a particular product (like “Firestone GTX” and “Firestone GTA”) and so our algorithm had to be adjusted slightly to further scrutinize any entry that appeared to be part of a line of products made by the same manufacturer. To tackle this, I wrote another algorithm that parsed out different versions of a product where appropriate. Once this second layer of scrutiny was applied to our algorithm, the accuracy jumped significantly, and we eliminated all false positive matches.
If this had been a full-bore news app, we would have taken a few weeks to spot check and optimize Harlo’s spectacular matching work, but seeing as how this was a two-day hackathon, we decided not to publish the complaint data. The chance of misleading readers through even one false positive wasn’t worth the risk.
For a complete guide on bulletproofing data, see Jennifer LaFleur’s excellent guide.
Importing Data into a Database
After you’ve obtained, understood and cleaned the data, you’ll create a database schema based on your dataset in your framework of choice, and perhaps write an importer script actually get the data into it. We usually use Rake to write tasks that will import data from CSV, JSON or XML into our database. This makes it easy to recreate the database in case we need to delete it and start again. Our Rake tasks typically lean on the built-in database routines in Rails. If you don't use Rails, your framework will have a vernacular way to import data.
Designing and Building Your App
We design our apps with a central metaphor of a “far” view and a “near” view. The “far” view is usually the front page of an app. It shows maximums and minimums, clusters and outliers, correlations and geographic trends. Its job is to give a reader context and then guide him or her into the inside of the app either via search or by browsing links.
Fellow Aurelia Moser worked on the front page “far” view for the tire app, a grid of how the top-selling tire brands are rated:
As part of the data visualization team, I was meant to tackle some graphical representation of tire grades according to the top brands in the industry. The objective here was to illustrate at a glance what the official tire grade distribution was for the top tire manufacturers. Our initial approach was going to be to create a chart view of the data, but because there were more than 253 brands and tire lines it seemed like the chart might be overwhelming and illegible. Taking 'top-ten' brand data from Tire Review, I built a little matrix in D3 to illustrate grade information by Tireline or Make (y-axis) and Brand (x-axis).
The “near view” is the page on an app that most closely relates to the reader. It could represent the reader’s school, his doctor, a hospital, etc. It lets readers see how the data relates to them personally.
Fellow Brian Jacobs worked on the design and user interface for the near-view pages, which, in this case, let readers drill down to a specific brand and its tires:
I tried a top-level comparative view, showing tire brands at a glance, sorted by averaged tire ratings, and re-sortable by the other quality ratings. Each brand would also show sales volume information if possible. You would then be able to dig deeper into a particular brand, where a tire "report card" view would display, showing a more granular visualization of the quality rating distribution. This would display above a full list of tire models with their respective ratings. So, users would be able to explore brands from top down, and also go directly to their model of choice with a search tool.
It took some time to realize, but it turned out that some weaknesses in the dataset prevented any responsible inclusion of any summary or aggregate data. Omitting specific top-level statistics is unfortunate, as it greatly reduces the ability for the general-audience user from gleaning any quick information from the app, without having a specific tire in mind and essentially eliminates our ability to highlight any patterns.
My colleague Mike Tigas built the search feature for the app:
I focused on implementing a search feature on top of our dataset. I used Tire, a Ruby client for ElasticSearch, because I've used it on previous projects and the library provides simple integration into ActiveRecord models. (ElasticSearch was chosen over a normal SQL-based full-text search since we wanted to provide a one-box fuzzy search over several of the text fields in our dataset.) Amusingly, a lot of my time was spent on code issues related to using a software library named "Tire" in the same app as a Ruby model named "Tire". (We later renamed our model to "Tyre", internally.)
And Fellow Marcos Vanetta did a little of everything:
Once you’ve loaded your data, designed and built an interface and spot-checked your app, you’ll want to deploy it. We use Amazon EC2 to host our apps. A year or so ago we published a guide that goes into all the nerdy details on how we host our work — as well as a second guide that explores other ways to host an app.
Our tire quality application was, of course, more of an exercise in learning how to write stories with software than a full-blown investigation. We didn’t delve deeply into tire brands or do complex statistics we might do if we were more seriously analyzing the data. Nevertheless, the six 2014 OpenNews fellows started getting familiar with our approach to projects like these, and we can’t wait to see what else they come up with over the course of their newsroom year.