How To Edit 52,000 Stories at Once

Today we're launching an update to our Opportunity Gap news application, with two new data points, better design on smartphones and integration with the Foursquare social network.

Also launching with this new release are short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science, a startup based in Chicago. Narrative Science launched in 2010 out of a research project at Northwestern University. Their platform uses artificial intelligence to turn structured data into human-readable narratives. For example, Forbes uses their technology to turn quarterly corporate earnings reports into short narratives.

We were introduced to Narrative Science by the MacArthur Foundation, which has supported ProPublica since our inception in 2007. In our initial conversations with the team at Narrative Science, we realized that ProPublica's data team and Narrative Science share a common goal: To make data tell stories. We also had a hypothesis, which is that adding narratives to each school page would provide an easier way for people who learn verbally rather than visually to understand the data.

Here’s how it worked, and what you might expect if you’re a news organization looking to experiment with algorithmically generated stories.

The project started with a call with Narrative Science to talk about what was important in the data. We told them that the “nut” of our news application was that all too often, school districts and states don't distribute educational opportunities to rich and poor kids equally.

We sent Narrative Science’s engineers a raw data set and walked them through the field layout and data gotchas.

Within a few weeks we started trading drafts, about a half dozen at a time, in the form of sample narratives. We homed in until the narratives clearly expressed what the often-complex data meant. In addition to making sure the data was right and that we described it consistently, we wanted the copy to tell the same story as our interactive pages did: a story not just about individual schools but about how each school related to other schools in the state with different poverty levels. In that way, we worked with the engineers at Narrative Science much as an editor works with a reporter.

We also had a few style issues, nothing that wouldn’t be familiar to editors and reporters everywhere. Smaller edits ranged from applying AP style to pruning redundant clauses.

One thing we learned was that mentioning a data point in a narrative made it seem much more important than simply including it on the page of an interactive database, so we spent time picking the right variables to “promote” to the narrative.

On to some of the more technical stuff: In addition to the data that’s visible on each page in the interactive database, our data includes some numbers used behind the scenes, such as calculated fields and an explicit pairwise relationship between similar schools. We found that it was easier to send those data points to Narrative Science rather than have them redo the often-complex calculations. Small differences between the narrative description and the graphic were cropping up, and that was the easiest way to avoid them.
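As a rough illustration of why shipping precomputed values helps, here is a minimal sketch in Python. It is not our production code: the field names (`enrollment`, `ap_students`, `pct_ap`, `comparison_school_id`) and the function `build_export_record` are hypothetical, chosen only to show one place computing a derived number so the narrative and the graphic can both display the same value.

```python
def build_export_record(school, comparison_school_id):
    """Bundle raw and derived fields for one school into a single record.

    All field names here are hypothetical stand-ins. The point is that the
    percentage is calculated once, on our side, and travels with the raw
    counts instead of being recomputed by whoever writes the narrative.
    """
    enrollment = school["enrollment"]
    ap_students = school["ap_students"]

    # Derived field: computed here so rounding rules live in exactly one place.
    pct_ap = round(100 * ap_students / enrollment) if enrollment else None

    return {
        "school_id": school["school_id"],
        "enrollment": enrollment,
        "ap_students": ap_students,
        "pct_ap": pct_ap,
        # Explicit pairing with a similar school, decided upstream.
        "comparison_school_id": comparison_school_id,
    }
```

Two systems that each compute a percentage can disagree by a point simply because they round differently, which is the kind of small mismatch that sending the finished number avoids.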

Unlike in a normal editing process, we weren't working with a single story but with tens of thousands of them. It isn’t practical to read 52,000 narratives. Also, Narrative Science’s systems are more complex than simple boilerplate with interpolated variables. Editing one narrative does not mean you’ve edited them all. In addition to recasting whole paragraphs, their systems generate a variety of sentences to express the same kind of data, so that reading the narratives for several schools would seem more natural and not automated. As a result, edits that made sense in one case ended up not working in other cases, and sentences that seemed correct given one set of circumstances seemed wrong in others, often subtly.

We started getting larger samples of the generated narratives, and we pulled random samples to spot check, just as we do with any data project, looking for problems with agreement between the narrative and the graphic, and for any confusing wording generated by the algorithm. Of course, if you see any weirdness, let us know.
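For readers who want to try this kind of spot check, the sketch below shows one way it could look in Python. It is an assumption-laden illustration, not the process we actually ran: the function names are hypothetical, and the agreement check only looks for numbers written as digits, so a narrative that says "half" instead of "50" would be flagged for a human to review.

```python
import random
import re


def pick_spot_check_sample(school_ids, sample_size, seed=None):
    """Return a random sample of school IDs to read by hand.

    Passing a seed makes the sample reproducible, so two editors
    can review the same set of narratives.
    """
    ids = list(school_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def numbers_missing_from_narrative(narrative, expected_values):
    """Return the expected values that never appear as digits in the text.

    Thousands separators are stripped first, so "1,200" matches 1200.
    """
    found = set(re.findall(r"\d+(?:\.\d+)?", narrative.replace(",", "")))
    return [value for value in expected_values if str(value) not in found]
```

A check like this narrows down which narratives deserve a closer read; it does not replace reading them.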

Ultimately, Narrative Science delivered more than 52,000 narratives to us as JSON files over FTP, which we imported into our news application.
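An import step along these lines might look like the following Python sketch. The file layout is an assumption made for illustration: it supposes one JSON file per school with hypothetical `school_id` and `narrative` keys, which may not match the format Narrative Science actually delivered, and `load_narratives` is a made-up name.

```python
import json
from pathlib import Path


def load_narratives(directory):
    """Read every .json file in a directory into a {school_id: narrative} dict.

    Assumes (hypothetically) that each file holds one object with
    "school_id" and "narrative" keys. Files that fail to parse or lack
    those keys are collected in a separate list instead of stopping the run.
    """
    narratives = {}
    problems = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
            narratives[str(record["school_id"])] = record["narrative"]
        except (ValueError, KeyError, TypeError) as exc:
            # ValueError covers malformed JSON and bad encodings.
            problems.append((path.name, repr(exc)))
    return narratives, problems
```

Collecting the failures rather than raising on the first one matters at this scale: with tens of thousands of files, it is more useful to see every bad file in one pass.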

Whatever the future of the news industry holds, it seems clear that it will involve trying lots of things. And from our vantage point, algorithmic story generation seems like an intriguing solution to the problem of scaling narrative journalism in the era of big data. It’s helped us tune our data journalism for a new audience, and helped create stories at scale that would have been unthinkable otherwise.
