Journalism in the Public Interest

The ProPublica Nerd Blog

How To Edit 52,000 Stories at Once



Today we're launching an update to our Opportunity Gap news application, with two new data points, better design on smartphones and integration with the Foursquare social network.

Also launching with this new release are short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science, a startup based in Chicago. Narrative Science launched in 2010 out of a research project at Northwestern University. Their platform uses artificial intelligence to turn structured data into human-readable narratives. For example, Forbes uses their technology to turn quarterly corporate earnings reports into short narratives.

We were introduced to Narrative Science by the MacArthur Foundation, which has supported ProPublica since our inception in 2007. In our initial conversations with the team at Narrative Science, we realized that ProPublica's data team and Narrative Science share a common goal: to make data tell stories. We also had a hypothesis: that adding narratives to each school page would make the data easier to understand for people who learn verbally rather than visually.

Here’s how it worked, and what you might expect if you’re a news organization looking to experiment with algorithmically generated stories.

The project started with a call with Narrative Science to talk about what was important in the data. We told them that the “nut” of our news application was that all too often, school districts and states don't distribute educational opportunities to rich and poor kids equally.

We sent Narrative Science’s engineers a raw data set and walked them through the field layout and data gotchas.

Within a few weeks we started trading drafts – about a half dozen at a time – in the form of sample narratives. We homed in until the narratives clearly expressed what the often-complex data meant. In addition to making sure our data was right and our descriptions of it were consistent, we wanted the copy to tell the same story as our interactive pages did – a story not just about individual schools but about how each school related to other schools in the state with different poverty levels. In that way, we worked with the engineers at Narrative Science much as an editor and reporter do.

We also had a few style issues, nothing that wouldn’t be familiar to editors and reporters everywhere. Smaller edits ranged from applying AP style to pruning redundant clauses.

One thing we learned was that mentioning a data point in a narrative made it seem much more important than simply including it on the page of an interactive database, so we spent time picking the right variables to “promote” to the narrative.

On to some of the more technical stuff: In addition to the data that’s visible on each page in the interactive database, our data includes some numbers used behind the scenes, such as calculated fields and an explicit pairwise relationship between similar schools. We found that it was easier to send those data points to Narrative Science rather than have them redo the often-complex calculations. Small differences between the narrative description and the graphic were cropping up, and that was the easiest way to avoid them.

Unlike in a normal editing process, we weren't working with a single story but with tens of thousands of them. It isn’t practical to read 52,000 narratives. Also, Narrative Science’s systems are more complex than simple boilerplate with interpolated variables. Editing one narrative does not mean you’ve edited them all. In addition to recasting whole paragraphs, their systems generate a variety of sentences to express the same kind of data, so that reading the narratives for several schools seems natural rather than automated. So edits that made sense in one case ended up not working in other cases, and sentences that seemed correct given one set of circumstances seemed wrong in others – often subtly.
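To make that last point concrete, here is a toy illustration in Python. It is not how Narrative Science's platform works (we don't have access to its internals); the sentence variants, school names and field names are all hypothetical. It only shows why a sentence that reads fine for one school can read wrong for another, and why fixing one variant does not fix the rest.

```python
import random

# Hypothetical sentence variants expressing the same fact.
# These are invented for illustration, not Narrative Science's templates.
VARIANTS = [
    "{name} offers {n} AP courses.",
    "Students at {name} can choose from {n} AP courses.",
    "There are {n} AP courses available at {name}.",
]


def render(school, rng):
    """Pick one variant at random and fill in the school's data."""
    template = rng.choice(VARIANTS)
    return template.format(name=school["name"], n=school["ap_courses"])


# Made-up example schools.
schools = [
    {"name": "Lincoln High", "ap_courses": 12},
    {"name": "Roosevelt High", "ap_courses": 1},
]

rng = random.Random(0)  # fixed seed so a run can be repeated
sentences = [render(school, rng) for school in schools]
for sentence in sentences:
    print(sentence)

# Every variant reads fine for 12 courses, but all three produce
# "1 AP courses" for a school with a single course. An editor who
# only ever saw Lincoln High would never notice, and repairing one
# variant would leave the other two untouched.
```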

We started getting larger samples of the generated narratives, and we pulled random samples to spot check – just as we do with any data project – looking for problems with agreement between the narrative and the graphic, and for any confusing wording generated by the algorithm. Of course, if you see any weirdness, let us know.
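A spot check along these lines might look like the following sketch. It is not our production code: the function names, the record layout and the example school are assumptions made for illustration. The idea is to draw a repeatable random sample of school IDs, then flag any number that appears in a narrative but not in the data behind the graphic, so a human can take a closer look.

```python
import random
import re


def sample_ids(school_ids, k, seed=2013):
    """Return a repeatable random sample of up to k school IDs."""
    rng = random.Random(seed)
    ids = sorted(school_ids)  # sort first so the sample does not depend on input order
    return rng.sample(ids, min(k, len(ids)))


def numbers_in(text):
    """Extract plain numbers such as 12 or 45.5 from a narrative.

    Deliberately simple: a number written with a thousands separator
    ("1,200") would be split in two, so treat results as hints only.
    """
    return {float(match) for match in re.findall(r"\d+(?:\.\d+)?", text)}


def unmatched_numbers(narrative, record):
    """Numbers mentioned in the narrative that are absent from the record."""
    known = {float(v) for v in record.values() if isinstance(v, (int, float))}
    return numbers_in(narrative) - known


# Hypothetical record and narrative with a transposed figure (54 vs. 45).
record = {"ap_courses": 12, "pct_free_lunch": 45.0}
narrative = (
    "Lincoln High offers 12 AP courses; 54 percent of students "
    "get free or reduced-price lunch."
)
print(unmatched_numbers(narrative, record))
```

A check like this only narrows down where to look. Confusing wording still has to be caught by a person reading the sampled narratives.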

Ultimately, Narrative Science delivered more than 52,000 narratives to us as JSON files over FTP, which we imported into our news application.
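For readers curious what an import step like that can look like, here is a minimal sketch in Python. The layout is an assumption for illustration only: it supposes one JSON file per school, each with a `school_id` and a `narrative` key, sitting in a local directory after the FTP transfer. The actual file format and our import code are not shown here.

```python
import json
from pathlib import Path


def load_narratives(directory):
    """Read every .json file in a directory into a dict keyed by school ID.

    Assumes (hypothetically) that each file holds one object with
    "school_id" and "narrative" keys.
    """
    narratives = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            doc = json.load(f)
        narratives[doc["school_id"]] = doc["narrative"]
    return narratives
```

From there, each narrative can be attached to the matching school record by ID, and any school without a narrative simply shows its page without one.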

Whatever the future of the news industry holds, it seems clear that it will involve trying lots of things. And from our vantage point, algorithmic story generation seems like an intriguing solution to the problem of scaling narrative journalism in the era of big data. It’s helped us tune our data journalism for a new audience, and helped create stories at scale that would have been unthinkable otherwise.

You mean that computers will generate coherent stories without human editing?  52,000?  I’m wondering about information overload and what are we (humans) going to be doing.  Is it okay to just turn this thing off and spend time with my granddaughters?

Karl, as I understand it, the idea is that no individual reader would care about more than, say, a dozen instances, but each instance is important to somebody.

The traditional market allocates the limited number of writers to the most popular spots and tells everybody who wants something more that it’s tough luck and they’ll have to hire their own analysts.

Systems like the one at Narrative Science allow those limited resources to cover everybody at some minimal level, which means that you can see whatever you find interesting, knowing that it’s there.

M.A. Pelletier

Jan. 25, 2013, 5:35 p.m.

If I understand this right, the data is incorporating everyone -talking small to giant- instead of a curve letting “less important” (read less common/popular) drop off the curve of weighing in? If so, I’m liking this a lot!

Do ongoing auditing and/or oversight mechanisms exist to ensure that the mechanically generated summaries are accurate?

A general question:
The Nerd Blog is great, but why so paginated?  Why can’t I see all your posts - titles, at least - on one page?
(marginally relevant: “Fragmenting data that should be seen together is a mistake”)

Elizabeth Farabee

Feb. 25, 2013, 12:13 p.m.

There is another solution on the market which is also able to make figures speak in natural language. Yseop does not require the customers to send the data out because the software can be installed locally on their PC. Sending data out can be a huge issue for some companies.
I actually work for them - and I’d love to explain the differences between the kinds of solution Narrative Science offers and what we do here at Yseop. Yseop is also able to write text in multiple languages—currently, this includes English, Spanish, French & German.

Roxsanne Small

Aug. 8, 2013, 10:38 a.m.

Please email me your explanation.

Elizabeth Farabee

Aug. 8, 2013, 10:41 a.m.

Hi Roxsanne,
What is your email address?
Thank you,

Roxsanne Small

Aug. 8, 2013, 10:44 a.m.

.(JavaScript must be enabled to view this email address) or call (518) 464-5307
