How To Edit 52,000 Stories at Once

by Scott Klein

January 24, 2013, 10:19 am

Today we’re launching an
update to our Opportunity Gap news application, with two new data points, better
design on smartphones and integration with the Foursquare
social network.

Also launching with this
new release are short narrative descriptions of almost all of the more than 52,000
schools in our database, generated algorithmically by Narrative Science, a
startup based in Chicago. Narrative Science launched in 2010 out of a research
project at Northwestern University. Their platform uses artificial intelligence
to turn structured data into human-readable narratives. For example, Forbes
uses their technology to turn quarterly corporate earnings reports into short narratives.

We were introduced to
Narrative Science by the MacArthur Foundation, which has supported ProPublica
since our inception in 2007. In our initial conversations with the team at
Narrative Science, we realized that ProPublica’s data team and Narrative
Science share a common goal: to make data tell stories. We also had a
hypothesis, which is that adding narratives to each school page would provide an
easier way for people who learn verbally rather than visually to understand the
data.

Here’s how it worked, and
what you might expect if you’re a news organization looking to experiment with
algorithmically generated stories.

The project started with a
call with Narrative Science to talk about what was important in the data. We
told them that the “nut” of our news application was that all too often, school
districts and states don’t distribute educational opportunities to rich and
poor kids equally.

We sent Narrative
Science’s engineers a raw data set and walked them through the field layout and
data gotchas.

Within a few weeks we started
trading drafts – about a half dozen at a time – in the form of sample
narratives. We homed in until the narratives clearly expressed what the
often-complex data meant. In addition to making sure the data was right and that
we described it consistently, we wanted the copy to tell the same story as
our interactive pages did – a story not just about individual schools but
how each school related to other schools in the state with different poverty
levels. In that way, we worked with the engineers at
Narrative Science much as an editor and reporter do.

We also had a few style
issues, nothing that wouldn’t be familiar to editors and reporters everywhere. Smaller
edits ranged from applying AP style to pruning redundant clauses.

One thing we learned was
that mentioning a data point in a narrative made it seem much more important
than simply including it on the page of an interactive database, so we spent
time picking the right variables to “promote” to the narrative.

On to some of the more
technical stuff: In addition to the data that’s visible on each page in the
interactive database, our data includes some numbers used behind the scenes,
such as calculated fields and an explicit pairwise relationship between similar
schools. We found that it was easier to send those data points to Narrative
Science than to have them redo the often-complex calculations. Small
differences between the narrative description and the graphic were cropping up,
and sending our own numbers was the easiest way to avoid them.
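
For readers who want to try something similar, here is a minimal sketch of
the idea in Python: compute each derived number once, in one place, and ship
it alongside the raw fields so the graphic and the narrative cannot drift
apart. The field names (school_id, enrollment, advanced_enrollment,
similar_school_id) and the CSV layout are assumptions made up for
illustration, not the actual format we exchanged with Narrative Science.

```python
import csv
import io

# Hypothetical column layout for the export; not the real file format.
FIELDS = [
    "school_id",
    "enrollment",
    "advanced_enrollment",
    "advanced_pct",
    "similar_school_id",
]


def pct(part, whole):
    """Return a percentage rounded once, or None when it cannot be computed."""
    if not whole:
        return None
    return round(100.0 * part / whole, 1)


def build_export_rows(schools):
    """Attach precomputed fields so downstream consumers never recalculate."""
    rows = []
    for school in schools:
        rows.append({
            "school_id": school["school_id"],
            "enrollment": school["enrollment"],
            "advanced_enrollment": school["advanced_enrollment"],
            # Calculated here, once, and reused by both graphic and narrative.
            "advanced_pct": pct(school["advanced_enrollment"],
                                school["enrollment"]),
            # Explicit pairing with a comparable school, if one was assigned.
            "similar_school_id": school.get("similar_school_id"),
        })
    return rows


def to_csv(rows):
    """Serialize the export rows to CSV text with a fixed header order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The point of the design is that rounding and comparison logic live in a
single function, so a figure like a percentage reads the same wherever it
appears.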

Unlike in a normal editing process,
we weren’t just working with a single story but tens of thousands of them. It
isn’t practical to read 52,000 narratives. Also, Narrative Science’s systems
are more complex than simple boilerplate with interpolated variables. Editing
one narrative does not mean you’ve edited them all. In addition to recasting whole
paragraphs, their systems generate a variety of sentences to express the same kind
of data, so that reading the narratives for several schools would seem more
natural and not automated. So edits that made sense in one case ended up not
working in other cases, and sentences that seemed correct given one set of
circumstances seemed wrong in others – often subtly.

We started getting larger samples
of the generated narratives and we pulled random samples to spot check –
just as we do with any data project – looking for problems with agreement
between the narrative and the graphic, and for any confusing wording generated
by the algorithm. Of course, if you see any weirdness – let us
know.
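
A spot check like this can be partly mechanized. The sketch below, in
Python, draws a reproducible random sample of schools and flags any number
that appears in a narrative but not in that school's data record. It is an
illustration under stated assumptions, not the tooling we used: the record
layout is hypothetical, and a flagged number is only a prompt for a human
to read the narrative, since a figure can be legitimately rounded or come
from a comparison school.

```python
import random
import re

# Matches numbers such as 12, 12.3, 1,200 or 1,200.5 in running text.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def sample_ids(school_ids, k, seed=None):
    """Pick up to k distinct school IDs; a fixed seed makes the draw repeatable."""
    ids = list(school_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def numbers_in(text):
    """Extract every number mentioned in the text as a float."""
    return {float(match.replace(",", "")) for match in NUMBER.findall(text)}


def unmatched_numbers(narrative, record):
    """Return numbers in the narrative that match no numeric field in the record."""
    expected = {
        float(value)
        for value in record.values()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    return numbers_in(narrative) - expected
```

An empty result means every number in the narrative also appears in the
record. It says nothing about confusing wording, which still takes a reader.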

Ultimately, Narrative
Science delivered more than 52,000 narratives to us as JSON files over FTP,
which we imported into our news application.
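
As a rough sketch of what an import step like this can look like, the
Python below reads a directory of already-downloaded JSON files and builds a
lookup from school ID to narrative text. The one-file-per-school layout and
the keys school_id and narrative are assumptions for illustration; the
actual schema of the delivered files is not described here. Files that fail
to parse or lack a key are collected rather than stopping the whole run,
which matters when there are tens of thousands of them.

```python
import json
from pathlib import Path


def load_narratives(directory):
    """Load *.json files into {school_id: narrative}, collecting failures."""
    narratives = {}
    errors = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                doc = json.load(handle)
            # Hypothetical keys; adjust to whatever the delivered files contain.
            narratives[str(doc["school_id"])] = doc["narrative"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Record the file name and reason so bad files can be reviewed later.
            errors.append((path.name, repr(exc)))
    return narratives, errors
```

Comparing the number of loaded narratives and the error list against the
number of schools in the database is a quick way to confirm nothing was
dropped in transit.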

Whatever the future of the
news industry holds, it seems clear that it will involve trying lots of things.
And from our vantage point, algorithmic story generation seems like an
intriguing solution to the problem of scaling narrative journalism in the era
of big data. It’s helped us tune our data journalism for a new audience, and
helped create stories at scale that would have been unthinkable otherwise.