This story was co-published with Data Journalism China.

Like a lot of my classmates at Columbia, I had been following the work of ProPublica's news applications desk since I started learning data journalism. So when I received a call from ProPublica's news applications editor, Scott Klein, telling me I was going to be their summer Google Journalism Fellow for 2014, I could not believe that I was going to work with his team.

Though my fellowship ended at the end of last summer, I stayed on in a different fellowship. Working here these last seven months has allowed me to better understand how to build news applications and the methodologies behind them. Here are a few things I learned in the last few months.

Data Collection and Data Analysis

ProPublica's news applications team produces at least 12 large-scale interactive applications a year. In most cases, data collection and data analysis are the most time-consuming parts of a project.

Web scraping is one of the major data sources for ProPublica's projects. Scraping and cleaning data can take a while, but it can be fun when it works. Take Dollar for Docs, for example. It takes a news developer here more than half a year to scrape and check all the data and to update the application. This nerd blog post explains in detail how ProPublica extracted data from a variety of not-so-friendly sources.

The FOIA (Freedom of Information Act) request is another major data source for ProPublica. ProPublica, like other investigative newsrooms, has a complex relationship with FOIA requests. They are one of the main ways to get data that the government is reluctant to disclose, but the process can be long and painful. Every federal agency has slightly different rules, and each state can have very different laws about what must be released under FOIA, and under what conditions. Not every state is like Louisiana, which requires government agencies to get back to FOIA requestors within three days. FOIA requests in general take months to process. If you hear happy cheers from a corner of the newsroom, it is very possible that a FOIA package has just arrived. But receiving the parcel does not mean that you can roll up your sleeves and start your analysis. On good days, FOIA responses come in a structured format like Excel or CSV. On bad days, they come as scanned PDFs or sometimes even on paper. Sometimes the government agencies make excuses: the database is too big to export; your request is too complicated and is going to take three or more years to process; we need to hire someone to process your request, which will cost you $3,000; and so on. Under these circumstances, the reporter has to negotiate (or even fight) for the dataset.

Other than web scraping and FOIA requests, ProPublica also collects its own data. For China's Memory Hole, ProPublica built a database and collected images deleted from Weibo, China's version of Twitter, and effectively showed which topics had the most images deleted.

Reporters here believe that statistics should play a bigger role in storytelling. Everyone gets a copy of "Numbers in the Newsroom," a classic guide by Sarah Cohen. R is the primary data analysis tool at ProPublica, but people here also use Ruby and Python for data cleaning and analysis. Pro tip: don't edit a dataset by hand. Always try to write a script to wrangle it, so that you will not only avoid manual mistakes but also be able to replicate exactly what you did when the original dataset is updated. Even better, build a Rails app to clean and analyze the dataset, even when you are not building an interactive project.
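To make the "script it, don't touch it" tip concrete, here is a minimal sketch in Python. It is only an illustration, not ProPublica's actual code: the sample rows, the column names (doctor_name, state, amount) and the clean_rows function are all made up for this example. The raw data is never modified; every fix lives in the script, so the same cleaning can be rerun whenever an updated file arrives.

```python
import csv
import io

# Made-up sample standing in for a raw export; the column names are hypothetical.
RAW_CSV = """doctor_name,state,amount
 Jane Doe ,ny,"$1,200.50"
JOHN ROE,CA,$300
 Jane Doe ,ny,"$1,200.50"
"""

FIELDS = ("doctor_name", "state", "amount")


def clean_rows(raw_text):
    """Return cleaned, de-duplicated rows without modifying the raw input."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        record = (
            row["doctor_name"].strip().title(),  # trim whitespace, normalize case
            row["state"].strip().upper(),        # state codes in upper case
            float(row["amount"].replace("$", "").replace(",", "")),  # "$1,200.50" -> 1200.5
        )
        if record in seen:  # skip rows identical to one already kept
            continue
        seen.add(record)
        cleaned.append(dict(zip(FIELDS, record)))
    return cleaned


if __name__ == "__main__":
    for cleaned_row in clean_rows(RAW_CSV):
        print(cleaned_row)
```

In a real project the script would read the raw file from disk and write the cleaned rows to a separate file, leaving the original untouched.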

News Application and Interactive Graphic Design

I read ProPublica's Nerd Guides before I started here, and I highly recommend them. They help you avoid common mistakes in news application and data visualization design: stay away from pie charts unless you are showing the relationship between a part and the whole; use scatterplots to show correlation between two variables and line charts for continuous variables; avoid 3-D charts and donut charts at all costs; and so on. They also include a general guide for how to structure a news application, a coding manifesto, and a data bulletproofing guide.

News applications here follow a certain structure that helps readers understand the story. Every news application is designed with a far view and a near view. The far view provides context for the data, telling the story at the highest level. The near view personalizes the story, enabling people to position themselves in it and to find out why they should care about it. People can look up their city, their school district, their doctors, their health plan, etc. More detailed guides on news application design can also be found in the Nerd Guides.

But good design requires considerations that go beyond basic principles. Almost every news application or interactive graphic here at ProPublica goes through the "design - demo - redesign" cycle. Sometimes the whole team gathers in the conference room to brainstorm about a news application and come up with better ideas for data visualization or UI design. That is where I learned the most about the nuances that separate more effective data representations from less effective ones.

When I started at ProPublica, putting together a bunch of code and seeing it work was so exciting that I never wanted to show people my "dirty laundry." As I started working with other developers in the newsroom, I soon found out that I had to. They kindly pointed out my bad coding habits and told me how to change them: follow the naming conventions of whichever language you are writing in; indent your code to make it easier to read and debug; clearly define the scope of variables instead of using global variables everywhere; create functions and classes when you find you are repeating yourself; try adopting a prototype-based programming style; and avoid z-index, but if you have to use it, don't go crazy and set it to 10000. In a newsroom, the deadline is a higher priority than coding style. But getting into good coding habits clearly helped. Good habits are crucial to making sure that the code works efficiently and that it is easy to collaborate with other developers.
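Two of those habits, keeping variables in a clear scope and writing a function when you catch yourself repeating code, can be sketched in a few lines of Python. This is just an illustration with made-up numbers and hypothetical function names (average, summarize), not code from any ProPublica project.

```python
def average(values):
    """Return the mean of a non-empty list of numbers."""
    total = 0  # local to this function, not a module-level global
    for value in values:
        total += value
    return total / len(values)


def summarize(name, values):
    """Build one summary line; written once and reused instead of copy-pasted."""
    return f"{name}: n={len(values)}, mean={average(values):.1f}"


if __name__ == "__main__":
    payments_by_state = {"NY": [100, 300], "CA": [50, 150, 250]}  # made-up numbers
    for state, payments in payments_by_state.items():
        print(summarize(state, payments))
```

Because total lives inside average, nothing outside the function can change it by accident, and a fix to the summary format only has to be made in one place.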

Venturing Into Uncharted Territory

Another bonus of working at ProPublica is watching people experiment with new technologies. Satellite imagery processing and taking pictures with balloons don't happen every day in this newsroom, but when they do, it's really exciting. This blog post explains how ProPublica used satellite imagery and aerial imagery to tell the story of Louisiana's land loss and land gain.

Losing Ground is a huge news application with rich content. To make sure that users would be able to navigate through the app without missing the most important information, ProPublica conducted its very first formal user test on the news application. ProPublica recruited five users randomly from Twitter and had a 30-minute one-on-one test with each of them. Through a shared screen, the designer was able to see whether users followed the path the designer had in mind. The users played with the app at will and told the designer which parts looked confusing. The user test went well, and the designers were able to redesign parts of the app according to its findings.

For smaller projects on a tight deadline, ProPublica asks for help within the newsroom. Sometimes the news application developer asks reporters from the other side of the newsroom to sit down and play with the app. Sometimes the developer asks visitors in the newsroom to provide suggestions. The idea is to get someone who is totally unfamiliar with the subject to comment on whether the news app conveys the information loud and clear. These semi-user tests are not perfect substitutes for the real ones. It is still on ProPublica's wish list to conduct real user tests for as many news apps as possible.

The last few months have been an exciting and rewarding adventure for me. I got hands-on experience collecting raw data, designing visualizations and building Rails apps. If you are excited about the ideas above and want to explore them yourself, you should apply to ProPublica's Google Journalism Fellowship.