So you want to explore the world through data. But how do you actually *do* it?
Hadley Wickham is a leading developer of open source tools for data science and works as the chief scientist at RStudio. We talked with him about interrogating data, what stories might be hiding in the gaps and how bears can really mess things up. What follows is a transcript of our talk, edited for clarity and length.
ProPublica: You’ve talked about the way data visualization can help the process of exploratory data analysis. How would you say this applies to data journalism?
Wickham: I’m not sure whether I should have the answers or you should have the answers! I think the question is: How much of data journalism is reporting the data that you have versus finding the data that you don’t have ... but you should have ... or want to have ... that would tell the really interesting story.
I help teach a data science class at Stanford, and I was just looking through this dataset on emergency room visits in the United States. There is a sample of every emergency visit from like 2013 to 2017 ... and then there’s this really short narrative, a one-sentence description of what caused the accident.
I think that’s a fascinating dataset because there are so many stories in it. I look at the dataset every year, and each time I try and pull out a little different story. This year, I decided to look at knife-related injuries, and there are massive spikes on Memorial Day, Fourth of July, Thanksgiving, Christmas Day and New Year’s.
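As a minimal sketch of the kind of exploration described here, the snippet below pools visit dates by calendar day and flags days whose counts sit well above the typical day. The records, the `daily_counts` and `spike_days` helpers and the `factor` threshold are all made up for illustration; nothing here comes from the actual emergency room dataset.

```python
from collections import Counter
from datetime import date
from statistics import median


def daily_counts(visit_dates):
    """Count visits per calendar day (month, day), pooled across years."""
    return Counter((d.month, d.day) for d in visit_dates)


def spike_days(counts, factor=2.0):
    """Return days whose count is at least `factor` times the median daily count."""
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n >= factor * typical)


# Hypothetical knife-injury visit dates (invented, not real data).
visits = (
    [date(2015, 7, 4)] * 9
    + [date(2016, 7, 4)] * 8
    + [date(2015, 3, 10)] * 2
    + [date(2016, 3, 11)] * 3
    + [date(2015, 9, 2)] * 2
    + [date(2016, 12, 25)] * 7
)

counts = daily_counts(visits)
print(spike_days(counts))  # [(7, 4), (12, 25)]
```

A median-based threshold is only one possible choice; a real analysis would likely also account for day of week and for how the sample was drawn.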
As a generalist you want to turn that into a story, and there are so many questions you can ask. That kind of exploration is really a warmup. If you’re more of an investigative data journalist, you’re also looking for the data that isn’t there. You’ve got to force yourself to think, well, what should I be seeing that I’m not?
ProPublica: What’s a tip for someone who thinks that they have found something that isn’t there? What’s the next step that you take when you have that intuition?
Wickham: This is one of the things I learned from going to NICAR, which is completely unnatural to me, and that’s picking up a phone and talking to someone. Which I would never do. There is no situation in my life in which I would ever do that unless it’s a life-threatening emergency.
But, I think that’s when you need to just start talking to people. I remember one little anecdote. I was helping a biology student analyze their field work data, and I was looking at where they collected data over time.
And one year they had no data for a given field. And so I go talk to them. And I was like: “Well, why is that? This is really weird.”
And they’re like, well, there was a bear in the field that year. And so we couldn’t collect any data.
But kind of an interesting story, right?
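The bear anecdote suggests a simple check, sketched here under stated assumptions: list every combination of field and year you expect to see, then report the ones with no records at all. The record layout, the field names and the `missing_combinations` helper are hypothetical. The code can only point to the gap; explaining it still takes a conversation.

```python
from itertools import product


def missing_combinations(records, fields, years):
    """Return (field, year) pairs that were expected but have no records."""
    seen = {(r["field"], r["year"]) for r in records}
    return sorted(pair for pair in product(fields, years) if pair not in seen)


# Hypothetical field-work records (invented for illustration).
records = [
    {"field": "north", "year": 2010, "count": 12},
    {"field": "north", "year": 2011, "count": 15},
    {"field": "north", "year": 2012, "count": 11},
    {"field": "south", "year": 2010, "count": 9},
    {"field": "south", "year": 2012, "count": 14},
]

gaps = missing_combinations(records, fields=["north", "south"], years=range(2010, 2013))
print(gaps)  # [('south', 2011)]
```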
ProPublica: What advice would you have for editors who are managing or collaborating with highly technical people in a journalism environment but who may not share the same skill set? How can they be effective?
Wickham: Learn a little bit of R and basic data analysis skills. You don’t have to be an expert; you don’t have to work with particularly large datasets. It’s a matter of finding something in your own life that’s interesting that you want to dig into.
One [recent example]: I noticed on the account from my yoga class, there was a page that has every single yoga class that I had ever taken.
And so I thought it would be kind of fun to take a look at that. See how things change over time. Everyone has little things like that. You’ve got a Google Sheet of information about your neighbors, or your baby, or your cat, or whatever. Just find something in life where you have data that you’re interested in. Just so you’ve got that little bit of visceral experience of working with data.
The other challenge is: When you’re really good at something, you make it look easy. And then people who don’t know so much are like: “Wow, that looks really easy. It must have taken you 30 minutes to scrape those 15,000 Excel spreadsheets of varying different formats.”
It sounds a little weird, but it’s like juggling. If you’re really, really, really good at juggling, you just make it look easy, and people are like: “Oh well. That’s easy. I can juggle eight balls at a time.” And so jugglers deliberately build mistakes into their acts. I’m not saying that’s a good idea for data science, but you’ve taken this very hard problem, broken it down into several pieces, made the whole thing look easy. How do you also convey that this is something you had to spend a huge amount of time on? It looks easy now, because I’ve spent so much time on it, not because it was a simple problem.
Data cleaning is hard because it always takes longer than you expect. And it’s really, really difficult to predict in advance where the problems are going to lie. At the same time, that’s where you get the value and can do stuff that no one has done before. The easy, clean dataset has already been analyzed to death. If you want something that’s unique and really interesting, you’ve got to dig for it.
ProPublica: During that data cleaning process, is that where the journalist comes out? When you’re cleaning up the data but you’re also getting to know it better and you’re figuring out the questions and the gaps?
Wickham: Yeah, absolutely. That’s one of the things that really irritates me. I think it’s easy to go from “data cleaning” to “Well, you’ve got a data cleaning problem, you should hire a data janitor to take care of it.” And it’s not this “janitorial” thing. Actually cleaning your data is when you’re getting to know it intimately. That’s not something you can hand off to someone else. It’s an absolutely critical part of the data science process.
ProPublica: The perennial question. What makes R an effective environment for data analysis and visualization? What does it offer over other tool sets and platforms?
Wickham: The first question you should ask yourself is: Do I want to use something point and clicky, or do I want to use a programming language? It basically comes down to how much time do you spend? Like, if you’re doing data analysis every day, the time it takes to learn a programming language pays off pretty quickly because you can automate more and more of what you do.
So, I think the main competitors are R and Python for all data science work. Obviously, I am tremendously biased because I really love R. Python is awesome, too. But I think the reason that you can start with R is because in R you can learn how to do data science and then you can learn how to program, whereas in Python you’ve got to learn programming and data science simultaneously.
R is kind of a bit of a weird creature as a programming language, but one of the advantages is that you can get some basic templates that you copy and paste. You don’t have to learn what a function is, exactly. You don’t have to learn any programming language jargon. You can just kind of dive in. Whereas with Python you’re gonna learn a little bit more that’s just programming.
ProPublica: It’s true. I’ve tried to make some plots in Python and it was not pretty.
Wickham: On every team I talk to, there are people using R, and there are people using Python, and it’s really important to help those people work together. It’s not a war or a competition. People use different tools for different purposes. I think that is very important, and one project to that end is this thing called Apache Arrow, which Wes [McKinney] has been working on through this new organization called Ursa.
Basically, the idea of Apache Arrow is just to sit down and really think, “What is the best way to store data-science-type data in memory?” Let’s figure that out. And then once we’ve figured it out, let’s build a bunch of shared infrastructure. So Python can store the data in the same way. R can store the data in the same way. Java can store the data in the same way. And then you can see, and mostly use, the same data in any programming language. So you’re not popping it about all the time.
ProPublica: Do you think journalists risk making erroneous assumptions about the accuracy of data or drawing inappropriate conclusions, such as mistaking correlation for causation?
Wickham: One of the challenges of data is that if you can quantify something precisely, people interpret it as being more “truthy.” If you’ve got five decimal places of accuracy, people are more likely to just kind of “believe it” instead of questioning it. A lot of people forget that pretty much every dataset is collected by a person, or there are many people involved. And if you ignore that, your conclusions are going to possibly be fantastically wrong.
I was judging a data science poster competition, and one of the posters was about food safety and food inspection reports. And I ... and this probably says something profound about me ... but I immediately think: “Are there inspectors who are taking bribes, and if there were, how would you spot that from the data?”
You shouldn’t trust the data until you’ve proven that it is trustworthy. Until you’ve got another independent way of backing it up, or you’ve asked the same question three different ways and you get the same answer three different times. Then you should feel like the data is trustworthy. But until you’ve understood the process by which the data has been collected and gathered ... I think you should be very skeptical. Your default position should be skepticism.
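One way to read "ask the same question three different ways" is sketched below, assuming a hypothetical table of inspection results. The failure rate is computed directly, as the complement of the pass rate, and by summing per-inspector tallies. On clean data the three agree; a single record with a missing result makes them diverge, which is a prompt to go find out how the data was collected. All names and rows are invented for illustration, and agreement among three calculations on the same table is weaker evidence than an independent source.

```python
from collections import defaultdict
from math import isclose


def fail_rate_direct(rows):
    """Share of rows explicitly marked as failures."""
    return sum(r["result"] == "fail" for r in rows) / len(rows)


def fail_rate_from_passes(rows):
    """One minus the share of rows explicitly marked as passes."""
    return 1 - sum(r["result"] == "pass" for r in rows) / len(rows)


def fail_rate_by_inspector(rows):
    """Tally failures and totals per inspector, then combine the tallies."""
    fails, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["inspector"]] += 1
        fails[r["inspector"]] += r["result"] == "fail"
    return sum(fails.values()) / sum(totals.values())


def answers_agree(rows):
    """True only if all three calculations give the same rate."""
    a = fail_rate_direct(rows)
    b = fail_rate_from_passes(rows)
    c = fail_rate_by_inspector(rows)
    return isclose(a, b) and isclose(b, c)


# Hypothetical inspection records (invented for illustration).
clean = [
    {"inspector": "A", "result": "pass"},
    {"inspector": "A", "result": "fail"},
    {"inspector": "B", "result": "pass"},
    {"inspector": "B", "result": "pass"},
]
# Same records plus one inspection whose result was never recorded.
messy = clean + [{"inspector": "B", "result": None}]

print(answers_agree(clean))  # True
print(answers_agree(messy))  # False
```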
ProPublica: That’s a good fit for us.