Making Sense of Messy Data

I used to work as a sound mixer on film sets, noticing any hums and beeps that would make an actor’s performance useless after a long day’s work. I could take care of the noisiness in the moment, before it became an issue for postproduction editors.

Now as a data analyst, I only get to notice the distracting hums and beeps in the data afterward. I usually get no say in what questions are asked to generate the datasets I work with; answers to surveys or administrative forms are already complete.

To add to that challenge, when building a national dataset across several states, chances are there will be dissonance in how the data is collected from state to state, making it even more complicated to draw meaning from compiled datasets.

The Associated Press recently added a comprehensive dataset on medical marijuana registry programs across the U.S. to the ProPublica Data Store. Since a national dataset did not exist, we collected the data from each state through records requests, program reports and department documents.

One question we sought to answer with that data: why people wanted a medical marijuana card in the first place.

The answers came in many different formats, in some cases with a single response question, in others with a multiple response question. It’s the difference between “check one” and “check all.”

When someone answers a single response question, they are choosing what they think is the most important and relevant answer. This may be an accurate assessment of the situation — or an oversimplified take on the question.

When someone is given the chance to choose one or more responses, they are choosing all they think is relevant and important, and in no particular order. If you have four response choices, you may have to split the data into up to 16 separate groups to cover each combination. Or you may be given a summary table with the results for each option without any information on how they combine.
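To make the arithmetic concrete, here is a minimal Python sketch. The four condition labels and the four respondent answers are made up for illustration and are not taken from the AP dataset. It counts every possible combination of four choices, then shows how a per-option summary table loses the information on how answers combine.

```python
from collections import Counter
from itertools import combinations

# Hypothetical condition labels, for illustration only.
conditions = ["cancer", "epilepsy", "nausea", "PTSD"]

# Every subset of the four choices, including the empty one (nothing checked).
subsets = [
    combo
    for size in range(len(conditions) + 1)
    for combo in combinations(conditions, size)
]
print(len(subsets))  # 16, which is 2 ** 4

# Made-up respondent-level answers to a "check all" question.
responses = [
    {"cancer", "nausea"},
    {"PTSD"},
    {"PTSD", "nausea"},
    {"epilepsy"},
]

# A per-option summary table: one count per condition.
summary = Counter(c for answer in responses for c in answer)
print(dict(summary))

# The counts add up to more than the number of respondents, and nothing in
# the summary says which conditions were checked together.
print(sum(summary.values()), "checks from", len(responses), "respondents")
```

In this toy example the summary shows six checks from four people. That is why percentages from a "check all" summary table can add up to more than 100, and why they cannot be lined up directly against "check one" results.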

In the medical marijuana data, some states have 10 or more qualifying conditions, from cancer and epilepsy to nausea and post-traumatic stress disorder. Of the 16 states where data on qualifying conditions is available, 13 allow for multiple responses. And of those, three states even shifted from collecting single to multiple responses over the years.

This makes it nearly impossible to compare across states when given only summary tables.

So, what can we do?

One tip is to compare states that have similar types of questionnaires: single response with single response, multiple with multiple. We used this approach when looking into the numbers for patients reporting PTSD as a qualifying condition. We found that half of all patients in New Mexico use medical marijuana to treat PTSD, and the numbers do not seem to be inflated by the method of data collection. New Mexico asks for a single qualifying condition, yet the proportion of patients reporting PTSD as their main ailment is two to three times the proportion among patients in other states who could report multiple conditions.
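One way to keep like with like is to tag each state with its questionnaire type and compute shares only within each group. The sketch below is a rough illustration under stated assumptions: the state names, counts and the `ptsd_share_by_type` helper are all hypothetical, not figures or code from the AP analysis.

```python
from collections import defaultdict

# Made-up records: (state, question type, PTSD patients, all patients).
records = [
    ("State A", "single", 50, 100),
    ("State B", "single", 20, 100),
    ("State C", "multiple", 30, 100),
    ("State D", "multiple", 45, 150),
]


def ptsd_share_by_type(rows):
    """Group states by question type and compute each state's PTSD share."""
    grouped = defaultdict(dict)
    for state, question_type, ptsd, total in rows:
        grouped[question_type][state] = ptsd / total
    return dict(grouped)


shares = ptsd_share_by_type(records)
for question_type, by_state in shares.items():
    # Compare states only against others that asked the question the same way.
    print(question_type, by_state)
```

Keep in mind that the two groups measure different things. In a single-response state the share counts patients who named PTSD as their one condition, while in a multiple-response state it counts patients who listed PTSD among any number of conditions.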

Using data from the 13 states that allow multiple responses, we found that when states expand their medical markets to include PTSD, registry numbers ramp up and the proportion of patients reporting PTSD increases at a quick pace. The data didn’t enable us to get one single clean statistic, but it still made it possible for us to better understand how people used medical marijuana.

Get the data (with a description of the caveats you’ll need to keep in mind when working with it) for your own analysis from the ProPublica Data Store.
