# How ProPublica’s Message Machine Reverse Engineers Political Microtargeting

ProPublica has been collecting political emails for a project we call the Message Machine, with the help of more than 600 readers who have shared demographic information with us and who have so far forwarded us more than 30,000 political emails. The Message Machine identifies and classifies many different emails as different variations of a single email blast, to highlight and analyze the ways in which campaigns use their sizable databases to customize their messages based on what they know about each recipient.

We’ve been publishing our collection for a few months, and today we are publishing the latest results of our analysis. The Machine uses techniques from Natural Language Processing and Machine Learning to analyze every email our readers forward to it.

## TF-IDF

In order to find the variations in language in each email that the Machine receives, the Machine relies on two simple concepts, one called TF-IDF and the other Cosine Similarity.

TF-IDF stands for “Term Frequency-Inverse Document Frequency,” and is a way to assign weights to individual words based on their importance in a collection of documents. It helps us discard common words like ‘the’ and ‘and’ at scale because they should not influence our measure of similarity between documents. TF-IDF is a workhorse of NLP, and it is straightforward to calculate.

First you calculate the Term Frequency for a word:

Basically what that says is for every word in a document calculate the number of times it appears, and then divide by the number of times the most common word appears in the document. That represents the term frequency of a particular word. And then you move on to the Inverse Document Frequency:

Calculating the Inverse Document Frequency of a word goes like this: take the number of documents you have and divide by the number of documents a given word appears in, and then take the logarithm of the result. Then you multiply the tf and idf together to get tfidf:

When you multiply TF and IDF together you get a score representing a value for the importance of the words in a document corrected by their importance in the collection of documents. The effect of this calculation is that rare words are more important than common ones in similarity calculations.

## Cosine Similarity

When the Machine receives an email it needs to find others that have similar content. There are many document similarity algorithms — two that I find most beautiful are the sim hash and min hash algorithms — but I chose a more standard measure called Cosine Similarity. Cosine Similarity is another important technique in NLP and is equally easy to calculate:

In concrete terms, Cosine Similarity measures the angle between the two vectors formed by each document’s words (technically, it is the angle between the two hyperplanes that the vectors represent). To calculate Cosine Similarity for two documents you take the dot product of each document’s word vector and divide that by the product of each vector’s magnitude. Because the TF-IDF scores in the documents are all positive, it will produce a number between zero and one. A pair of documents with a cosine similarity of one are identical, and a Cosine Similarity of zero would describe a pair of documents that have no words in common.

Every email blast in the Machine has one or more variations. The machine finds these variations by comparing every new email to the existing variations from each email blast, and if the new email isn’t similar to the existing variations, e.g. below a certain cosine similarity threshold, the Machine creates a new variation.

It is difficult to model these word vectors in a relational database, especially if you want to calculate TF-IDF and Cosine Similarity for the documents quickly. So the Message Machine uses a key-value store database that I built called Daybreak, which we will open source very soon.

A side benefit of storing the word indexes in a custom database is that we can use Cosine Similarity to power a front-facing search engine for the email collection. When a user searches the site, the Machine looks up the query words in our email indexes and returns an list of emails ordered by each email’s Cosine Similarity to the user’s query. The code is really very small.

## Decision Trees and Machine Learning

Once the Machine has clustered the emails and found the variations, it takes the demographics of the users who sent in the email and creates a Machine Learning model called a Decision Tree for every variation of each email &8212; except the most popular one. The Machine throws away the most popular email variation because, after reviewing the data, we found that, by and large, the most popular variation was the one with the least significant variations. In practical terms the most popular variation is likely either the “control” version, or the one sent to folks the campaigns do not know much about.

I decided to use decision trees in the Machine after speaking with Prof. David Karger at MIT. He advised that, like many Machine Learning algorithms, Decision Trees provide results with any amount of data and that the results get better with more examples. More importantly, Decision Trees give easily parsable rules of the classification structure instead of lists of numbers like other Machine Learning classifiers.

Incidentally, Karger is well known for the wonderful algorithm, Karger’s algorithm, that finds the minimum cut of a graph, and you should try to find some time to read up on it.

Decision Trees are useful at finding hidden models in a particular data set, particularly because they produce a human-readable tree of partitions of the data. For example, here is a representation of the Decision Tree model generated for a single Obama for America message:

```
if(donation signal > 69) {
if(state == 21) {
return cluster<19546>
} else {
return cluster<19548>
}
} else {
if(age < 44) {
if(donation signal > 64) {
return cluster<19548>
} else {
return cluster<19546>
}
} else {
return cluster<19545>
}
}
```

The decision tree algorithm is a very simple tree-building algorithm, in pseudocode:

```
build_tree(data)
return if data.length <= 1
for each attribute in data
calculate information gain on attribute
let @best = maximum information gain
split the data set based on @best into nodes @left and @right
build_tree(@left)
build_tree(@right)
```

The formula the Machine uses to decide the best split is a straight entropy calculation:

Where I is the information content of the value with attribute x at index i:

Where P is the probability that the attribute has that value across the data set.

Then the Machine calculates information gain for each split by looping over every possible value for an attribute and by calculating the difference between the entropy of the overall data set and the entropy of splitting the data on that value:

Where T denotes the data set for a node. The best split is the one that will produce the highest information gain. The split routine is a binary split function that for categorical variables splits based on whether or not a row’s attribute equals the condition, and for continuous variables, whether or not the attribute is greater or equal to the condition.

After building a tree based on these splits, the Machine prunes the tree by asking each node if it would be better if it were a leaf or if it were a branch by calculating a Laplace Error Estimate for each case. Because the splits may end up with rows that belong to many different email variations, this error estimate throws away decision rules — by converting a branch into a leaf — that were assigned to too many different email variations. Luckily, it is pretty simple to calculate:

In the above formula N is the number of examples in a node, k is the number of classes — e.g. email variations — and n is the number of examples belonging to the most popular class.

In order to calculate the accuracy of the models, the Machine holds one example from each email variation back, and then it uses those to test the decision tree's accuracy using a standard calculation called the F Score. An F Score is calculated by counting the true positives (results that returned the right variation), false negatives (results that did not return the right variation) and false positives (results that returned the wrong variation), and then calculating precision, the fraction of returned results that are relevant:

and recall, the fraction of relevant instances retrieved:

and finally the F Score for the model:

Which will return a number between 0 and 1. The machine only assumes that models with an F Score higher than 0.90 are accurate models.

Once the Machine finds an accurate model, it does a Depth First Search of the tree and collects the demographic factors that produced each split. That set of factors is what the Machine presents on individual email pages.

A final note about the demographic information the Machine uses to create the decision trees: the Machine uses basic demographic information like age, sex, household income, state, and whether the user is a likely voter. It also includes an extra bit of information we’re calling a “donation signal” that represents the average previous donation amount asked for by each campaign by user. We infer the "donation signal" from the emails themselves instead of relying on data from our demographic questionnaire. Many Obama for America emails contain a “quick donate” box that asks for more money when a particular constituent has saves his credit card information. If an email has that box we assume the user is a previous donor. Also, some emails explicitly say the recipient is a previous donor.

This extra bit of information is actually the most common predictor of which email a particular person will get, and makes our models much stronger.

If you’d like to see the Message Machine in action, and see how political campaigns are targeting you, please forward political emails you receive to **emails@messagemachine.propublica.org**

# Jeff Larson

Jeff Larson is the Data Editor at ProPublica.

## 6 comments

Christen McCurdy

Oct. 18, 2012, 4 p.m.

The “results of our analysis” link in this story doesn’t work.

Jeff LarsonProPublica

Oct. 18, 2012, 4:31 p.m.

Thanks! Fixed that, it goes here now:

http://www.propublica.org/article/message-machine-starts-providing-answers

John

Oct. 18, 2012, 4:45 p.m.

Hm. Is there a reason you went with a tree-based approach rather than something more like clustering around centroids? I’m guessing it’s equal parts speed and the “low startup threshold” you mention, but it seems like the centroid-based approaches would more easily scale to other dimensions like demographic information about the recipients and dates of arrival.

As an aside, I’ve had to teach a few database courses over the years and could never quite find the way to present vector databases other than handwaving them. I’ve never seen anybody attack it from the similarity standpoint, and that focuses the idea a lot better, I think.

(I’m also completely ashamed to say that I had to invent something a lot like TF-IDF for a toy project at the day job. A shame that’s not more popularized.)

Jeff LarsonProPublica

Oct. 18, 2012, 4:50 p.m.

Well, decision trees give you a list of factors that contribute to the model, and honestly it was a lot simpler than running SVD or a Regression model, where it would be harder to single out how the factors interacted because they’d be assigned scores. We are using a very simple clustering algorithm to find the variations for each email, but that only looks at the word vectors. I played with adding time into the initial clustering, but it didn’t seem to matter much.

John

Oct. 19, 2012, 9:32 a.m.

That makes perfect sense. I often find myself managing large masses of data, but don’t often have the opportunity to go through analysis, so it’s definitely interesting to see how people with the opportunity handle it.

Interesting that timing didn’t have an effect. I would have expected it to show some “fine tuning” or divergence/convergence in message, based on the campaign. In itself, that lack of a correlation strikes me as interesting. Gives a sense that maybe they’re not as sophisticated as they’re giving themselves credit for.

clarence swinney

Oct. 21, 2012, 9:14 a.m.

INEQUALITY YES expanded repeat but oh oh so important for the Middle Class

Net Wealth-2010

1%=35%

5%=635

10%=77%

80%=11%

Financial Wealth

1%=42%

5%=72%

10%=85%

80%=5%

OECD rank—4th on inequality—3rd as Least Taxed—2nd as Least tax on corporations

Reagan began redistribution upwards with his 60% tax cut for top incomes.

Ask any friend to guess how much we pay in federal-state-local taxes as a percent of gdp.

No one I asked ever said under 50%. It is 27%. 27%.

In Fiscal 2012 federal tax revenue was 2450Billion. We borrowed 1100 Billion.

Our national income is 14,000 Billion. 2450 is a Tax Rate of 17.6%

Since 1980, we borrowed 15,000 Billion which let the rich off the hook to retain and grow their wealth.

They pay most of individual income tax but little of income in payroll tax. The big question is how much of Total Income do they pay not the Adjusted Gross Income which has their deductions.

Mitt Romney is proposing a 20% tax cut which means more borrowing like Bush who gave top 5% 48% of his tax cuts and borrowed 6100 Billion(5800 on 9-30-1 to 11.900 on 9-30-09) or doubled our debt in 8 years.

We have the Income to balance our budget but we must tax the top10% to do it.

Clarence Swinney political historian Lifeaholics of America—burlington nc

author-Lifeaholicâ€”Success by working for a Life not just a Living.