If you asked congressional experts what legislative subjects, say, Sen. Patty Murray of Washington specializes in, they’d have a few pretty good guesses: maybe education and health care, because she’s the ranking member on a key committee that oversees those issues. If you asked who else in the Senate shares her interests, you might hear Sen. Michael Bennet of Colorado. Why? Because he is a former school superintendent and a member of that same committee.
You could ask them the same question about more members of Congress, but before you got through all 535 lawmakers, they’d probably hang up on you.
But what if we could teach a computer what specific topics are distinctive to each member? We did just that. We trained a computer model to extract what phrases a Congress member uses more than the rest, using hundreds of thousands of press releases from 2015 to the present.
We hope this addition to Represent’s member pages will give constituents new insight into what the people who work in their names specialize in, whether it’s hot-button national issues or local happenings.
Many of the results are intuitive: Rep. Jared Polis, a Colorado Democrat who is known as a civil libertarian, has “email privacy” as a topic; the model also knows Sen. Mitch McConnell, the Kentucky Republican, talks often about “coal miners.”
But the model’s strength lies not in making obvious observations but in spotting things others might miss. The model has picked up on New Jersey Democrat Rep. Josh Gottheimer’s use of the phrase “moocher states,” for example, a phrase more closely associated with libertarian groups than his own party. And the model recognizes Rep. Yvette Clarke’s interest in “confederate generals,” as it relates to street names in Fort Hamilton, near her Brooklyn, New York, district.
The model notices issues that aren’t quite on the national radar, like the “wotus rule” (the Waters of the United States Rule, a change in who regulates water pollution that has raised the ire of Republicans such as Rep. Bob Gibbs of Ohio). It also picks up on widespread interest among representatives of the rural West, including Sen. Mike Enzi of Wyoming and Rep. Rob Bishop of Utah, in whether to add the sage grouse to the endangered species list, which would trigger rules that could limit farming and industry near the bird’s habitat.
Just because a topic appears on one member’s list but not another’s doesn’t mean the second Congress member doesn’t care about it. There may simply be more distinctive topics that they talk about. And for now, that means big topics that lots of representatives and senators talk about, such as education or crime, aren’t included in each member’s list. But we’re working on ways to reflect those, too.
Along with identifying discrete topics, the model finds which members’ press releases are most similar, in topic or turns of phrase, in essence calculating who “sounds like” whom.
The representative whose press releases are closest to Rep. John Lewis’ is Rep. A. Donald McEachin, another African-American Democrat from a southern state. Rep. Thomas Massie, the model says, puts out releases similar to those of Sen. Rand Paul, his fellow Kentuckian who also leans libertarian.
How the Model Works
Our code relies on an approximation of what English words mean, created by mathematically representing the contexts in which they occur. The theory that this would give you an idea of words’ meanings is called “distributional semantics.”
Why the particular technique we use, called Word2Vec, works so well is a bit of a mystery — especially if you, like me, never studied linear algebra — but it does work. Without being explicitly programmed to know anything about U.S. politics, the model has learned a lot about how our country works:
It knows that “death tax” and “estate tax” refer to the same thing.
If you ask the model who has the same kind of relationship to Senate Majority Leader Mitch McConnell that Rep. Nancy Pelosi has to Rep. Paul Ryan, its answer is Sen. Chuck Schumer — the Democratic minority leader in the Senate. (Well, it’s a tie: the model suggests Schumer and his predecessor in that position, Harry Reid.)
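The analogy question above comes down to vector arithmetic: subtract one word's vector from another, add a third, and look for the nearest remaining word. Here is a minimal sketch of that idea in plain Python. It is not our production code; the two-dimensional vectors are made up for illustration (a real Word2Vec model learns its vectors from text and uses far more dimensions), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Rank words by closeness to vec(b) - vec(a) + vec(c).

    Reads as: "a is to b as c is to ...?"
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates,
                  key=lambda w: cosine(vectors[w], target),
                  reverse=True)

# Invented toy vectors (assumption): the first number loosely stands for
# chamber (House = 1, Senate = -1), the second for majority (1) or
# minority (-1) status. A trained model has no such labeled axes.
TOY_VECTORS = {
    "ryan":      [1.0, 1.0],
    "pelosi":    [1.0, -1.0],
    "mcconnell": [-1.0, 1.0],
    "schumer":   [-1.0, -1.0],
    "reid":      [-1.0, -0.8],
    "enzi":      [-1.0, 0.9],
}

# "Ryan is to Pelosi as McConnell is to ...?"
ranked = analogy(TOY_VECTORS, "ryan", "pelosi", "mcconnell")
print(ranked)
```

With these invented numbers, Schumer ranks first and Reid a close second, which mirrors the near-tie described above but proves nothing about the real model.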
A related technique, Doc2Vec, assigns a value to individual press releases or a member’s entire body of press releases from the sum of the meanings of the words. Similar to the way in which DW-Nominate, a powerful statistical technique used to characterize where politicians stand along a political spectrum, transforms a congressperson’s voting record into a location in two dimensions, Doc2Vec transforms what the Congress member says into a location in 100 dimensions. (However, unlike DW-Nominate, there’s no good way to translate those dimensions into anything that makes analytical sense to humans.) Finding Congress members who sound alike is as easy as finding each member’s “nearest neighbor” in this imaginary 100-dimensional space.
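To make the “nearest neighbor” step concrete, here is a small sketch in plain Python. It assumes each member has already been reduced to one vector and that closeness is measured with cosine similarity, which is a common choice but an assumption here. The member names and the three-dimensional vectors are placeholders standing in for real 100-dimensional output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(name, doc_vectors):
    """Return the other member whose vector is most similar to `name`'s."""
    others = (n for n in doc_vectors if n != name)
    return max(others,
               key=lambda n: cosine_similarity(doc_vectors[name],
                                               doc_vectors[n]))

# Placeholder vectors (assumption): invented numbers, three dimensions
# instead of 100, hypothetical member names.
DOC_VECTORS = {
    "member_a": [0.9, 0.1, 0.3],
    "member_b": [0.8, 0.2, 0.4],
    "member_c": [-0.5, 0.9, 0.0],
    "member_d": [-0.4, 0.8, 0.1],
}

for member in DOC_VECTORS:
    print(member, "sounds most like", nearest_neighbor(member, DOC_VECTORS))
```

Checking every pair like this is fine for 535 members; a much larger collection would likely call for an approximate nearest-neighbor index instead.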
The topics are generated in a way that uses the same software, called Gensim, but relies less on linear algebra and more on counting. It finds the phrases that occur most often in each member’s statements but rarely in everyone else’s, using a statistical measure called term frequency-inverse document frequency (often shortened to “TF-IDF”) that is a useful proxy for importance. More concretely, it finds that Sen. Enzi’s statements contain the phrase “sage grouse” a lot, but that phrase appears frequently in only a few other members’ statements. A more general topic like “environment” would not show up, since it’s relatively common and only one word long.
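The counting behind TF-IDF fits in a few lines. The sketch below uses only Python's standard library rather than Gensim, treats every pair of adjacent words as a candidate phrase, and scores each phrase as its share of a member's phrases multiplied by the log of (number of members / number of members who use it). Those choices, the function names and the sample statements are all assumptions made for illustration, not a description of our pipeline.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text into lowercase two-word phrases."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def distinctive_phrases(statements, top_n=3):
    """Return each member's top_n phrases ranked by TF-IDF score."""
    counts = {m: Counter(bigrams(t)) for m, t in statements.items()}
    n_members = len(counts)

    # Document frequency: how many members use each phrase at least once.
    doc_freq = Counter()
    for member_counts in counts.values():
        doc_freq.update(member_counts.keys())

    result = {}
    for member, member_counts in counts.items():
        total = sum(member_counts.values())
        scores = {
            phrase: (n / total) * math.log(n_members / doc_freq[phrase])
            for phrase, n in member_counts.items()
        }
        result[member] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return result

# Invented sample text (assumption), not real press releases.
STATEMENTS = {
    "member_a": "sage grouse listing would hurt ranchers and the "
                "sage grouse plan ignores local input on public lands",
    "member_b": "email privacy matters on public lands and "
                "email privacy protects families",
    "member_c": "coal miners work near public lands and "
                "coal miners built this state",
}

top = distinctive_phrases(STATEMENTS)
for member, phrases in top.items():
    print(member, phrases)
```

In this toy data, “public lands” appears in every member's text, so its score is zero and it never makes anyone's list, which is the same reason broad topics drop out of the real results.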
The results of the TF-IDF algorithm are not presented verbatim; we do some manual filtering to exclude, say, the name of the member’s contact person for press releases or the phrasing of their “contact me” button.
There’s more in store. Stay tuned for a way to see what bills are related to a given topic — in a way that’s more powerful than just a keyword search. We’re also planning to throw floor statements into the model, as part of the relaunch of the CapitolWords project we inherited from Sunlight Labs earlier this year.
So how did our algorithm do on Murray? Broader topics like “education” and “health care” tend to be passed over in favor of more specific pieces of the topic, like “Trumpcare bill,” a topic the algorithm identified as one of Murray’s. And the algorithm does list Murray as one of the members most similar to Michael Bennet. Pretty decent for some math and a pile of press releases.