Speaker of the House Paul Ryan is a tax wonk ― and most observers of Congress know that. But knowing what interests the other 534 members of Congress is harder.

To make it easier to know what issues each lawmaker really focuses on, we’re launching a new feature in our Represent database called Policy Priorities. We had two goals in creating it: to help researchers and journalists understand what drives particular members of Congress, and to enable regular citizens to compare their representatives’ priorities to their own and their communities’.

We created Policy Priorities using some sophisticated computer algorithms (more on this in a second) to calculate interest based on what each congressperson talks ― and brags ― about in their press releases.

Voting and drafting legislation aren’t the only things members of Congress do with their time, but they’re often the main way we analyze congressional data, in part because they’re easily measured. But the job of a member of Congress goes well past voting. They go to committee meetings, discuss policy on the floor and in caucuses, raise funds and ― important for our purposes ― communicate with their constituents and journalists back home. They use press releases to talk about what they’ve accomplished and to demonstrate their commitment to their political ideals.

We’ve been gathering these press releases for a few years, and have a body of some 86,000 that we used for a kind of analysis called machine learning.

The press releases might comment about a bill that just advanced in committee or a post office that was renamed ― or they might be weighing in on issues of the day that have yet to be addressed by Congress. We think these press releases are a great signal for what a congressperson cares about ― or, at least, what they want their constituents to see them caring about.

Policy Priorities is inspired by some research by political scientist Justin Grimmer, who used a somewhat similar approach to calculate what he called an “Expressed Agenda” for each U.S. senator. Our model differs a bit so we’re not using the same name.

Grimmer says his approach yields an indirect representation of a legislator’s priorities. Rather than directly measuring the sum of what a legislator does, the expressed agenda measures what the legislator does filtered through what he or she thinks their constituents want to hear about: “legislators employ the resources of their office to portray how they are responding to the priorities and concerns of their constituents.” It’s not just that Ryan cares about taxes ― his press releases show that he wants to be known as someone who is deeply knowledgeable about the tax system.

Here’s how, in broad strokes, we calculate each Congress member’s Policy Priorities. It relies on two main ingredients.

The first is doc2vec ― the same machine learning algorithm that “learns” which members’ press releases sound most like one another’s, a feature we released in October.

The second ingredient is a list of policy area categories, along with a few hand-picked keywords that are typical for that policy area: like “Obamacare” for the health policy area or “refugees” for immigration.

The categories we use — like the one called “civil rights and liberties, minority issues” ― are a slight variation on a set of policy areas created by the Congressional Research Service. One policy area is assigned to each bill introduced since 1973. We made some slight modifications to the categories ― combining “social sciences and history” with “arts, culture, religion” and eliminating “animals,” “law” and “Congress” completely.

With those two ingredients in hand, we look at each press release from the current and previous Congress and assess how similar the press release is to the “meaning” ― that is, what the doc2vec model learned ― of each of the categories. Our model uses that comparison to decide what policy areas a press release belongs to. Then we add up the number of press releases in each category to generate the breakdown for each member of Congress on the site.

There are some topics that don’t fit well into this taxonomy. “Government operations and politics” includes a lot of bills, like ones renaming post offices, but probably isn’t top-of-mind for any given member.

Sen. Chuck Schumer's Policy Priorities, as calculated by ProPublica's machine learning model. (ProPublica)

A more glaring example is bills about abortion. Bills in favor of expanding abortion rights are often categorized as “health” bills, because they might, for instance, require health insurance plans to cover the costs of abortions. Bills that seek to restrict abortion may be categorized under “crime and law enforcement” if they seek to criminalize late-term abortions. Because doc2vec “knows” that words about abortion refer to the same idea, whether they’re from pro-choice or pro-life legislators, a member who focuses on abortion might have a relatively high proportion of their press releases coded as both “crime and law enforcement” and “health,” regardless of their view on the issue.

Nevertheless, the Policy Priorities broadly match up with what a congressional expert might say each congressperson focuses on. We validated them by checking to see if congressional committee chairs have a Policy Priority score above the median for issues that their committee deals with ― and 78 percent of them do. And members of Congress from the West are those who focus most on regional topics like “public lands and natural resources.”
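The committee-chair check can be sketched in a few lines of Python. This is a minimal illustration, not the code that runs on the site: the function name, the shape of the inputs and the pairing of each chair with a single policy area are all assumptions made for the example, and the scores are invented.

```python
from statistics import median


def share_of_chairs_above_median(scores, chairs):
    """Fraction of committee chairs whose score in their committee's
    policy area is above the median score across all members.

    scores: {member: {policy_area: score}} (hypothetical layout)
    chairs: list of (member, policy_area) pairs (hypothetical mapping)
    """
    hits = 0
    for member, area in chairs:
        # Median score for this policy area across every member.
        area_median = median(s.get(area, 0.0) for s in scores.values())
        if scores[member].get(area, 0.0) > area_median:
            hits += 1
    return hits / len(chairs)


# Invented toy scores for three members.
toy_scores = {
    "A": {"taxation": 0.5, "agriculture": 0.0, "health": 0.4},
    "B": {"taxation": 0.1, "agriculture": 0.3, "health": 0.5},
    "C": {"taxation": 0.2, "agriculture": 0.1, "health": 0.1},
}
toy_chairs = [("A", "taxation"), ("B", "agriculture"), ("C", "health")]
share = share_of_chairs_above_median(toy_scores, toy_chairs)
```

In this toy data, two of the three chairs score above the median for their committee’s area.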

We also validated the topics assigned to each press release by examining press releases that mention a bill number (e.g. “H.R. 1234”) and using the Library of Congress’ assigned policy area for that bill as a “ground truth” for the topic of the press release. We then compared the model’s guesses of one to four policy areas to the policy area of the bill; 81 percent got the correct answer. This validation method isn’t perfect, in that most press releases don’t mention a bill number and, even among those that do, this “ground truth” may not actually match up with the real ground truth ― what a human being might say the press release is about. Some press releases talk about a facet of an issue that may not be what the Library of Congress chose to focus on in its classification. For instance, a bill to create a “workforce development tax credit” is classified as a bill about taxation ― perhaps because, on a fine-grain level, a piece of the tax code is what the law would change ― while a press release on the topic instead focuses on workforce development issues, that is, what the model sees as “labor and employment.”
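A rough sketch of the bill-number check, in Python, might look like the following. It rests on several assumptions: the regular expression, the function names and the lookup table are hypothetical, and counting a press release as correct when any mentioned bill’s policy area appears among the model’s guesses is one plausible reading of the method, not a confirmed detail.

```python
import re

# Matches mentions like "H.R. 1234" or "S. 99". The lookbehind keeps
# "U.S. 50" from being read as Senate bill 50.
BILL_RE = re.compile(r"(?<![A-Za-z.])(H\.R\.|S\.)\s*(\d+)")


def find_bills(text):
    """Return (prefix, number) pairs for each bill mentioned in text."""
    return [(m.group(1), int(m.group(2))) for m in BILL_RE.finditer(text)]


def hit_rate(releases, bill_policy_area):
    """Share of bill-mentioning releases where a guess matches a bill's area.

    releases: list of (text, guessed_policy_areas)
    bill_policy_area: {(prefix, number): policy_area} (hypothetical lookup)
    """
    checked = hits = 0
    for text, guesses in releases:
        areas = {bill_policy_area[b] for b in find_bills(text)
                 if b in bill_policy_area}
        if not areas:
            continue  # no known bill mentioned, so nothing to compare
        checked += 1
        if areas & set(guesses):
            hits += 1
    return hits / checked if checked else None


# Invented examples.
toy_lookup = {("H.R.", 1234): "taxation", ("S.", 99): "health"}
toy_releases = [
    ("I introduced H.R. 1234 today.", ["taxation", "labor and employment"]),
    ("S. 99 would fund rural clinics.", ["immigration"]),
    ("No bill here, just U.S. 50 states.", ["health"]),
]
rate = hit_rate(toy_releases, toy_lookup)
```

The third toy release mentions no bill, so it is left out of the denominator, which mirrors the caveat that most press releases can’t be checked this way.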

In contrast to the Distinctive Topics feature we added in October, which listed any topic under the sun, from “email privacy” to “sage grouse,” there are only 24 possible Policy Priorities, so you can compare one member to another.

We’re putting Policy Priorities in the gray “Representative Topics” box on every member’s page, and our pages for bills in each of the policy areas ― like this one, for bills about emergency management ― now include a list of the lawmakers who focus most on that area.

Here’s a more detailed and technical description of how we generate Policy Priorities:

The doc2vec algorithm, after being fed years of press releases from Congress, counts which words tend to occur in the same contexts and uses those counts to “learn” a model of the subset of the English language used in congressional press releases. It discovers that words that often appear together, but appear less often apart, are likely about the same topic; those that occur in almost the exact same contexts mean approximately the same thing. It assigns each word and each press release a numeric vector in 100-dimensional space, where similar words have vectors that are close to one another. The relationships between these vectors mirror the relationships between words, so that if you were to ask the model who has the same kind of relationship to Senate Majority Leader Mitch McConnell that Rep. Nancy Pelosi has to Rep. Paul Ryan, it’ll answer Chuck Schumer ― the Democratic minority leader in the Senate. (Well, it’s a tie: Schumer and his predecessor in that position, Harry Reid.)
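The analogy question comes down to vector arithmetic. The Python sketch below uses hand-made two-dimensional vectors (one axis loosely standing for party, the other for chamber) purely for illustration; the real model’s vectors have 100 dimensions and are learned from the press releases, and none of the numbers here come from it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Answer "a is to b as ? is to c" with the closest remaining name."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [name for name in vectors if name not in (a, b, c)]
    return max(candidates, key=lambda name: cosine(vectors[name], target))


# Invented toy vectors: [party, chamber]. Not output from any real model.
toy_vectors = {
    "pelosi": [1.0, -1.0],
    "ryan": [-1.0, -1.0],
    "mcconnell": [-1.0, 1.0],
    "schumer": [1.0, 1.0],
    "reid": [0.9, 1.1],
}
answer = analogy(toy_vectors, "pelosi", "ryan", "mcconnell")
```

With these made-up numbers the closest match is "schumer", with "reid" just behind.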

Once the model is trained, each press release’s document vector (or, for very new press releases that weren’t available at model-training time, an inferred document vector) is compared to the average word vector of the keywords for each category. Each press release is assigned one to four categories, with their weight proportional to the similarity between the document vector and the averaged word vectors; we discard every category whose similarity is less than 80 percent of the top category’s.
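The assignment step can be sketched as follows. This is a simplified Python illustration under stated assumptions: cosine similarity stands in for whatever similarity measure the model uses, weights are normalized so they sum to 1, the function and parameter names are hypothetical, and the tiny two-dimensional vectors are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign_categories(doc_vec, keyword_vecs, cutoff=0.8, max_categories=4):
    """Return {category: weight} for one press release.

    keyword_vecs: {category: [word vectors of that category's keywords]}
    """
    sims = {cat: cosine(doc_vec, mean_vector(vecs))
            for cat, vecs in keyword_vecs.items()}
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    top_cat, top_sim = ranked[0]
    if top_sim <= 0:
        # Assumption: always keep at least the single best category.
        return {top_cat: 1.0}
    # Keep at most four categories within 80 percent of the top similarity.
    kept = [(cat, s) for cat, s in ranked[:max_categories]
            if s >= cutoff * top_sim]
    total = sum(s for _, s in kept)
    return {cat: s / total for cat, s in kept}


# Invented toy keyword vectors.
toy_keywords = {
    "health": [[1.0, 0.0], [1.0, 0.2]],
    "taxation": [[0.0, 1.0]],
    "immigration": [[1.0, 1.0]],
}
one_topic = assign_categories([1.0, 0.0], toy_keywords)
two_topics = assign_categories([1.0, 0.6], toy_keywords)
```

In the toy data, the first release lands only in "health", while the second is split between "immigration" and "health" because both clear the 80 percent cutoff.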

And finally, adding up proportions of the categories for each lawmaker’s collection of press releases since 2014 gives us their Policy Priorities.
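The final tally might look something like this in Python. It is a sketch: the input layout and function name are hypothetical, and dividing by the total so each member’s shares sum to 1 is an assumption about how the breakdown is presented.

```python
from collections import defaultdict


def policy_priorities(releases):
    """Sum per-release category weights into a breakdown per member.

    releases: iterable of (member, {category: weight}) pairs
    Returns {member: {category: share}}, shares summing to 1 per member.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for member, weights in releases:
        for category, weight in weights.items():
            totals[member][category] += weight

    breakdown = {}
    for member, categories in totals.items():
        grand_total = sum(categories.values())
        if grand_total > 0:
            breakdown[member] = {cat: value / grand_total
                                 for cat, value in categories.items()}
    return breakdown


# Invented toy input: two releases for member "A", one for member "B".
toy = [
    ("A", {"health": 1.0}),
    ("A", {"health": 0.5, "taxation": 0.5}),
    ("B", {"immigration": 1.0}),
]
priorities = policy_priorities(toy)
```

Here member "A" ends up with three-quarters "health" and one-quarter "taxation".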

Our thanks to Justin Grimmer, Ines Montani and Laura Bronner for their help in discussing how to implement Policy Priorities.