Challenges Using Machine Learning Classification on the News

News isn’t as easy as circles and squares…

At Public Good, we provide tools that let people take action on the news. Whether it’s the refugee crisis, climate change, or a gang shooting, our goal is to allow people to help make the world better when they are motivated by good journalism. It’s not a completely new idea, but we revolutionized it by bringing machine learning to the party. In today’s overtaxed newsrooms, it’s not possible for reporters or editors to take on another responsibility. So for taking action to flourish, it needs to be automated and as simple to implement as a social media button.

For us, this begins with the problem of classifying news content. Until you know what a story is about, you can’t recommend actions. Unfortunately, the idea of a semantic web has failed us: there are no consistent tags or metadata to point in the right direction. Even if there were, the taxonomy used for navigating a news site (e.g. sports, entertainment, lifestyle, news) would likely differ from the taxonomy needed for taking action (e.g. violence, poverty, natural disasters), with no obvious mapping between them. So the most obvious first approach is to use machine learning classification algorithms. While we still use these methods as part of our overall system, we discovered that they performed worse than we expected on arbitrary textual data, given our large training set and relatively small number of classes, and we’re beginning to understand why.

Most major classification algorithms make use of word frequency (or phrase frequency) as part of their analyses. When we first looked at the total number of unique words and phrases in our content, which is high, we were optimistic that these algorithms would perform well. But a closer look reveals that the distribution of terms is highly imbalanced. A relatively small vocabulary of common words makes up the vast bulk of content (which we might have expected, given that news content tends to be written for a wide audience with varying language skills and reading levels), while the wide variety of words consists largely of proper nouns and other types of entities.
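One way to see this kind of imbalance in a corpus is to measure how much of the text the most frequent terms cover, and how much of the vocabulary appears only once. The snippet below is a minimal sketch using only the Python standard library. The function names and the two-sentence corpus are made up for illustration and are not our production code. It also does not identify proper nouns, it only shows the shape of the distribution.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(docs):
    """Count how often each term occurs across all documents."""
    return Counter(tok for doc in docs for tok in tokenize(doc))


def coverage_of_top_terms(docs, k):
    """Fraction of all token occurrences covered by the k most frequent terms."""
    counts = term_counts(docs)
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total if total else 0.0


def singleton_share(docs):
    """Fraction of distinct terms that occur exactly once in the corpus."""
    counts = term_counts(docs)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


# Hypothetical toy corpus, for illustration only.
docs = ["the storm hit the coast", "the mayor visited the coast"]
print(coverage_of_top_terms(docs, 2))  # share of text covered by the top 2 terms
print(singleton_share(docs))           # share of vocabulary seen only once
```

On a real news corpus, a high coverage value for a small k together with a high singleton share would be consistent with the pattern described above: a small common vocabulary carrying most of the text, and a long tail of rarely seen terms.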

In retrospect, this isn’t surprising. News tends to be about people, places, and organizations. And unlike other kinds of textual data, these entities tend to enter the news vocabulary suddenly (e.g. “Hurricane Irma”) and often leave it just days or weeks later as the news cycle moves on. A human often needs only to hear a phrase like “Harvey Weinstein” to immediately register “sexual harassment”, but classifiers looking at months or years of historical data are proving less effective at making that determination. While the term is highly relevant, it appears in only a very small sample, and just weeks before it came to mean sexual harassment it would have been a strong indicator of TV and movies.

While working to optimize our core classifiers, we’ve seen that some mitigation strategies can help a lot. First, online learning gets breaking news terms into the algorithms more quickly than batch training. Second, retiring old content from a training corpus as quickly as possible reduces the chance that a term that has become meaningful in breaking news will be associated with a previous meaning. And third, tuning the thresholds that decide which common words are treated as stop terms makes a big difference for the bulk of everyday language.
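To make these three strategies concrete, here is a minimal sketch of a Naive Bayes text classifier that learns one story at a time, fades older evidence with a decay factor, and ignores a configurable set of stop terms. It is an illustration under our own assumptions, not a description of our actual system: the class name `DecayingNaiveBayes`, the `decay` parameter, the labels, and the example headlines are all hypothetical, and exponential decay is just one simple way to approximate retiring old content.

```python
import math
from collections import defaultdict


class DecayingNaiveBayes:
    """Multinomial Naive Bayes with incremental updates and fading counts.

    Hypothetical example class; names and defaults are assumptions.
    """

    def __init__(self, decay=0.9, stop_terms=()):
        self.decay = decay                  # 1.0 means old content never fades
        self.stop_terms = set(stop_terms)   # terms ignored entirely
        self.term_counts = defaultdict(lambda: defaultdict(float))
        self.doc_counts = defaultdict(float)
        self.vocab = set()

    def _terms(self, text):
        return [t for t in text.lower().split() if t not in self.stop_terms]

    def learn(self, text, label):
        """Online update: fold a single labeled story into the model."""
        self.doc_counts[label] += 1.0
        for term in self._terms(text):
            self.term_counts[label][term] += 1.0
            self.vocab.add(term)

    def age(self):
        """Call once per time step (e.g. daily) to shrink older evidence."""
        for label in self.doc_counts:
            self.doc_counts[label] *= self.decay
            for term in self.term_counts[label]:
                self.term_counts[label][term] *= self.decay

    def predict(self, text):
        """Return the most likely label, using add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for term in self._terms(text):
                score += math.log((counts.get(term, 0.0) + 1.0) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train(decay):
    """Replay a made-up history: old entertainment stories, then breaking news."""
    clf = DecayingNaiveBayes(decay=decay, stop_terms={"the"})
    for _ in range(5):
        clf.learn("weinstein film premiere", "entertainment")
    for _ in range(6):
        clf.age()  # time passes before the story breaks
    for _ in range(2):
        clf.learn("weinstein harassment allegations", "harassment")
    return clf


print(train(decay=1.0).predict("weinstein"))  # no fading of old content
print(train(decay=0.5).predict("weinstein"))  # old content fades quickly
```

In this toy history, the model that never fades old content still ties the name to entertainment, while the model with aggressive decay follows the newer meaning after only two stories. The right decay rate and stop-term list for a real corpus would have to be found by experiment.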

Most importantly, we learned that classifying the news accurately is a lot more complicated than setting up an off-the-shelf classifier and feeding it a bunch of data. Operational optimizations can help, but to be accurate on breaking news, ensemble ML methods and a breaking news team are critical.