This week I’m rounding off the first semester of classes on the new MA in Data Journalism with a session on artificial intelligence (AI) and machine learning. Machine learning is a subset of AI — and an area which holds enormous potential for journalism, both as a tool and as a subject for journalistic scrutiny.
So I thought I would share part of the class here, showing some examples of how the 3 types of machine learning (supervised, unsupervised, and reinforcement) have already been used for journalistic purposes, and using those examples to explain each type along the way.
Supervised learning: Doctors & Sex Abuse
Supervised learning is a machine learning approach that is often used to classify things, or perform some sort of regression to establish the strength of a relationship between two things (which can then be used to make predictions based on new data).
For example, you might give a supervised learning algorithm some training data that includes a person’s Facebook updates, and their known political leaning. Once ‘trained’ on this data, it might then be let loose on more data where it can try to guess or estimate political leaning based on what it learned.
The Atlanta Journal-Constitution’s investigation into Doctors & Sex Abuse is one example of the use of such an approach in journalism. After collecting over 100,000 disciplinary documents using scraping, the backgrounder to the investigation explains:
“We then created a computer program based on “machine learning” to analyze each case and, based on keywords, give each a probability rating that it was related to a case of physician sexual misconduct.
“We then read all the documents in over 6,000 cases to determine the nature of each case and board action.”
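To make the idea of a keyword-based "probability rating" concrete, here is a minimal sketch of a naive Bayes text classifier in plain Python. It is not the AJC's program (their method is not described beyond the quote above); the training sentences and function names are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labelled_docs):
    """labelled_docs: list of (text, is_relevant) pairs tagged by hand."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def probability_relevant(text, model):
    """Return the estimated probability that a document is relevant."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing so an unseen pairing never gives log(0)
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn the two log scores into a probability between 0 and 1
    highest = max(log_scores.values())
    relevant = math.exp(log_scores[True] - highest)
    other = math.exp(log_scores[False] - highest)
    return relevant / (relevant + other)

# Hypothetical hand-labelled examples (not real disciplinary records)
training = [
    ("license revoked after sexual misconduct with patient", True),
    ("board found inappropriate sexual contact during exam", True),
    ("reprimand for late renewal of license fee", False),
    ("fined for incomplete patient billing records", False),
]
model = train(training)
high = probability_relevant("complaint alleges sexual misconduct", model)
low = probability_relevant("late billing records", model)
print(round(high, 2), round(low, 2))
```

As in the investigation, a rating like this only narrows down what to read: the documents with the highest scores still need checking by a person.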
A similar example comes from SRF in Switzerland, which used supervised learning to help an algorithm identify the characteristics that typify fake Instagram accounts. The video below explains the process neatly.
As well as being used to identify items which meet particular characteristics, supervised learning can be useful in identifying when information differs from those characteristics: the LA Times, for example, used this approach for their story “LAPD underreported serious assaults, skewing crime stats for 8 years”. As their background piece explains:
“The computer program pulled crime data from the previous Times review to learn key words that identified an assault as serious or minor. The algorithm then analyzed nearly eight years of data in search of classification errors.”
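The general pattern described in that quote (learn key words from records that have already been reviewed, then look for records whose label disagrees with what the words suggest) can be sketched as follows. This is a deliberately crude illustration with invented records and field names, not the LA Times's algorithm.

```python
from collections import Counter

def learn_keyword_weights(reviewed):
    """reviewed: list of (narrative, label), label is 'serious' or 'minor'.

    Returns, for each word, the share of reviewed records containing it
    that were serious.
    """
    serious, total = Counter(), Counter()
    for narrative, label in reviewed:
        for word in set(narrative.lower().split()):
            total[word] += 1
            if label == "serious":
                serious[word] += 1
    return {word: serious[word] / total[word] for word in total}

def predict(narrative, weights):
    """Average the weights of known words; None if no word is known."""
    known = [weights[w] for w in narrative.lower().split() if w in weights]
    if not known:
        return None
    return "serious" if sum(known) / len(known) > 0.5 else "minor"

def possible_errors(records, weights):
    """Records whose recorded label disagrees with the prediction."""
    return [r for r in records
            if predict(r["narrative"], weights) not in (None, r["recorded_as"])]

# Hypothetical reviewed records and unreviewed records
reviewed = [
    ("victim stabbed with knife", "serious"),
    ("suspect shot victim", "serious"),
    ("suspect pushed victim", "minor"),
    ("victim slapped during argument", "minor"),
]
records = [
    {"id": 1, "narrative": "victim stabbed in arm", "recorded_as": "minor"},
    {"id": 2, "narrative": "suspect pushed victim", "recorded_as": "minor"},
    {"id": 3, "narrative": "suspect shot victim", "recorded_as": "serious"},
]
weights = learn_keyword_weights(reviewed)
flagged = possible_errors(records, weights)
print([r["id"] for r in flagged])
```

The flagged records are leads rather than findings: each one would need to be checked against the original report.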
The predictability of certain features can itself be the subject of a story: Haluka Maier-Borst’s piece for the FT looks at how closely particular voters’ characteristics correlate with voting behaviour. In this case different models have different levels of accuracy.
Another example from the Atlanta Journal-Constitution used pooled logistic regression in 2015 to forecast whether bills would pass the Georgia legislature. What’s particularly important about this example is that the performance of the model is regularly reviewed (by Jeff Ernsthausen), in public.
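For readers who have not met logistic regression before, the sketch below fits a plain (not pooled) logistic regression by gradient descent in pure Python. The two features and every number in it are invented; the AJC's actual variables and model are not described here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Stochastic gradient descent on the logistic (log-loss) objective."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict_probability(features, weights, bias) - label
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias

# Hypothetical features per bill:
# [sponsor is in the majority party (1/0), share of chamber co-sponsoring]
rows = [[1, 0.8], [1, 0.5], [1, 0.6], [1, 0.1],
        [0, 0.2], [0, 0.1], [0, 0.4], [0, 0.9]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = bill passed, 0 = it did not

weights, bias = fit_logistic(rows, labels)
likely = predict_probability([1, 0.9], weights, bias)
unlikely = predict_probability([0, 0.1], weights, bias)
print(round(likely, 2), round(unlikely, 2))
```

The output is a probability rather than a yes or no, which is what makes it possible to review a forecast's performance in public afterwards.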
And logistic regression was also used by the New York Times’s Daeil Kim as part of an investigation into faulty car air bags.
Unsupervised learning: identifying motifs in Wes Anderson films
Unsupervised learning approaches tend to be used to identify things that are associated with each other, or clusters of features. In other words, it can be useful for categorisation when you are not quite sure what the categories will be.
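A minimal sketch of one common clustering method, k-means, shows what "not being sure what the categories will be" means in practice: the code is given only points and a number of groups, never any labels. The points are invented, and the starting centroids are simply the first k points, which is a simplification.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def k_means(points, k, iterations=20):
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    assignments = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid
        assignments = [closest(p, centroids) for p in points]
        # Step 2: move each centroid to the mean of its members
        for index in range(k):
            members = [p for p, a in zip(points, assignments) if a == index]
            if members:
                centroids[index] = [sum(v) / len(members)
                                    for v in zip(*members)]
    return assignments, centroids

# Hypothetical items, each summarised by two numbers
points = [(1, 1), (9, 9), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.7)]
assignments, centroids = k_means(points, k=2)
print(assignments)
```

The algorithm returns group numbers, not names: deciding what each cluster means is still the journalist's job.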
The applications for journalism are less obvious. But Yannick Assogba’s striking interactive analysis of Wes Anderson films gives one clue to how they might, for example, inform a new form of arts reporting: instead of identifying visual motifs manually, Assogba lets a machine learning algorithm loose on a sample of 2309 frames from 4 movies to see what it turns up.
“Images of television screens from across 3 movies form a tight cluster. We also see clearer groupings of things like shelves, signage, or the text and titles Anderson frequently puts front and center in his films.”
The article is as much an examination and explanation of how machine learning works, as it is an examination of Wes Anderson films, and is worth reading for that reason alone.
Another useful backgrounder, while not journalistic, is Instagram Engineering’s attempt to use unsupervised learning to understand how emojis are used. If you’ve ever wondered what the difference is between the red heart and a blue, green or pink one, this is just one attempt to answer that question by learning from associations and clusters in the contexts in which they are used.
An analogous journalistic use might be to identify terms that tend to be clustered together in a group of documents, as a way of suggesting terms for document searches which might not otherwise have been considered.
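As a rough illustration of that idea, the sketch below counts which terms appear together in the same documents and uses the counts to suggest extra search terms. Counting co-occurrence is a much simpler stand-in for proper clustering, and the documents are invented.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(documents):
    """Count how many documents each pair of terms appears in together."""
    pair_counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

def suggest_terms(query, pair_counts, top_n=3):
    """Terms that most often share a document with the query term."""
    related = Counter()
    for (first, second), count in pair_counts.items():
        if first == query:
            related[second] += count
        elif second == query:
            related[first] += count
    return [term for term, _ in related.most_common(top_n)]

# Hypothetical document snippets
documents = [
    "contract awarded consultancy tender",
    "tender consultancy waiver",
    "consultancy waiver contract",
    "budget meeting minutes",
]
pair_counts = co_occurrence_counts(documents)
suggestions = suggest_terms("consultancy", pair_counts)
print(suggestions)
```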
Curious clusters and pattern recognition
“Curious clusters”, then, is a useful way of summing up one of the main applications for unsupervised learning in journalism: this piece, which uses an algorithm to identify suspicious clustering in submissions to a consultation over net neutrality, is another example. It’s not clear whether this was genuine machine learning, but it’s a good example of how unsupervised learning techniques could be used in a similar way.
Another common feature is pattern recognition: Chase Davis’s 2011 piece for California Watch, “Brown signs dozens of bills previously vetoed by Schwarzenegger”, sadly seems to have been lost in the move to the Revealnews.org site, but the story is archived on the Wayback Machine. Machine learning was used to identify bills “based on the similarity of their text, not bill titles or sponsors, which often change between sessions,” according to the story. It is another example of the sort of clustering which unsupervised learning does well.
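One simple way to compare documents "based on the similarity of their text" is to count the words in each and measure the cosine similarity between the counts. The sketch below does that in plain Python; the story does not say which similarity measure was used, and the bill titles and texts here are invented.

```python
import math
from collections import Counter

def word_vector(text):
    """Bag of words: how many times each word appears."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """1.0 for identical word proportions, 0.0 for no shared words."""
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(new_bill, old_bills):
    """Return (score, title) of the old bill closest to the new text."""
    new_vector = word_vector(new_bill)
    scored = [(cosine_similarity(new_vector, word_vector(text)), title)
              for title, text in old_bills.items()]
    return max(scored)

# Hypothetical bills
old_bills = {
    "AB 100 (2009)": "requires school districts to report lunch program "
                     "spending annually",
    "SB 200 (2010)": "bans the sale of shark fins in the state",
}
new_bill = "school districts must report lunch program spending every year"
score, title = most_similar(new_bill, old_bills)
print(title, round(score, 2))
```

Because the comparison ignores titles and sponsors entirely, a reintroduced bill can be matched even when both have changed.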
Davis’s 2012 “nights-and-weekends” project on machine learning and quote extraction (which can be found on Reveal) is another example:
“For training, we fed the algorithm a set of several hundred randomly selected paragraphs from our database of The Bay Citizen content, which we tagged by hand as being quotes or not. For features, we developed a set of about a dozen, of which we ended up using six.
“Using those inputs, NLTK’s maxent implementation uses your choice of optimization algorithms to figure out which features are the most useful, then uses those weighted features to determine whether an unseen paragraph should be classified as a quote.”
Again, transparency is important here: the code is available on GitHub. Similarly, the LA Times’s Anthony Pesce used machine learning and Natural Language Processing (NLP) to extract recipes from articles.
Reinforcement learning: beating you at Rock-Paper-Scissors
Reinforcement learning is a machine learning approach that allows a computer to learn what works best in a particular task through trial and error, guided by rewards. It is most famously used in games: Google DeepMind’s AlphaGo Zero, for example, took just 3 days of playing against itself to surpass the earlier version of AlphaGo that had beaten world champion Lee Sedol at the Chinese game Go.
For that reason it’s hard to find applications in journalism. The closest is perhaps the New York Times’s Rock-Paper-Scissors interactive, which “can exploit a person’s tendencies and patterns to gain an advantage over its opponent”. It is a piece of journalism about automation itself.
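To show how a program might "exploit a person's tendencies and patterns", here is a minimal sketch that remembers which move a player tends to make after each previous move and plays whatever beats the most likely next one. It is a simple pattern counter written for illustration, not the New York Times's code, and it stops short of full reinforcement learning.

```python
from collections import Counter, defaultdict

# For each move (key), the move that beats it (value)
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class PatternPlayer:
    def __init__(self):
        # transitions[previous move] counts what the human played next
        self.transitions = defaultdict(Counter)
        self.previous = None

    def choose(self):
        seen = self.transitions[self.previous]
        if not seen:
            return "rock"  # nothing learned yet for this situation
        expected = seen.most_common(1)[0][0]
        return BEATS[expected]

    def observe(self, human_move):
        self.transitions[self.previous][human_move] += 1
        self.previous = human_move

# A hypothetical, very predictable human who cycles through the moves
player = PatternPlayer()
cycle = ["rock", "paper", "scissors"]
wins = 0
for round_number in range(30):
    computer_move = player.choose()
    human_move = cycle[round_number % 3]
    if BEATS[human_move] == computer_move:
        wins += 1
    player.observe(human_move)
print(wins, "wins out of 30")
```

Against a genuinely random opponent this approach gains nothing, which is the point the interactive makes about human predictability.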
Perhaps one application of reinforcement learning may lie in factchecking: by calculating an optimal route it might be possible to establish the likelihood that the same person could be in two different places within a certain timespan. But this is a pretty specific use case.
Another is writing itself: reinforcement learning has been used by MSN to identify the most effective headlines (thanks to Nick Diakopoulos for that one).
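Headline testing is often framed as a "multi-armed bandit" problem, one of the simplest forms of reinforcement learning. The sketch below uses an epsilon-greedy strategy on simulated readers: it mostly shows the headline with the best click rate so far, but occasionally tries the others. The click rates are invented, and this is not a description of MSN's system.

```python
import random

def epsilon_greedy_headline_test(click_rates, rounds=5000, epsilon=0.1, seed=1):
    """Simulate showing headlines to readers, one reader per round.

    click_rates: the true (hidden) probability that each headline is
    clicked; the algorithm only ever sees the resulting clicks.
    """
    rng = random.Random(seed)
    shows = [0] * len(click_rates)
    clicks = [0] * len(click_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: try a headline at random
            choice = rng.randrange(len(click_rates))
        else:
            # Exploit: show the headline with the best observed rate
            choice = max(range(len(click_rates)),
                         key=lambda i: clicks[i] / shows[i])
        shows[choice] += 1
        if rng.random() < click_rates[choice]:  # simulated reader
            clicks[choice] += 1
    return shows, clicks

# Hypothetical true click rates for three candidate headlines
shows, clicks = epsilon_greedy_headline_test([0.05, 0.10, 0.40])
print(shows)
```

The reward here is a click, so the approach optimises for whatever gets clicked, which is not necessarily the most accurate or informative headline.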
Other examples suggested via email and Twitter include:
- How ProPublica’s Message Machine Reverse Engineers Political Microtargeting used decision trees — another example of clustering in machine learning: “Decision Trees are useful at finding hidden models in a particular data set, particularly because they produce a human-readable tree of partitions of the data.”
- BuzzFeed News Trained A Computer To Search For Hidden Spy Planes. This Is What We Found used a random forest algorithm — in an R markdown page, including working code, Peter Aldhous describes how the algorithm was trained.
- Flor Coelho rounds up a number of examples in this post.
- This post proposes a fourth category, self-supervised learning, along with a matrix (shown above).
These are just the few examples I can come up with — but if you know of others I will add them here and keep this as a growing list.