Data scientist David Robinson was behind one of the most striking data stories of this US election season, when his analysis of Donald Trump tweets appeared to confirm that Trump was posting the angriest comments on that account (jointly managed by his campaign staff). Barbara Maseda spoke to Robinson about the story behind that text analysis and what comes next.
It was August 9 when David Robinson published his analysis of Trump tweets on his blog. Robinson had used a series of libraries in the programming language R to collect, clean, process and visualise the data. The process took just 12 hours, from Monday night through Tuesday morning.
In the following days, the piece would be re-posted and cited by multiple websites, including The Washington Post and Mashable. The original piece alone had hundreds of thousands of views in just a few days.
The result wasn’t just one election story, but one of the biggest indications yet of the potential of text analysis for journalists, with three takeaways in particular:
- We have reached the point where it is technologically possible to use text analysis to find attention-grabbing stories;
- Such stories can interest both editors and audiences; and
- With the right set of skills, we can work fast and meet deadlines.
Half a million hits in the first week
Robinson says the article got around 30,000 hits the first day, and 500,000 hits over the following week, peaking on Thursday the 11th with 245,000 hits when a Reddit post made it to the front page.
“The strangest reaction I’ve gotten is a few people accusing me of being paid by the Clinton campaign,” says Robinson.
“Many other Trump supporters simply responded ‘Who cares whether he writes his own tweets,’ which is an entirely reasonable response.
“I found this an interesting insight into the mechanics of the campaign, but it is not really an anti-Trump article. This kind of science can’t really provide conclusions, only context.”
The Washington Post contacted Robinson on Wednesday, August 10th, about repurposing the article for the Washington Outlook section.
“An editor shortened the text with my approval,” he says. “And I provided spreadsheet data behind the graphs so that they could recreate them in the Post’s style. It was a straightforward process and I was very impressed with the editors.”
Another newspaper suggested that Robinson re-write the article for publication with them, but he declined due to a lack of time.
Following up on user comments
The comments on the original blog post about the analysis make interesting reading, and Robinson says he is often asked about a possible follow-up post.
He says two analyses might be particularly interesting:
- “It would be interesting to see how Hillary Clinton’s tweets differ from the campaign’s. She signs her own tweets, so it would be less of an ‘expose’, but I ran a quick analysis and saw it makes up about 5% of her tweets and many of them were responses on national tragedies. More could be done. (I’ve been communicating with someone who is doing his own analysis of this, and I’ve provided some advice.)
- “Many pointed out that the iPhone tweets that sounded like Trump were probably dictation, and it would be worth examining whether they tended to happen during weekday afternoons.”
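The weekday-afternoon question raised by commenters is a simple counting exercise once tweet timestamps are available. The snippet below is only a minimal sketch, written in Python rather than the R that Robinson used. The function name, the noon-to-6pm definition of "afternoon" and the sample timestamps are all assumptions made for illustration, and the timestamps are taken to be already converted to the relevant local time zone.

```python
from datetime import datetime


def weekday_afternoon_share(timestamps, start_hour=12, end_hour=18):
    """Return the fraction of timestamps that fall Monday to Friday
    between start_hour (inclusive) and end_hour (exclusive).

    The 12:00-18:00 window is an assumed definition of "afternoon".
    """
    if not timestamps:
        return 0.0
    hits = sum(
        1
        for ts in timestamps
        # weekday() returns 0 for Monday through 6 for Sunday
        if ts.weekday() < 5 and start_hour <= ts.hour < end_hour
    )
    return hits / len(timestamps)


# Made-up timestamps for illustration only; these are not real tweet data.
sample = [
    datetime(2016, 8, 8, 14, 30),   # Monday afternoon
    datetime(2016, 8, 9, 9, 5),     # Tuesday morning
    datetime(2016, 8, 13, 15, 0),   # Saturday afternoon
    datetime(2016, 8, 10, 17, 45),  # Wednesday afternoon
]

share = weekday_afternoon_share(sample)
print(f"Share posted on weekday afternoons: {share:.0%}")
```

Comparing this share between two groups of tweets would show whether one group clusters in office hours, although a real analysis would also need to account for how many tweets each group contains.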
Robinson says he has also received a number of suggestions by email along the general lines of “analyze the election with statistics”.
“A number of commenters suggested I should analyze newspaper articles to prove a bias in favour of Hillary Clinton. One email suggested I analyze Trump’s medical report to see if a doctor really wrote it, which I doubt is possible.
“However, as I state at the top of my post, I’m not a political blogger and I’d rather not be known for it. The fact that the comments (on my post, on the Washington Post article, and on Reddit) devolved into political arguments, and that people from both sides read this as an anti-Trump ‘hit piece’ rather than an interesting analysis, are reasons I’m unlikely to write a follow-up. I’d much rather focus on the science than the politics.”
Inspiring data analysis
Robinson says his favourite comments by far are ones like this, from a programmer who said: “I’ve never messed with R, but maybe I’ll get my hands dirty this weekend.”
“I’d much rather encourage people to develop data analysis skills than to get attention for my political conclusions. That’s why I included the code in the post, which is probably inaccessible to the vast majority of readers but could be helpful to data scientists and statisticians interested in the methodology.”
Helping others do text analysis
The open source software used to analyze Trump’s tweets is called tidytext, which Robinson co-authored with data scientist Julia Silge. “I care a lot about text mining in general,” he says, “especially developing tools that make the process easier for other data scientists.” The analysis of the Trump tweets alone used R libraries including dplyr, purrr, twitteR, tidyr, lubridate, scales, ggplot2 and tidytext.
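For readers who have not seen lexicon-based sentiment analysis before, the general idea is to split each text into words, look each word up in a list of words tagged with a sentiment, and count the matches per group. The snippet below is a minimal sketch of that idea in Python using only the standard library; it is not Robinson's R code and does not use tidytext. The four-word lexicon, the source labels and the example sentences are all invented for illustration, and a real analysis would rely on a published sentiment lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only.
LEXICON = {
    "bad": "negative",
    "terrible": "negative",
    "great": "positive",
    "thank": "positive",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_counts(rows):
    """Count lexicon matches per (source, sentiment) pair.

    rows is an iterable of (source, text) tuples.
    """
    counts = Counter()
    for source, text in rows:
        for word in tokenize(text):
            label = LEXICON.get(word)
            if label is not None:
                counts[(source, label)] += 1
    return counts


# Invented example texts; these are not real tweets.
rows = [
    ("source_a", "A bad, bad deal"),
    ("source_b", "Thank you, great crowd"),
]

counts = sentiment_counts(rows)
for (source, label), n in sorted(counts.items()):
    print(source, label, n)
```

Word counts like these say nothing about sarcasm or negation, which is one reason such results are better read as context than as conclusions.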
Now he is writing a book with Julia – Tidy Text Mining in R – that he says will include many examples of text analysis, including sentiment analysis.
“She also did an awesome sentiment analysis of her own tweets, though that was before we collaborated on tidytext.”
Julia has written about using the software for sentiment analysis on the works of Jane Austen. A previous blog post by Robinson also outlines using sentiment analysis on Yelp reviews.
As for Trump? Robinson couldn’t resist another experiment: “I’ve written a much more humorous (and stupid) application of Trump-related sentiment analysis, which is to annotate every sentence in Pride and Prejudice with a Trump-style declaration.”
A version of this interview was conducted as part of the author’s research project “Text data processing and analysis in data journalism. An exploration of tools, techniques and applications” at the MA in Online Journalism at Birmingham City University.