Dealing with live data and sentiment analysis: Q&A with The Guardian's Martyn Inglis

As part of the research for my book on online journalism, I interviewed Martyn Inglis about The Guardian’s Blairometer, which measured a live stream of data from Twitter as Tony Blair appeared before the Chilcot inquiry. I’m reproducing it in full here, with permission:

How did you prepare for dealing with live data and sentiment analysis?

I think it was important to be aware of our limitations. We can process a limited amount of data – due to Twitter quotas and so on. This is not a definitive sample. Once we accept that (a) we are not going to rank every tweet and (b) this is therefore going to be a limited exercise, it frees us to make concessions that provide an easier technology solution.

Sentiment analysis is hard programmatically; given the short time span of the event, we can do this manually. We had an interface view onto incoming tweets which we had pulled from a Twitter search. This allows us to be really accurate in our assessment. This does not work over a long period of time – the Chilcot inquiry is one thing, but you couldn’t do it for an event lasting a week or so.
Many users commented on the accuracy of our sentiment rankings. This is due to the fact that we had eyes on them.
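As a rough illustration of how manual rankings could feed a swingometer-style figure, here is a minimal Python sketch. It is not the Guardian's code: the `RankedTweet` type, the `swing` function, the three rating labels and the sample tweets are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RankedTweet:
    text: str
    rating: str  # "positive", "negative" or "neutral", chosen by a human reviewer


def swing(ranked):
    """Return a score from -1.0 (all negative) to 1.0 (all positive).

    Neutral tweets are ignored; with no positive or negative tweets the score is 0.0.
    """
    counts = Counter(tweet.rating for tweet in ranked)
    decided = counts["positive"] + counts["negative"]
    if decided == 0:
        return 0.0
    return (counts["positive"] - counts["negative"]) / decided


# Invented sample data, standing in for tweets a reviewer has already rated.
sample = [
    RankedTweet("He answered that well", "positive"),
    RankedTweet("Not convinced at all", "negative"),
    RankedTweet("Dodging the question again", "negative"),
    RankedTweet("Blair is giving evidence now", "neutral"),
]

print(swing(sample))
```

Because a person supplies each rating, the code only has to count; the hard judgment stays with the reviewer.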

When we tried doing it programmatically the results were really patchy. Given the grammatical limitations of a tweet – slang, brevity and so on – semantic engines have a hard job establishing the topics being referred to, never mind the sentiment attached to that topic.
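To show why automated scoring of tweets can be patchy, here is a deliberately naive word-list scorer in Python. It is an illustrative assumption, not the semantic engine the team tried: the word lists and the `naive_sentiment` function are invented for this sketch.

```python
import re

# Tiny hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "honest", "convincing"}
NEGATIVE = {"bad", "liar", "awful", "evasive"}


def naive_sentiment(text):
    """Count positive and negative words and compare the totals."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A plain statement is scored as a reader would expect.
print(naive_sentiment("Blair was convincing today"))

# Negation is missed: a human would read this as negative,
# but the scorer only sees the word "convincing".
print(naive_sentiment("not exactly convincing, was he"))
```

The second call returns "positive" even though a human reader would rate the tweet as negative, which is the kind of error that having eyes on each tweet avoids.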

Additionally, if your topic is Blair you can retrieve tweets about “Selma Blair” and so on. We were searching on the text “Blair”, not hashtags. This relevancy issue is also easily rectified by using manual rankings.
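The relevancy problem can be sketched in a few lines of Python. This is a hypothetical pre-filter that could sit in front of a human reviewer, not something the interview says was built: the `is_candidate` function and the exclusion list are assumptions for illustration.

```python
import re

# Match "blair" as a whole word, so "#blair" matches but "Blairgowrie" does not.
BLAIR = re.compile(r"\bblair\b", re.IGNORECASE)

# Hypothetical list of phrases known to be about someone else.
EXCLUDE = ("selma blair",)


def is_candidate(text):
    """Return True if the tweet mentions Blair and no excluded phrase."""
    lowered = text.lower()
    if not BLAIR.search(lowered):
        return False
    return not any(phrase in lowered for phrase in EXCLUDE)


print(is_candidate("Blair at the Chilcot inquiry now"))
print(is_candidate("Loved Selma Blair in that film"))
```

A filter like this is crude: it would also drop a tweet that mentions both Selma Blair and Tony Blair, and it cannot anticipate every unrelated Blair, which is why a manual check remains the more reliable fix.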

There are issues of self-censorship and editorial bias here. If I think it’s positive, is it really, or is it my interpretation? Swearing and so on is easily excluded, but the question also applies on a more subtle level. Bad language is not always excluded unless it is particularly offensive or in a context that is unacceptable.

In summary, I think the decision we made to manually process the live feed was crucial to its success. Without that the numbers would have skewed, and we would have been unable to publish tweets so easily on the site – you can see from the high-profile fail cases on ITN and cashgordon what happens there. It gave us the accuracy that gave it some strength as an exercise.

Also be prepared to alter what you are looking for as the debate continues. So the #blair hashtag may have to become #blairchilcott as the group defines the terms. It is still relevant, but you need to be flexible to stay with the relevant conversation.

Also, as we were doing it manually, it was important to keep processing up so that unnatural peaks and troughs are avoided.

Did you learn anything which changed how you did things the second time around? [With Gordon Brown’s appearance]

The second time round was identical to the first. The project was outside of our normal dev process. What we did was take an idea to editorial and implement it in down time on a bigger project; there was neither time nor budget to revisit it. So outside of a couple of minor bug fixes it was identical.

What advice would you give to journalists dealing with live data?

Keep flexible! With the Blairometer we changed and added functionality a lot throughout the day. Once it was running and other people could see it, we had requests for functionality that we added during the course of the day. We provided data for long-term graphs, moved things and added things as these requests came in.

What you start off doing – in this case the initial plan was for a swingometer – can be by the end a small part of the end result. What we ended with was a full Guardian page and the results going into the Saturday edition.

The results and their presentation may differ from expectations and when dealing with this live stream the story may change (in this case it didn’t) and drive changes to presentation.

One last point about this, which may be outside your remit but which is important regarding this project, is its timescales. The idea was Tuesday, development had a day and a half, and it was live for a matter of hours. It is built in response to a news event to provide extra colour and context, and then disposed of.

This is an exciting development for us, the notion that we can build and throw away software in response to events. Reacting to live events and their corresponding live feeds is one way to help plug the Guardian into the greater web ecosphere.
