Category Archives: data journalism

Coins Expenditure Database Published by Government – Open Data

(Cross-posted from the Wardman Wire.)

This looks like an excellent start. The Coalition Government has just published the COINS database, which is the detailed database of Government spending:

The release of COINS data is just the first step in the Government’s commitment to data transparency on Government spending.

You can get the database from the data.gov.uk website here. There are explanations to help you get to grips with it here. Continue reading

Dealing with live data and sentiment analysis: Q&A with The Guardian's Martyn Inglis

As part of the research for my book on online journalism, I interviewed Martyn Inglis about The Guardian’s Blairometer, which measured a live stream of data from Twitter as Tony Blair appeared before the Chilcot inquiry. I’m reproducing it in full here, with permission:

How did you prepare for dealing with live data and sentiment analysis?

I think it was important to be aware of our limitations. We can process a limited amount of data – due to Twitter quotas and so on. This is not a definitive sample. Once we accept that (a) we are not going to rank every tweet and (b) this is therefore going to be a limited exercise, it frees us to make concessions that provide an easier technology solution.

Sentiment analysis is hard to do programmatically, but given the short time span of the event we could do it manually. We had an interface view onto incoming tweets, which we had pulled from a Twitter search. This allows us to be really accurate in our assessment. It does not work over a long period of time – the Chilcot inquiry is one thing; you couldn’t do it for an event lasting a week or more. Continue reading
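Not The Guardian’s actual code, but a minimal sketch of the manual-classification workflow Martyn describes: tweets arrive from a search, a human rates each one, and a running tally drives the meter. The sample tweets and the classify_manually helper are invented for illustration.

```python
from collections import Counter

def classify_manually(tweets):
    """Show each incoming tweet to a human rater and keep a running tally."""
    tally = Counter()
    for tweet in tweets:
        print(tweet)
        verdict = input("Sentiment? [p]ositive / [n]egative / anything else = neutral: ")
        tally[{"p": "positive", "n": "negative"}.get(verdict, "neutral")] += 1
    return tally

# In the real workflow these would stream in from a Twitter search
sample = [
    "Blair looks rattled under questioning #chilcot",
    "Thought that was a perfectly fair answer #chilcot",
]
print(classify_manually(sample))
```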

UK general election 2010 – online journalism is ordinary

Has online journalism become ordinary? Are the approaches starting to standardise? Little has stood out in the online journalism coverage of this election – the innovation of previous years has been replaced by consolidation.

Here are a few observations on how the media approached their online coverage: Continue reading

Data journalism pt2: Interrogating data

This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.

UPDATE: It has now been published in The Online Journalism Handbook.

“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)

Once you have the data, you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals. Continue reading
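To make that concrete, here is a minimal sketch of the sort/filter/compare workflow in Python with pandas; the file name and column names are invented for the example.

```python
import pandas as pd

df = pd.read_csv("spending.csv")  # hypothetical dataset of payments

# Sort: which payments are the largest?
print(df.sort_values("amount", ascending=False).head(10))

# Filter: payments above a threshold to a single supplier
big = df[(df["amount"] > 100_000) & (df["supplier"] == "Acme Ltd")]
print(big)

# Compare: totals by department and year, side by side
print(df.groupby(["department", "year"])["amount"].sum().unstack())
```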

Data journalism pt1: Finding data (draft – comments invited)

The following is a draft from a book about online journalism that I’ve been working on. I’d really appreciate any additions or comments you can make – particularly around sources of data and legal considerations.

The first stage in data journalism is sourcing the data itself. Often you will be seeking out data based on a particular question or hypothesis (for a good guide to forming a journalistic hypothesis see Mark Hunter’s free ebook Story-Based Inquiry (2010)). On other occasions, it may be that the release or discovery of data itself kicks off your investigation.

A range of sources is available to the data journalist, both online and offline, public and hidden. Typical sources include:

Continue reading

Telegraph launches powerful election database

The Telegraph have finally launched – in beta – the election database I’ve been waiting for since the expenses scandal broke. And it’s rather lovely.

Starting with the obvious part (skip to the next section for the really interesting bit): the database allows you to search by postcode, candidate or constituency, or to navigate by zooming, moving and clicking on a political map of the UK.

Searches take you to a page on an individual candidate or a constituency. For the former you get a biography, details on their profession and education (for instance, private or state; Oxbridge, redbrick or neither), as well as email, website and Twitter page. Not only is there a link to their place in the Telegraph’s ‘Expenses Files’, but also a link to their allowances page on Parliament.uk. Continue reading

Interview: Nicolas Kayser-Bril, head of datajournalism at Owni.fr

Past OJB contributor Nicolas Kayser-Bril is now in charge of datajournalism at Owni.fr, a recently launched news site that defines itself as an “open think-tank”.

“Acting as curators, selecting and presenting content taken deep in the immense and self-expanding vaults of the internet,” explains Nicolas, “the Owni team links to the best and does the rest.”

I asked Nicolas two simple questions about his work at Owni. Here are his responses:

What are you trying to do?

What we do is datajournalism. We want to use the whole power of online and computer technologies to bring journalism to a new height, to a whole new playing field. The definition remains vague because so little has been done until now, but we don’t want to limit ourselves to slideshows, online TV or even database journalism.

Take the video game industry, for instance. In the late 1970s, a personal computer could be used to play Pong clones or text-based games. Since then, a number of genres have flourished: action games have moved to 3D, strategy games have built ever more intelligent AI, and so on. In the age of the social web, games were quick to use Facebook and even Twitter.

Take the news industry. In the late 1970s, you could read news articles on your terminal. In the early 2010s you can, well… read articles online! How innovative is that? (I’m not overlooking the innovations you’ll be quick to think of, but the fact remains that most online news content is articles.)

We want to enhance information with the power of computers and the web. Through software, databases, visualizations, social apps, games, whatever, we want to experiment with news in ways traditional and online media haven’t done yet.

What have you achieved?

We started to get serious about this in February, when I joined the mother company (22mars) full-time. In just a month, we have completed two projects.

The first one, dubbed Photoshop Busters (see it here), gives users digital forensics tools to assess the authenticity of an image. It was made as a widget for one of our partners, LesInrocks.com.

More importantly, we made a Facebook app, Where do I vote?, where users can find their own polling station, and their friends’, for the upcoming regional elections in France.

It might sound underwhelming, but it required finding and locating the addresses of more than 35,000 polling stations.

On top of convincing a reluctant administration to hand over its files, we set up a large crowdsourcing effort to convert the documents from badly scanned PDFs into computer-readable data. More than 7,000 addresses have been converted that way.
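Owni’s own code isn’t published here, but purely as an illustration of the scale of that geocoding job, this is roughly how one might batch-geocode a list of addresses in Python using the geopy library and the free Nominatim service (the file names are invented):

```python
import csv
import time

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="polling-station-demo")

with open("polling_stations.csv") as src, open("geocoded.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        address = row[0]
        location = geolocator.geocode(address)  # one network request per address
        if location:
            writer.writerow([address, location.latitude, location.longitude])
        time.sleep(1)  # Nominatim's usage policy asks for at most 1 request/second
```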

Dozens of other ideas are in the works. Within Owni.fr, we want to keep the ratio of developers to non-developers at 1:1, so as to be able to go from idea to product very quickly. I code most of my ideas myself, relying on the team for help, ideas and design.

In the coming months, we’ll expand our datajournalism activities to include another designer, a journalist and a statistician. Expect more cool stuff from Owni.fr.

How to make interactive geographical timelines using Google Calendar and Yahoo Pipes

I was recently given the task of creating a calendar holding around 50 events. Each event also needed to be mapped and have a corresponding blog post.

Mapping calendar entries made me wonder whether this could be used for more than simply putting events on a map – which is quite useful in its own right. I thought it would be cool if you could create an interactive map-timeline, controlled dynamically by a (shared) calendar.

Yahoo Pipes by default uses Yahoo Maps, which is great when it comes to narratives. As you can see from the map below (if you don’t see it, click here), each entry has a little arrow that lets you navigate from marker to marker in a specific order. Each marker also has a number indicating its place in the sequence. This is nothing more than entries in a Google Calendar with time/date stamps, geo info and a description, mapped automatically using Yahoo Pipes.

[Embedded Yahoo Pipes map badge: pipe ed13a198a2a83050dd4ace10d12eae16, fed by http://kaspersorensen.com/wp-content/uploads/files/icalyahoopipes.ics]

Here’s how you do it.

1. Create a Google Calendar

Simply go to your Google Calendar and create or import a new calendar. You can do this from the settings page under calendars.

2. Make it public

You need to make the calendar public, otherwise Yahoo Pipes won’t have access to it. You can do this while you create it, or afterwards by ticking the box ‘Make this calendar public’ in the sharing settings for your specific calendar. To access the settings for a specific calendar, click the little arrow in the box on the left-hand side that contains your calendars (My Calendars).

3. Create events

Now you simply start adding events to your calendar. Specify what happened, where and when it happened, and add a description. You don’t have to add the entries chronologically; they will be sorted by date/time automatically.

4. Feed the iCal file to the Pipe

Go to your calendar settings page – not the general Calendar settings, but the settings for your specific calendar. You will see a section called ‘Calendar address’ with three buttons. Click the green ICAL button and copy the link that pops up. Now go to the Mapping Google Calendar Events Pipe, paste it into the ‘Calendar iCal URL’ field and hit ‘Run Pipe’. Your events are now mapped.

5. Embed on your website

To embed the timeline/map on your website, simply select ‘Get as badge’ just above the map. This will allow you to insert it on your blog or website.

I’m sure there are ways to make this more stable. So if you know how to optimize the pipe, please feel free to do so and let me know.

As Google Maps is already a part of Google Calendar, you would think there was a nifty way to quickly put a whole calendar on a map – but no. After failing with what looked like a saviour, I bumped into a post by Tony Hirst on how to display Google Calendar events on a Google Map. Unfortunately, it turns out that the XML feed Tony uses only returns the 25 most recent calendar entries.

Google Calendar, however, also publishes its events as an iCal feed, which contains all events. With a little customization of Tony’s pipe, I managed to come up with a way to map every event in a calendar.
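For anyone curious what the customized pipe is doing, here is a rough Python equivalent, assuming you have the requests and icalendar packages installed: fetch the full iCal feed, pull out every VEVENT, and sort the results chronologically ready for mapping.

```python
import requests
from icalendar import Calendar

ICAL_URL = "http://kaspersorensen.com/wp-content/uploads/files/icalyahoopipes.ics"

cal = Calendar.from_ical(requests.get(ICAL_URL).content)

events = []
for event in cal.walk("VEVENT"):
    events.append({
        "title": str(event.get("SUMMARY")),
        "location": str(event.get("LOCATION")),  # what the pipe geocodes
        "start": event.get("DTSTART").dt,
        "description": str(event.get("DESCRIPTION")),
    })

# The pipe sorts entries by date/time before numbering the markers
for item in sorted(events, key=lambda e: str(e["start"])):
    print(item["start"], "-", item["title"], "@", item["location"])
```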

I think this could be potentially useful for developing stories, especially if you can collaborate on the calendar. You end up with data that can be used for nearly anything, not just maps. And if locations aren’t relevant for the story, you could simply take your iCal file and make a normal timeline.

Data and the future of journalism panel discussion: Linked Data London

Tonight I had the pleasure of chairing an extremely informative panel discussion on data and the future of journalism at the first London Linked Data Meetup. On the panel were Martin Belam, Dan Brickley, Leigh Dodds and John O’Donovan.

What follows is a series of notes from the discussion, which I hope are of some use.

For a primer on Linked Data there is A Skim-Read Introduction to Linked Data; Linked Data: The Story So Far (PDF) by Tom Heath, Christian Bizer and Tim Berners-Lee; and this TED video by Sir Tim Berners-Lee (who was on the panel before this one).

To set some brief context, I talked about how 2009 was, for me, a key year for data and journalism – largely because it has been a year of crisis in both publishing and government. The seminal point has been the MPs’ expenses story, which both demonstrated the power of data in journalism and highlighted the need for transparency from government – see, for example, the government’s appointment of Sir Tim Berners-Lee, the call for developers to suggest things to do with public data, and the imminent launch of Data.gov.uk around the same issue.

Even before then, the New York Times and the Guardian had both launched APIs at the beginning of the year, MSN Local and the BBC have both been working with Wikipedia, and we’ve seen the launch of a number of startups and mashups around data, including Timetric, Verifiable, BeVocal, OpenlyLocal, MashTheState, the open source release of Everyblock, and Mapumental.

Q: What are the implications of paywalls for Linked Data?

The general view was that Linked Data – specifically standards like RDF – would allow users and organisations to access information about content even if they couldn’t access the content itself. To give a concrete example: rather than linking to a ‘wall’ that simply demands payment, it would be clear what the content beyond that wall related to (key people, organisations, author and so on).

Leigh Dodds felt that using standards like RDF would allow organisations to more effectively package content in commercially attractive ways, e.g. ‘everything about this organisation’.
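As a toy illustration of both points: a few RDF triples (built here with Python’s rdflib library; the names and URIs are invented) can describe an article’s author, subject and the organisations it mentions without exposing the article body itself.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

EX = Namespace("http://example.org/")
article = URIRef("http://example.org/articles/123")

g = Graph()
g.add((article, DC.title, Literal("Bank bailout costs revealed")))
g.add((article, DC.creator, Literal("A. Reporter")))
g.add((article, DC.subject, EX.PublicSpending))
g.add((article, EX.mentions, URIRef("http://dbpedia.org/resource/HM_Treasury")))

# Metadata about the (paywalled) article, readable by anyone
print(g.serialize(format="turtle"))
```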

Q: What can bloggers do to tap into the potential of Linked Data?

This drew some blank responses, but Leigh Dodds was most forthright, arguing that the onus lay with developers to do things that would make it easier for bloggers to, for example, visualise data. He also pointed out that currently, if someone does something with data, it is not possible to trace that back to the source; better tools would allow, in effect, an equivalent of pingback for data included in charts (the person who created the data would know it had been used, as would others).

Q: Given that the problem for publishing lies in advertising rather than content, how can Linked Data help solve that?

Dan Brickley suggested that OAuth technologies (where you use a single login identity across multiple sites, one that carries information about your social connections, rather than creating a new ‘identity’ for each site) would give users finer control over how they experience content, for instance: ‘I only want to see article comments by users who are also my Facebook and Twitter friends.’

The same technology would allow for more personalised, and therefore more lucrative, advertising.

John O’Donovan felt the same could be said about content itself – more accurate data about content would allow for more specific selling of advertising.

Martin Belam quoted James Cridland on radio: “[The different operators] agree on technology but compete on content”. The same was true of advertising but the advertising and news industries needed to be more active in defining common standards.

Leigh Dodds pointed out that semantic data was already being used by companies serving advertising.

Other notes

I asked members of the audience who they felt were the heroes and villains of Linked Data in the news industry. The Guardian and BBC came out well – The Daily Mail were named as repeat offenders who would simply refer to “a study” and not say which, nor link to it.

Martin Belam pointed out that The Guardian is increasingly asking itself ‘How will that look through an API?’ when producing content, representing a key shift in editorial thinking. If users of the platform are swallowing up significant bandwidth or driving significant traffic, that would probably warrant talking to them about a more formal relationship (either customer-provider or partnership).

A number of references were made to the problem of provenance – being able to identify where a statement came from. Dan Brickley specifically spoke of the problem with identifying the source of Twitter retweets.

Dan also felt that the problem of journalists not linking would be solved by technology. In conversation previously, he also talked of “subject-based linking” and the impact of SKOS and linked data style identifiers. He saw a problem in that, while new articles might link to older reports on the same issue, older reports were not updated with links to the new updates. Tagging individual articles was problematic in that you then had the equivalent of an overflowing inbox.

(I’ve invited all four participants to correct any errors and add anything I’ve missed.)

Finally, here’s a bit of video from the very last question addressed in the discussion (filmed with thanks by @countculture):

Linked Data London 090909 from Paul Bradshaw on Vimeo.

Data and the future of journalism: what questions should I ask?

Tomorrow I’m chairing a discussion panel on the future of journalism at the first London Linked Data Meetup. On the panel are Martin Belam, Dan Brickley, Leigh Dodds and John O’Donovan.

What questions would you like me to ask them about data and the future of journalism?