Category Archives: data journalism

The Guardian kicks off the local data landgrab

Tonight I’ve been speaking at a Guardian-sponsored event in Birmingham: a special meetup of the Birmingham Social Media Cafe, doubling as a build-up of sorts to a Hack Day.

And I think it’s a very significant event indeed.

For years I’ve lectured newspaper execs on the value of data and why they needed to get their APIs in order.

Now The Guardian is about to prove just why it is so important, and in the process take first-mover advantage in an area the regionals – and maybe even the BBC – assumed was theirs.

This shouldn’t be a surprise to anyone: The Guardian has long led the way in the UK on database journalism, particularly with its Data Blog and this year’s Open Platform. But this initial move into regional data journalism is a wise one indeed: data becomes more relevant the more personal it is, and local data just tends to be more personal.

Reaching out to those with access to that data, and the ability and knowledge to pick through it, makes perfect sense. But it also means treading on regional toes, and it will be interesting to see how (and indeed if) regional newspapers and broadcasters react.

Cobbling together some sort of regional API would be a welcome start – but it is not going to be enough on its own: The Guardian have spent years building a reputation in technology circles for their understanding of the web. As The Guardian’s Michael Brunton-Spall pointed out tonight, theirs is the only newspaper to offer ‘full fat’ RSS feeds that let you read full articles in an RSS reader – not to mention customisable URLs that allow you to build your own feeds from combinations of tags, authors and categories. And Open Platform is one of the most, well – open news platforms in the world.
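
To make that concrete, here’s a minimal sketch of the kind of thing those customisable feed URLs make possible – the domain and the tag-combination syntax here are invented for illustration, so check a paper’s own documentation for the real scheme:

```python
# A hypothetical sketch, not the Guardian's actual URL scheme: compose a
# feed address from a combination of tags, then read the items with the
# standard library. Domain and tag syntax are invented.
import urllib.request
import xml.etree.ElementTree as ET

def feed_url(site, *tags):
    # Assumed convention: site/tag1+tag2/rss - check the paper's own docs.
    return "http://{}/{}/rss".format(site, "+".join(tags))

url = feed_url("www.example-news.co.uk", "technology", "media")

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# A 'full fat' feed carries the whole article body in each item,
# not just a teaser paragraph.
for item in tree.iter("item"):
    print(item.findtext("title"))
```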

So if other news operations want to compete in this arena, they’ll need to make cultural efforts, not just technical ones.

There are few people in those organisations who truly understand why they should want to compete. They may see it in the context of the mutterings about a move by Guardian Media Group (GMG) into hyperlocal media, but that could be a different kettle of fish entirely (a red herring of sorts if you want to mix metaphors).

These early moves on the data side are about more than the prospect of launching competing web publications. They position the Guardian (rather than GMG) to provide a platform for a bottom-up network of hyperlocal sites – to become, in short, a Press Association for the 21st century, catering for a grassroots journalism movement filling ever-increasing holes in the regional news map. That means not just feeding national and international news to local and specialist websites, but pulling data the other way (although that doesn’t mean there isn’t scope to meet GMG’s hyperlocal plans in the middle). There is competition here from MSN Local and Reuters’ Open Calais, but I’ve not seen evidence of the same cultural efforts from that direction.

It’s very early days, but things move fast in this sphere. A cry is being taken up that all news organisations need to heed: “Raw data now!”

Add context to news online with a wiki feature

In journalism school you’re told to find the way that best relates a story to your readers. Make it easy to read and understand. But don’t just give the plain facts: find the context of the story, too, to help the reader fully understand what has happened and what it means.

What better way to do that than having a Wikipedia-like feature on your newspaper’s web site? Since the web is the greatest causer of serendipity, says Telegraph Communities Editor Shane Richmond, reading a story online will often send a reader elsewhere in search of more context wherever they can find it.

Why can’t that search start and end on your web site?

What happens today

Rather than explain this in the abstract, I’ll walk through a scenario:

While scanning the news on your newspaper’s web site, one story catches your eye. You click through and begin to read. It’s about a new shop opening downtown.

As you read, you begin to remember things about what once stood where the new shop now is. You’re half-way through the story and decide you need to know what was there, so you turn to your search engine of choice and begin hunting for clues.

By now you’ve closed the window containing the story you were reading and are looking for context instead. You don’t return to the newspaper’s web site: by the time you find the information you were looking for, you’ve landed on a different story on a different news web site.

Here’s what the newspaper loses in that scenario: site stickiness, page views, uniques (the reader could have forwarded the story to a friend), and reader interaction through potential story comments. Monetarily, all of this translates into lower ad rates. That’s where it hurts the most.

How it could be

Now here’s how it could be if a newspaper web site had a wiki-like feature:

The story about the new shop opening downtown intrigues you because, if memory serves, something else used to be there years ago. On the story there’s a link to another page (additional page views!) that shows all of the information about that site that is available in public records.

You find the approximate year you’re looking for, click on it, and you see that before the new shop appeared downtown, many years ago it was a restaurant you visited as a child.

It was owned by a friend of your father’s and it opened when you were six years old. Since you’re still on the newspaper web site (better site stickiness!), you decide to leave a comment on the story about what was once there and why it was relevant to you (reader interaction!). Then you remember that a friend often went there with you, so you email it to them (more uniques!) to see if they too will remember.
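
For the technically curious, here’s a minimal, hypothetical sketch of the data shape such a feature implies – a place record carrying a wiki-style timeline built from public records. All the names and records here are invented:

```python
# A minimal, hypothetical sketch: each story links to a place, and each
# place carries a wiki-style timeline assembled from public records.
from dataclasses import dataclass, field

@dataclass
class PlaceHistory:
    address: str
    timeline: dict = field(default_factory=dict)  # year -> what stood there

    def add_record(self, year, description):
        self.timeline[year] = description

place = PlaceHistory("12 High Street")
place.add_record(1978, "Family restaurant (from licensing records)")
place.add_record(2009, "New shop opens (this week's story)")

# The story page links here, so the reader browses the timeline
# instead of leaving for a search engine:
for year in sorted(place.timeline):
    print(year, "-", place.timeline[year])
```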

Why it matters to readers

For consumers, news is the pursuit of truth and context, and both the news organization and the journalists it employs are obligated to provide them. The hardest part of this is sifting through public records and putting them online.

Crowd-sourcing that work, much as Wikipedia does, could work out well. But even just putting public records online in a way that makes them contextually relevant would be a big step forward. It’s time-consuming, but the rewards are great.

Newspapers on Twitter – how the Guardian, FT and Times are winning

National newspapers have a total of 1,068,898 followers across their 120 official Twitter accounts – with the Guardian, Times and FT the only papers represented in the top 10. That’s according to a massive count of newspapers’ Twitter accounts I’ve done (there’s a table of all 120 at that link).

The Guardian’s the clear winner, as its place on the Twitter Suggested User List means that its @GuardianTech account has 831,935 followers – 78% of the total …

@GuardianNews is 2nd with 25,992 followers, @TimesFashion is 3rd with 24,762 and @FinancialTimes 4th with 19,923.

Screenshot of the data

Other findings

  • Glorified RSS. Out of 120 accounts, just 16 do something other than run as a glorified RSS feed. The other 104 do no retweeting, no replying to other tweets and so on (you can see which are which on the full table).
  • No following. These newspaper accounts don’t do much following, either. Leaving GuardianTech out of it, the accounts have 236,963 followers between them but follow just 59,797. They’re mostly pumping RSS feeds straight to Twitter and see no reason to engage with the community.
  • Rapid drop-off. Only six accounts have more than 10,000 followers. I suspect many of these accounts are invisible to most people because the newspapers aren’t engaging: with no retweeting of other people’s tweets, those people have no obvious way to discover that the newspaper accounts exist.
  • Sun and Mirror are laggards. The Sun and Mirror have work to do – they haven’t shown much talent at this so far and have few accounts with any followers. The Mail only seems to have one account, but it is the 20th largest in terms of followers.

The full spreadsheet of data is here (and I’ll keep it up to date with any accounts the papers forgot to mention on their own sites)… It’s based on official Twitter accounts – not individual journalists’. I’ve rounded up some other Twitter statistics if you’re interested.
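
If you want to reproduce the sums, here’s a minimal sketch assuming the spreadsheet is exported as accounts.csv with hypothetical ‘account’, ‘followers’ and ‘following’ columns:

```python
# A minimal sketch of the arithmetic behind these findings; the CSV file
# name and column headers are assumptions, not the published spreadsheet.
import csv

with open("accounts.csv", newline="") as f:
    rows = list(csv.DictReader(f))

total = sum(int(r["followers"]) for r in rows)
top = max(rows, key=lambda r: int(r["followers"]))
print("{} accounts, {:,} followers in total".format(len(rows), total))
print("Biggest account: {} with {:,} followers ({:.0%} of the total)".format(
    top["account"], int(top["followers"]), int(top["followers"]) / total))

# Follow-back behaviour, excluding the outlier account:
rest = [r for r in rows if r["account"] != top["account"]]
followers = sum(int(r["followers"]) for r in rest)
following = sum(int(r["following"]) for r in rest)
print("Excluding it: {:,} followers vs just {:,} followed".format(
    followers, following))
```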

ABCe: please sort out your terrible website (again)

In March, I appealed to the Audit Bureau of Circulations to sort out its terrible ABCe website. It’s had a redesign. Here’s a list of its latest problems (originally published here).

If at any point the ABC wants to pay me a consultancy fee for all this free advice, just leave me a comment to tell me how to receive my money …

All the URLs have changed but there are no redirects

New ABCe homepage in Google

They’ve had a redesign, but they haven’t redirected the old URLs to the new ones. So, for instance, if you click the second link shown in Google for a search on ABCe, you get a ‘page not found’ error.

Lesson When relaunching a website, always 301 redirect your old pages to new ones (even if they’re all just to your new home page). That way, external links still work and you keep the SEO benefit of any links.
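
Here’s a minimal sketch of how you might audit this after a relaunch – request each old URL without following redirects and check for a 301 (standard library only; the URL is hypothetical):

```python
# A minimal sketch: fetch each old URL without following redirects and
# report whether it answers with a 301 and a Location header.
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib following redirects so we can see the raw status code."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

def check(url):
    try:
        response = opener.open(url)
        return response.getcode(), response.headers.get("Location")
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")

# Hypothetical old URLs that should now 301 to their new homes:
for old_url in ["http://www.example.org.uk/old-results-page.asp"]:
    status, location = check(old_url)
    print(old_url, "->", status, location or "(no Location header)")
```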

They haven’t sorted www vs non-www

The more observant will have noticed that the title of the first result in that screenshot says ‘To access IIS Help’. The ABC hasn’t realised that abce.org.uk is not the same URL as www.abce.org.uk. And if you go to the ABC URLs without the www, you get ‘page not found’ or server errors.


Lesson When you set up your website, redirect yourdomain.co.uk/whatever to www.yourdomain.co.uk/whatever. And log in to your Google Webmaster Tools account to set your preferred domain (www or non-www).
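
As an illustration of the server-side fix, here’s a minimal sketch written as framework-free WSGI middleware (the canonical hostname is hypothetical; on IIS or Apache you’d use the server’s own redirect rules instead):

```python
# A minimal sketch: any request arriving on a non-canonical host gets a
# 301 to the www version of the same path. Hostname is hypothetical.
CANONICAL_HOST = "www.example.org.uk"

def canonical_host(app):
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        if host and host != CANONICAL_HOST:
            path = environ.get("PATH_INFO", "/")
            query = environ.get("QUERY_STRING", "")
            location = "http://" + CANONICAL_HOST + path
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware
```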

They’re running two absolutely identical websites

ABCs new homepage. No, it's ABCe's. No, it's aaaaggghhh

You can access the entire website at www.abc.org.uk – or you can see an identical website at www.abce.org.uk.

Telegraph plans to expand MPs database site in build-up to election (Q&A)

I asked Tim Rowell, Digital Publisher at Telegraph.co.uk, three questions about how they dealt with the MPs’ expenses story online. The main headline: the new domain hosting the expenses database – parliament.telegraph.co.uk – will expand in the run-up to the next election, along with the MPs’ expenses database itself.

There are also curious “legal reasons” given for disabling the embed/email option on the PDFs. I’m pushing on that because I don’t see how publication on your site is different from allowing someone to embed it on their own, or email it. If you have any insight on that, let me know. [See response below]

Here are the responses in full:

When the team was going through the expenses and reporting, how was this longer term online strategy incorporated?

From day one, it was agreed that we would work towards the publication of an online database that contained not only the files themselves but also an aggregation of publicly available data (Parliament Parser, They Work for You, Register of Members Interests etc.) with our own unique data analysis.

The publication by Parliament last week of the redacted files has provided a glimpse into the scale of operation required to analyse such a volume of documentation but one has to realise that the full files contain many, many more pages.

The launch yesterday of the database is the first phase. We will, in due course, publish the full uncensored files for all 646 MPs. Crucially, the expenses investigative team of reporters spent a week aggregating and processing the data (the unique 2007/8 analysis of the Additional Costs Allowance) themselves. Integration in action again! The end result of that work is the first accurate breakdown of those ACA figures. We soon realised that this data provided a great basis upon which to build the Complete Expenses Files supplement in last Saturday’s newspaper.

Why Issuu? And why is the ‘email/embed’ option disabled for “secret documents”?

‘Secret documents’ is not our term; it is Issuu’s. We think Issuu is a great product and that it provides a fantastic user experience, and we have plans to use it more extensively. But for legal reasons we need to be sure that the document cannot be downloaded. By disabling the download function, Issuu automatically restricts email/embed.

[Further to that:] How is publication on your site different from allowing someone to embed it on their own, or emailing it?

It is a precautionary measure. In the unlikely event that one of the source documents puts at risk the identity of a supplier or the full postcode of an MP, we need to be confident that a) we can amend that file immediately and b) the file has not been distributed more widely. For that reason, we do not want the files to be downloadable. We’d be very happy for others to embed the files in their pages, but if you restrict the download option in Issuu you restrict the ability to embed.

Am I right in thinking the pages on each MP are static and so indexable by search engines, even though they’re generated from a database?

Yes. You may also notice that it is on a new domain, parliament.telegraph.co.uk. We will be enhancing our political resources over the coming months as we build up to the General Election. This application is not just for the expenses files; we have plans to develop this area into a full service that enables our users to engage more closely with the democratic process.
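
As an aside: ‘static pages generated from a database’ can be as simple as rendering one HTML file per MP at publish time, so crawlers index ordinary pages. Here’s a minimal, hypothetical sketch – the schema and template are invented, not the Telegraph’s actual implementation:

```python
# A minimal, hypothetical sketch of static generation from a database:
# one plain HTML file per MP, written out at publish time.
import sqlite3
from pathlib import Path

TEMPLATE = (
    "<html><head><title>{name} - expenses</title></head>"
    "<body><h1>{name}</h1><p>Total claimed: £{total}</p></body></html>"
)

conn = sqlite3.connect("expenses.db")  # hypothetical database
out_dir = Path("site/mps")
out_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical schema: one row per MP with a pre-computed URL slug.
for name, slug, total in conn.execute("SELECT name, slug, total FROM mps"):
    page = TEMPLATE.format(name=name, total=total)
    (out_dir / (slug + ".html")).write_text(page, encoding="utf-8")
```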

MPs expenses data: now it’s The Telegraph’s turn

The Telegraph have finally published their MPs’ expenses data online – and it’s worth the wait. Here are some initial thoughts and reactions:

  • Firstly, they’ve made user behaviour an editorial feature. In plain English: they’re showing the most searched-for MPs and constituencies, which is not only potentially interesting in itself, but also makes it easier for the majority of users who are making those searches (i.e. they can access it with a click rather than by typing)
  • There’s also a table for most expensive MPs. As this is going to remain static, it would be good to see a dedicated page with more information – in the same way the paper did in its weekend supplement.
  • The results page for a particular MP has a search engine-friendly URL. Very often, database-generated pages have poor search engine optimisation, partly because the URLs are full of digits and symbols, and partly because they are dynamically generated. This appears to avoid both problems – the URL for the second home allowance of Khalid Mahmood MP, for example, is http://parliament.telegraph.co.uk/mpsexpenses/second-home/Khalid-Mahmood/mp-11087 (see the sketch after this list)
  • The uncensored expenses files themselves are embedded using Issuu. This seems a strange choice as it doesn’t allow users to tag or comment – and the email/embed option is disabled for “secret documents”
  • There’s some nice subtle animation on the second home part of expenses, and clear visualisation on other parts.
  • The MP Details page is intelligently related both to the Telegraph site (related articles) and the wider web, with the facility to easily email that MP, go to their Wikipedia entry, and ‘bookmark’.
  • Joy of joys, you can also download the MPs expenses spreadsheet from here (on Google Docs) – although this is for all MPs rather than the one being viewed. Curiously, while viewing you can see who else is viewing and even (as I did) attempt to chat (no, they didn’t chat back).
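
As promised above, here’s a sketch of how a database-driven page can still carry a readable, search-friendly URL: turn the MP’s name into a hyphenated slug and append a stable ID. The route shape mirrors the Telegraph URL in the list, but the code is purely illustrative:

```python
# An illustrative sketch, not the Telegraph's implementation: build a
# readable URL from a name slug plus a stable database id.
import re

def slugify(name):
    """Replace runs of non-alphanumeric characters with single hyphens."""
    return re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-")

def expenses_url(category, mp_name, mp_id):
    return "http://parliament.telegraph.co.uk/mpsexpenses/{}/{}/mp-{}".format(
        category, slugify(mp_name), mp_id)

print(expenses_url("second-home", "Khalid Mahmood", 11087))
# -> http://parliament.telegraph.co.uk/mpsexpenses/second-home/Khalid-Mahmood/mp-11087
```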

I’ll most likely update this post later as I get some details from behind the curtain.

There are also some broader thoughts on the online treatment of expenses which I’ll try to blog at another point.

USA Today’s awesome jobs forecast interactive

 

USA Today interactive - click for larger image

Here’s a hugely rich interactive from USA Today which does a number of things very well.

Firstly, it’s an intelligent use of resources: the recession is likely to last for some time and to be the biggest ongoing story of our time. With everyone talking about it, you need something with that ‘wow’ factor – something that will not only attract a great deal of attention now, but also generate a long tail of repeat visits.

Secondly, it’s personalised – not only can you get information on jobs growth in your state, but your particular industry in your state.

Thirdly, it’s dynamic – the graphic promises to be updated each month “with revised data from Moody’s Economy.com.”

There’s one major element missing – interaction. Find a way to capture users’ experiences (value) and you have an extra dimension that really capitalises on all the attention your interactive is getting.

Still, I’m not complaining…

Adding value to the archives: Suburbified.com mashes up NYT real estate articles

Want to know the value of opening up your article databases and APIs? Suburbified is one of the first mashups created using the New York Times’ recently opened API.

suburbified

Here’s what it does, according to KillerStartUps: …

Sport and data – now it’s more than just ‘interactive’

I’ve written previously on the Online Journalism Blog about ‘Why fantasy football may hold the key to the future of news‘. Now it seems The Guardian has taken things up a notch with the wonderful Chalkboard feature: an interactive database-driven toolkit that allows you to create your own ‘chalkboards’ illustrating whatever point you may wish to make about a team or player’s performance. Here’s my first attempt below:

Cute, yes? But more than just cute. This is an idea that takes sports data and makes it more than just ‘interactive’. This makes it communicative.

Because you are not just toying with data but creating it to make a point. Once you create a chalkboard it is published to everyone, with space for comments. You can send it, share it or embed it – as I have.

Clearly there are improvements to be made – starting with searchability/findability from the chalkboard/team pages – and the odd bug (the description I entered was not visible on the test above, and limiting it to the final 15 minutes does not seem to have worked – you still see all passes).

But really that would be picking holes in what is a beautifully thought-through piece of work – a piece of work that understands if you’re to make news work online it has to be as much a platform as a destination (a platform which in turn opens up plenty of opportunities for monetisation).

The site claims match stats will be available 15 minutes after the full-time whistle. Suddenly the calls to local radio to bemoan the manager’s tactics seem one-dimensional. And spending 60 seconds reading the match report is nothing compared to the time that will be spent carefully constructing your argument as to why your star midfielder should not have been sold to that close relegation rival…

Thanks to Alex Lockwood for the tip-off.

The future of investigative journalism: databases and algorithms

There’s a great article over at Miller-McCune on investigative journalism and what you might variously call computer-assisted reporting or database journalism. It’s worth reading in full, but the really interesting stuff comes further in, and I’ve quoted it in full below:

“Bill Allison, a senior fellow at the Sunlight Foundation and a veteran investigative reporter and editor, summarizes the nonprofit’s aim as “one-click” government transparency, to be achieved by funding online technology that does some of what investigative reporters always have done: gather records and cross-check them against one another, in hopes of finding signs or patterns of problems.

“… Before he came to the Sunlight Foundation, Allison says, the notion that computer algorithms could do a significant part of what investigative reporters have always done seemed “far-fetched.” But there’s nothing far-fetched about the use of data-mining techniques in the pursuit of patterns. Law firms already use data “chewers” to parse the thousands of pages of information they get in the discovery phase of legal actions, Allison notes, looking for key phrases and terms and sorting the probative wheat from the chaff and, in the process, “learning” to be smarter in their further searches.

“Now, in the post-Google Age, Allison sees the possibility that computer algorithms can sort through the huge amounts of databased information available on the Internet, providing public interest reporters with sets of potential story leads they otherwise might never have found. The programs could only enhance, not replace, the reporter, who would still have to cultivate the human sources and provide the context and verification needed for quality journalism. But the data-mining programs could make the reporters more efficient — and, perhaps, a less appealing target for media company bean counters looking for someone to lay off. “I think that this is much more a tool to inform reporters,” Allison says, “so they can do their jobs better.”

“… After he fills the endowed chair for the Knight Professor of the Practice of Journalism and Public Policy Studies, [James] Hamilton hopes the new professor can help him grow an academic field that provides generations of new tools for the investigative journalist and public interest-minded citizen. The investigative algorithms could be based in part on a sort of reverse engineering, taking advantage of experience with previous investigative stories and corruption cases and looking for combinations of data that have, in the past, been connected to politicians or institutions that were incompetent or venal. “The whole idea is that we would be doing research and development in a scalable, open-source way,” he says. “We would try to promote tools that journalists and others could use.”
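
To give a flavour of what Allison describes, here’s a minimal, hypothetical sketch of that kind of cross-checking: load two sets of public records and flag names that appear in both. File names, column headers and the naive exact-match rule are all invented – real record linkage needs fuzzier matching and human verification:

```python
# A minimal, hypothetical sketch of algorithmic cross-checking: join two
# public record sets on a name and surface the overlaps as story leads.
import csv

def load(path, key):
    """Index a CSV of records by a (lowercased, trimmed) name column."""
    with open(path, newline="") as f:
        return {row[key].strip().lower(): row for row in csv.DictReader(f)}

donors = load("donations.csv", "donor_name")        # hypothetical records
suppliers = load("contracts.csv", "supplier_name")  # hypothetical records

# Overlaps are leads for a reporter to verify, not conclusions:
for name in sorted(set(donors) & set(suppliers)):
    print("LEAD: {} donated {} and won contract {}".format(
        name, donors[name]["amount"], suppliers[name]["contract_id"]))
```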

Hat tip to Nick Booth