The following is an unedited version of an article written for the International Press Institute report ‘Brave News Worlds’.
For the past two centuries journalists have dealt in the currency of information: we transmuted base metals into narrative gold. But information is changing.
At first, the base metals were eyewitness accounts and interviews. Later we learned to melt down official reports, research papers, and balance sheets. And most recently our alloys have been diluted by statements and press releases.
But now journalists are having to get to grips with a new type of information: data. And this is a very rich seam indeed.
Data: what, how and why
Data is a broad term, so I should define it: I am not talking here about statistics or numbers in general, because those are nothing new to journalists. When I talk about data I mean information that can be processed by computers.
This is a crucial distinction: it is one thing for a journalist to look at a balance sheet on paper; it is quite another to be able to dig through those figures on a spreadsheet, or to write a programming script to analyse that data, and match it to other sources of information. We can also more easily analyse new types of data, such as live data, large amounts of text, user behaviour patterns, and network connections.
And that, for me, is hugely important. Indeed, it is potentially transformational. Adding computer processing power to our journalistic arsenal allows us to do more, faster, more accurately, and with others. All of which opens up new opportunities – and new dangers. Things are going to change.
We’ve had over 40 years to see this coming. The growth of the spreadsheet and the database from the 1960s onwards kicked things off by making it much easier for organisations – including governments – to digitise information, from what they spent our money on to how many people were being treated for which diseases, and where.
In the 1990s the invention of the world wide web accelerated the growth of the data at journalists’ disposal by providing a platform for those spreadsheets and databases to be published and accessed by both humans and computer programs – and a network to distribute them.
And now two cultural movements have combined to add a political dimension to the spread of data: the open data movement, and the linked data movement. Journalists should be familiar with these movements: the arguments that they have developed in holding power to account are a lesson in dealing with entrenched interests, while their experiments with the possibilities of data journalism show the way forward.
The open data movement campaigns for important information – such as government spending, scientific information and maps – to be made publicly available for the benefit of society, both democratically and economically. The linked data movement (championed by the inventor of the web, Sir Tim Berners-Lee) campaigns for that data to be made available in such a way that it can be linked to other sets of data so that, for instance, a computer can see that the director of a company named in a particular government contract is the same person who was paid as a consultant on a related government policy document. Advocates argue that this will also result in economic and social benefits.
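To make the linking idea concrete, here is a minimal sketch in Python. Every record, field name and example.org identifier in it is invented for illustration and comes from no real dataset. Real linked data would normally be published as RDF with stable URIs rather than as plain dictionaries, but the principle is the same: two datasets refer to the same person by the same identifier, so a program can join them without guessing from names.

```python
# Hypothetical records: every name, title and URI below is invented.
contracts = [
    {"contract": "Road maintenance 2010", "company": "Acme Roads Ltd",
     "director": "http://example.org/person/1001"},
    {"contract": "School meals 2010", "company": "Fresh Plates Ltd",
     "director": "http://example.org/person/2002"},
]

consultancy_payments = [
    {"document": "Transport policy review",
     "consultant": "http://example.org/person/1001"},
    {"document": "Energy white paper",
     "consultant": "http://example.org/person/3003"},
]


def find_overlaps(contracts, payments):
    """Return (contract, document) pairs that share the same person URI."""
    # Index the policy documents by the identifier of the paid consultant.
    docs_by_person = {}
    for payment in payments:
        docs_by_person.setdefault(payment["consultant"], []).append(
            payment["document"])

    # Look up each company director in that index.
    overlaps = []
    for contract in contracts:
        for document in docs_by_person.get(contract["director"], []):
            overlaps.append((contract["contract"], document))
    return overlaps


overlaps = find_overlaps(contracts, consultancy_payments)
for contract, document in overlaps:
    print(f"Director on '{contract}' was also paid as a consultant "
          f"on '{document}'")
```

The join only works because both sources use the same identifier for the same person, which is the point the linked data movement is making. Without shared identifiers a journalist is left matching on names, with all the misspellings and coincidences that involves.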
Concrete results of both movements can be seen in the US and UK – most visibly with the launch of government data repositories Data.gov and Data.gov.uk in 2009 and 2010 respectively – but also in less publicised experiments such as Where Does My Money Go? – which uses data to show how public expenditure is distributed – and Mapumental – which combines travel data, property prices and public ratings of ‘scenicness’ to help you see at a glance which areas of a city might be the best place to live based on your requirements.
But there are dozens if not hundreds of similar examples in industries from health and science to culture and sport. We are experiencing an unprecedented release of data – some have named it ‘Big Data’ – and yet for the most part, media organisations have been slow to react.
That is about to change.
The data journalist
Over the last year an increasing number of news organisations have started to wake from their story-centric production lines and see the value of data. In the UK the MPs’ expenses story was seminal: when a newspaper dictates the news agenda for six weeks, the rest of Fleet Street pays attention – and at the core of this story was a million pieces of data on a disc. Since then every serious news organisation has expanded its data operations.
In the US the journalist-programmer Adrian Holovaty has pioneered the form with the data mashup ChicagoCrime.org and its open source offspring EveryBlock, while Aron Pilhofer has innovated at the interactive unit at The New York Times, and new entrants from Talking Points Memo to ProPublica have used data as a launchpad for interrogating the workings of government.
To those involved, these feel like heady days. In reality, it’s very early days indeed. Data journalism takes in a huge range of disciplines, from Computer-Assisted Reporting (CAR) and programming, to visualisation and statistics. If you are a journalist with a strength in one of those areas, you are currently exceptional. This cannot last for long: the industry will have to skill up, or it will have nothing left to sell.
Because while news organisations for years made a business out of being a middleman processing content between commerce and consumers, and government and citizens, the internet has made that business model obsolete. It is not enough any more for a journalist to simply be good at writing – or rewriting. There are a million others out there who can write better – large numbers of them working in PR, marketing, or government. While we will always need professional storytellers, many journalists are simply factory line workers.
So on a commercial level if nothing else, publishing will need to establish where the value lies in this new environment – and the new efficiencies to make journalism viable.
Data journalism is one of those areas. With a surfeit of public data being made available, there is a rich supply of raw material. The scarcity lies in the skills to locate and make sense of it – whether the programming skills to scrape it and compare it with other sources in the first place, the design flair to visualise it, or the statistical understanding to unpick it.
“The mass market was a hack”: opportunities for the new economy
The technological opportunity is massive. As processing power continues to grow, the ability to interrogate, combine and present data continues to increase. The development of augmented reality provides a particularly attractive publishing opportunity: imagine being able to see local data-based stories through your mobile phone, or indeed add data to the picture through your own activity. The experiments of the past five years will come to seem crude in comparison.
And then there is the commercial opportunity. For most publishers, after all, publishing is not about selling content but about selling advertising. And here also data has taken on increasing importance. The mass market was a hack. As the saying goes: “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”
But Google, Facebook and others have used the measurability of the web to reduce the margin of error, and publishers will have to follow suit. It makes sense to put data at the centre of that. If you allow users to drill into the data you have gathered on automotive safety, for example, the offer to advertisers is likely to be “We can display different adverts based on what information the user is interested in”, or “We can point the user to their local dealership based on their location”.
A collaborative future
I’m sceptical of the ability of established publishers to adapt to such a future but, whether they do or not, others will. And the backgrounds of journalists will have to change. The profession has a history of arts graduates who are highly literate but not typically numerate. That has already been the source of ongoing embarrassment for the profession as expert bloggers have highlighted basic errors in the way journalists cover science, health and finance – and it cannot continue.
We will need more journalists who can write a killer Freedom of Information request; more researchers with a knowledge of the hidden corners of the web where databases – the ‘invisible web’ – reside. We will need programmer-journalists who can write a screen scraper to acquire, sort, filter and store that information, and combine or compare it with other sources. We will need designers who can visualise that data in the clearest way possible – not just for editorial reasons but for distribution too: infographics are an increasingly significant source of news site traffic.
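For readers who have never seen one, a screen scraper can be very small. The sketch below uses only Python’s standard library, and it makes some loud assumptions: the HTML table, the supplier names and the 50,000 threshold are all invented, and the page is embedded as a string, so the download step and the storage step are left out. It shows only the middle of the job: pulling rows out of a web page, then filtering and sorting them.

```python
from html.parser import HTMLParser

# Invented sample page; a real scraper would first download the HTML.
SAMPLE_HTML = """
<table>
  <tr><th>Supplier</th><th>Amount</th></tr>
  <tr><td>Acme Roads Ltd</td><td>120000</td></tr>
  <tr><td>Fresh Plates Ltd</td><td>4500</td></tr>
  <tr><td>Northern Print Co</td><td>56000</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Convert amounts to numbers, keep the large payments, biggest first.
payments = [(supplier, int(amount)) for supplier, amount in scraper.rows]
large = sorted((p for p in payments if p[1] >= 50000),
               key=lambda p: p[1], reverse=True)

for supplier, amount in large:
    print(supplier, amount)
```

Real pages are messier than this sample, which is why scraping is a skill rather than a recipe, but the shape of the task does not change: find the structure in the page, extract it, and turn it into rows you can question.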
There is a danger of ‘data churnalism’ – taking public statistics and visualising them in a spectacular way that lacks insight or context. Editors will need the statistical literacy to guard against this, or they will be found out.
And it is not just in editorial that innovation will be needed. Advertising sales will need to experience the same revolution that journalists have experienced, learning the language of web metrics and behavioural advertising, and selling the benefits to advertisers.
And as publishers of data too, executives will need to adopt the philosophies of the open data and linked data movements to take advantage of the efficiencies that they provide. The New York Times and The Guardian have both published APIs that allow others to build web services with their content. In return they get access to otherwise unaffordable technical, mathematical and design expertise, and benefit from new products and new audiences, as (in the Guardian’s case) advertising is bundled in with the service. As these benefits become more widely recognised, other publishers will follow.
I have a hope that this will lead to a more collaborative form of journalism. The biggest resource a publisher has is its audience. Until now publishers have simply packaged up that resource for advertisers. But now that the audience is able to access the same information and tools as journalists, to interact with publishers and with each other, they are valuable in different ways.
At the same time the value of the newsroom has diminished: its size has shrunk and its competitive advantage has reduced, and no single journalist has the depth and breadth of skills across statistics, CAR, programming and design that data journalism requires. A new medium – and a new market – demands new rules. The more networked and iterative form of journalism that we’ve already seen emerge online is likely to become even more conventional as publishers move from a model that sees the story as the unit of production to a model that starts with data.