Data journalism pt1: Finding data (draft – comments invited)

The following is a draft from a book about online journalism that I’ve been working on. I’d really appreciate any additions or comments you can make – particularly around sources of data and legal considerations.

The first stage in data journalism is sourcing the data itself. Often you will be seeking out data based on a particular question or hypothesis (for a good guide to forming a journalistic hypothesis see Mark Hunter’s free ebook Story-Based Inquiry (2010)). On other occasions, it may be that the release or discovery of data itself kicks off your investigation.

There is a range of sources available to the data journalist, both online and offline, public and hidden. Typical sources include:

national and local government;
bodies that monitor organisations (such as regulators or consumer bodies);
scientific and academic institutions;
health organisations;
charities and pressure groups;
business;
and the media itself.

One of the best places to find UK government data online, for example, is Data.gov.uk, an initiative influenced by its US predecessor Data.gov. Data.gov.uk – launched in January 2010 with the backing of the inventor of the World Wide Web, Sir Tim Berners-Lee – effectively acts as a search engine and index for thousands of sets of data held by a range of government departments, from statistics on the re-offending of juveniles to the Agricultural Price Index. The site also hosts forums for users to discuss their use of the data, examples of applications using data, further information on how to use the data, and technical resources.

At a regional level, local authorities are also releasing information that can be used as part of data journalism projects. The quality, quantity and accessibility of this information vary enormously by council, but there is continuing pressure for improvement in this area.

There are also a number of volunteer projects, such as OpenlyLocal and Mash The State, that make local government data available in as accessible a format as possible. The organisation MySociety operates a group of websites providing easy access to information ranging from particular politicians’ voting records (TheyWorkForYou) to local problems (FixMyStreet) and information about a particular area’s transport links and beauty (Mapumental). MySociety also runs a petitions website for Downing Street, and websites that allow people to pledge to do something if other people sign up too (PledgeBank), to find groups near them (GroupsNearYou), contact their MP (WriteToThem) or be contacted by their MP (HearFromYourMP).

Private companies and charities

In the private sector, a number of organisations regularly release data online, from tables and research reports published on company websites to the annual reports that are filed with bodies such as Companies House. Also worth looking at is the web project Companies Open House, which seeks to make company information more easily accessible.

The Charity Commission is an excellent source of information on registered charities, which must file accounts and annual reports with it. The commission also conducts occasional research into the sector.

Regulators, researchers and the media

NHS foundation trusts likewise must file reports with their regulator, Monitor. You will find similar regulators in other areas, such as the Financial Services Authority, Ofcom, Ofwat, Ofqual, the General Medical Council, the General Social Care Council and the Pensions Regulator, to name just a few.

For academic and scientific research there are hundreds of specialist journals. Most have online search facilities which will provide access to summaries. To get access to the full paper you will probably need to use the library of a university which has a subscription. For access to a journal on midwifery, for example, your best bet is to give a quick call to the nearest university which teaches courses in that field. Although university libraries increasingly limit access to students, you can request a special pass. For access to the data on which research is based it is likely you will need to contact the author.

Media organisations such as The Guardian and the New York Times publish ‘datablogs’ that regularly release sets of data produced or acquired by investigations, ranging from scientific information about global warming to lists of Oscar winners. These can be a rich source of material for the data journalist, and a great starting point for the beginner as they are often ‘cleaner’ than data from elsewhere.

The Guardian and the New York Times websites are also among an increasing number of web platforms that make their own data available via APIs (Application Programming Interfaces). Typically, the websites that offer this access are social networking sites (such as Flickr and Twitter).

Accessing this data typically requires a level of technical ability, but can be particularly useful in measuring activity across social networks (for example sharing and publishing). Even if you don’t have that technical ability, understanding the possibilities can be extremely useful when working with web developers on a data journalism project (see the part of this chapter on mashups for more information on APIs).
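To give a flavour of what that technical work involves, here is a minimal sketch in Python. The endpoint, the parameter names and the shape of the response are all hypothetical, invented purely for illustration; every real API documents its own. The sketch shows the two basic steps: building a request URL, and pulling counts out of the structured response that comes back.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only;
# every real API documents its own URLs, parameters and access keys.
BASE_URL = "https://api.example.org/search"


def request_url(query, api_key, page=1):
    """Build the URL a program would fetch to put a question to the API."""
    return BASE_URL + "?" + urlencode({"q": query, "page": page, "api-key": api_key})


def count_by_section(response_text):
    """Count results per section in a JSON response of the assumed shape."""
    data = json.loads(response_text)
    counts = {}
    for item in data["results"]:
        counts[item["section"]] = counts.get(item["section"], 0) + 1
    return counts


# A made-up response standing in for what such an API might return.
example_response = (
    '{"results": ['
    '{"title": "A", "section": "politics"}, '
    '{"title": "B", "section": "politics"}, '
    '{"title": "C", "section": "science"}]}'
)

print(request_url("iraq inquiry", api_key="YOUR-KEY"))
print(count_by_section(example_response))
```

The point is less the code itself than the pattern: an API answers a precisely worded question with structured data, which can then be counted, filtered or charted without any copying and pasting.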

Using search engines to find data

If you are using a search engine to find the data you are looking for, you should familiarise yourself with the advanced search facility, where you can often specify the format of the file you are looking for. Searching specifically for spreadsheets (files ending in .xls), for example, is likely to get you to data more quickly. Similarly, official reports can often be found more effectively by searching for PDF files (.pdf), while PowerPoint presentations (.ppt) will sometimes contain useful tables of data. You can also include ‘XML’ or ‘RDF’ in your search terms if you think your data may be in those or other formats.

Advanced search also allows you to specify the type of website you are searching – those ending in .gov.uk (government), .org and .org.uk (charities), .ac.uk (educational establishments), .nhs.uk, .police.uk and .mod.uk (Ministry of Defence) are just some that will be particularly relevant (you can also specify an individual site – for instance, that of a local council). A basic familiarity with these search techniques – for example limiting your search to spreadsheets on .gov.uk websites – can improve your results.
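Both filters can usually be typed straight into the search box as well. As a minimal sketch, assuming a search engine that supports the `filetype:` and `site:` operators (Google does; others may use different syntax), the hypothetical helper functions below assemble such a query and turn it into a URL:

```python
from urllib.parse import urlencode


def build_query(terms, filetype=None, site=None):
    """Combine plain search terms with optional file-type and site filters."""
    parts = [terms]
    if filetype:
        # e.g. "xls" for spreadsheets, "pdf" for official reports
        parts.append(f"filetype:{filetype.lstrip('.')}")
    if site:
        # e.g. "gov.uk" for government, "ac.uk" for universities
        parts.append(f"site:{site.lstrip('.')}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Turn a query into a URL that can be pasted into a browser."""
    return base + "?" + urlencode({"q": query})


query = build_query("juvenile reoffending", filetype=".xls", site=".gov.uk")
print(query)  # juvenile reoffending filetype:xls site:gov.uk
print(search_url(query))
```

Swapping the `site` value for a single council's web address narrows the same search to that one organisation.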

Live data

Another type of data to think about is live data that is not stored anywhere yet but, rather, will be produced at a particular time. A good example of this would be how newspapers are increasingly using Twitter commentary to provide context to a particular debate. Part of the Guardian’s coverage of Tony Blair’s appearance at the Chilcot Inquiry into the Iraq War, for example, used the data of thousands of Twitter updates (‘tweets’) to provide a ‘sentiment analysis’ timeline of how people reacted to particular parts of his evidence as it went on. Similar timelines have been produced for political debates and speeches to measure public reaction.

Preparation is key to live data projects – where will you get the data from, and how will you filter it? How can you visualise it most clearly? And how do you prevent it being ‘gamed’ (users intentionally skewing the results for fun or commercial or political reasons)?
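Those questions can be made concrete with a small sketch. The Python below is not how the Guardian built its timeline; it is a deliberately crude illustration under stated assumptions: messages arrive as (timestamp, user, text) tuples, sentiment is scored against two tiny invented word lists, and gaming is limited simply by capping how many messages each user can contribute. The function names, word lists and sample messages are all made up.

```python
from collections import defaultdict
from datetime import datetime

# Tiny illustrative word lists; a real project would need a far richer lexicon.
POSITIVE = {"honest", "convincing", "good"}
NEGATIVE = {"evasive", "lies", "bad"}


def score(text):
    """Crude sentiment: +1 per positive word, -1 per negative word."""
    words = [w.strip(".,!?#@'\"").lower() for w in text.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def timeline(messages, keyword, max_per_user=3):
    """Average sentiment per minute for messages mentioning the keyword.

    Each message is a (timestamp, user, text) tuple. Capping how many
    messages each user can contribute is one simple guard against a
    handful of accounts skewing the result.
    """
    per_user = defaultdict(int)
    buckets = defaultdict(list)
    for when, user, text in messages:
        if keyword.lower() not in text.lower():
            continue  # filter: ignore off-topic messages
        if per_user[user] >= max_per_user:
            continue  # anti-gaming: ignore a user's extra messages
        per_user[user] += 1
        minute = when.replace(second=0, microsecond=0)
        buckets[minute].append(score(text))
    return {m: sum(s) / len(s) for m, s in sorted(buckets.items())}


# Invented sample messages, not real tweets.
sample = [
    (datetime(2010, 1, 1, 10, 0, 5), "anna", "Blair sounds convincing"),
    (datetime(2010, 1, 1, 10, 0, 40), "ben", "Blair is being evasive"),
    (datetime(2010, 1, 1, 10, 1, 10), "cat", "Blair: evasive again, more lies"),
    (datetime(2010, 1, 1, 10, 1, 30), "dan", "Nice weather today"),
]
for minute, average in timeline(sample, "blair").items():
    print(minute.strftime("%H:%M"), average)
```

Each decision in the sketch (which keyword to filter on, how to score a message, how many messages to accept per user) is an editorial choice that should be settled before the event begins rather than during it.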

Legal considerations

Whatever data you are acquiring, you will need to consider whether you have permission to republish that data. Data may be covered by copyright, or may raise issues of data protection or privacy. Even apparently anonymous information can sometimes be traced back to individual users (Barbaro & Zeller, 2006). And while government information is paid for by public money, it is, strictly speaking, often covered by Crown Copyright, and organisations like Ordnance Survey and Royal Mail have been notoriously protective of geographical information and postcodes (see Brooke, 2010).

Books and FOI

Of course, there is also a rich range of data available in books that the data journalist should familiarise themselves with – from books of facts and statistics to almanacs, from the Civil Service Year Book (also online) to volumes like Who’s Who (online at ukwhoswho.com – your library may have a subscription).

Particularly useful is the data held by public bodies which can be accessed through a well-worded Freedom of Information (FOI) request. Heather Brooke’s book Your Right To Know (2007) is a key reference work in this area, and the online tool WhatDoTheyKnow is particularly useful in allowing you to submit FOI requests easily, as well as allowing you to find similar FOI requests and the responses to them.

When requesting data through an FOI request, it is always useful to specify the format that you wish the information to be supplied in – typically a spreadsheet in electronic format. A PDF or Word document, for example, will mean extra work at the next stage: interrogation.
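To see why the format matters, consider how little work a spreadsheet-style file needs before it can be questioned. The sketch below is in Python and uses invented figures and a hypothetical helper function; it reads comma-separated data (a format any spreadsheet can export) and totals one column by another. The same figures locked inside a PDF would first have to be retyped or extracted.

```python
import csv
import io

# Invented figures standing in for a spreadsheet supplied in response to a request.
supplied = """council,year,spend_gbp
Northtown,2009,1200
Northtown,2010,1500
Southby,2010,900
"""


def total_by(rows, key, value):
    """Add up the numeric column `value` for each distinct entry in `key`."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + int(row[value])
    return totals


# Each row becomes a dictionary keyed by the column headings.
rows = list(csv.DictReader(io.StringIO(supplied)))
print(total_by(rows, "council", "spend_gbp"))  # {'Northtown': 2700, 'Southby': 900}
```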

UPDATE: Tim Davies lists a couple of further avenues along these lines:

http://unlockingservice.data.gov.uk/ provides a route for requesting that data be opened up by the Data.gov.uk team. It’s not backed by the legal framework of FOI, but may play a role in data requests under the currently debated ‘Right to Data’ legislation.

IsItOpenData.org provides a useful tool for asking non-public bodies to share their data as open data, or to clarify the licensing.

Once again – this is a draft: I’d really appreciate any additions or comments you can make – particularly around sources of data and legal considerations. Part 2 – on interrogating data – can be found here.

Online Journalism Blog

