Data journalism pt1: Finding data (draft – comments invited)

The following is a draft from a book about online journalism that I’ve been working on. I’d really appreciate any additions or comments you can make – particularly around sources of data and legal considerations.

The first stage in data journalism is sourcing the data itself. Often you will be seeking out data based on a particular question or hypothesis (for a good guide to forming a journalistic hypothesis see Mark Hunter’s free ebook Story-Based Inquiry (2010)). On other occasions, it may be that the release or discovery of data itself kicks off your investigation.

There is a range of sources available to the data journalist, both online and offline, public and hidden. Typical sources include:

  • national and local government;
  • bodies that monitor organisations (such as regulators or consumer bodies);
  • scientific and academic institutions;
  • health organisations;
  • charities and pressure groups;
  • business;
  • and the media itself.

One of the best places to find UK government data online, for example, is the government’s own open data site, an initiative influenced by its US predecessor. Launched in January 2010 with the backing of the inventor of the World Wide Web, Sir Tim Berners-Lee, it effectively acts as a search engine and index for thousands of sets of data held by a range of government departments, from statistics on the re-offending of juveniles to the Agricultural Price Index. The site also hosts forums for users to discuss their use of the data, examples of applications using data, further information on how to use the data, and technical resources.

At a regional level, local authorities are also releasing information that can be used as part of data journalism projects. The quality, quantity and accessibility of this information varies enormously by council, but there is continuing pressure for improvement in this area.

There are also a number of volunteer projects, such as OpenlyLocal and Mash The State, that make local government data available in as accessible a format as possible. The organisation MySociety, meanwhile, operates a group of websites providing easy access to information ranging from particular politicians’ voting records (TheyWorkForYou) to local problems (FixMyStreet) and information about a particular area’s transport links and beauty (Mapumental). MySociety also runs a petitions website for Downing Street, and websites that allow people to pledge to do something if other people sign up too (PledgeBank), to find groups near you (GroupsNearYou), contact your MP (WriteToThem) or be contacted by them (HearFromYourMP).

Private companies and charities

In the private sector, a number of organisations regularly release data online, from tables and research reports published on company websites to the annual reports that are filed with bodies such as Companies House. Also worth looking at is the web project Companies Open House, which seeks to make company information more easily accessible.

The Charity Commission is an excellent source of information on registered charities, which must file accounts and annual reports with it. The commission also conducts occasional research into the sector.

Regulators, researchers and the media

NHS foundation trusts likewise must file reports to their regulator, Monitor. And you will find similar regulators in other areas such as the Financial Services Authority, Ofcom, Ofwat, Ofqual, the General Medical Council, the General Social Care Council and the Pensions Regulator to name just a few.

For academic and scientific research there are hundreds of specialist journals. Most have online search facilities which will provide access to summaries. To get access to the full paper you will probably need to use the library of a university which has a subscription. For access to a journal on midwifery, for example, your best bet is to give a quick call to the nearest university which teaches courses in that field. Although university libraries increasingly limit access to students, you can request a special pass. For access to the data on which research is based it is likely you will need to contact the author.

Media organisations such as The Guardian and the New York Times publish ‘datablogs’ that regularly release sets of data produced or acquired by investigations, ranging from scientific information about global warming to lists of Oscar winners. These can be a rich source of material for the data journalist, and a great starting point for the beginner as they are often ‘cleaner’ than data from elsewhere.

The Guardian and the New York Times websites are also among an increasing number of web platforms making their own data available via APIs (Application Programming Interfaces). Typically, the websites which offer this access are social networking sites (such as Flickr and Twitter).

Accessing this data typically requires a level of technical ability, but can be particularly useful in measuring activity across social networks (for example sharing and publishing). Even if you don’t have that technical ability, understanding the possibilities can be extremely useful when working with web developers on a data journalism project (see the part of this chapter on mashups for more information on APIs).
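As a rough illustration of what working with API data involves: most APIs return structured text (often JSON) that a short script can tally. The sketch below parses a made-up response and counts shares per day. The field names (`items`, `shared_at`) and the sample payload are assumptions for illustration only and do not correspond to any real service's API.

```python
import json
from collections import Counter

# Hypothetical API response: the structure and field names are invented
# for illustration and do not match any real service.
SAMPLE_RESPONSE = """
{"items": [
  {"url": "http://example.com/a", "shared_at": "2010-04-20T09:15:00"},
  {"url": "http://example.com/a", "shared_at": "2010-04-20T17:40:00"},
  {"url": "http://example.com/b", "shared_at": "2010-04-21T08:05:00"}
]}
"""

def shares_per_day(raw_json):
    """Count how many items were shared on each calendar day."""
    data = json.loads(raw_json)
    # The first ten characters of an ISO timestamp are the date (YYYY-MM-DD).
    days = Counter(item["shared_at"][:10] for item in data["items"])
    return dict(days)

counts = shares_per_day(SAMPLE_RESPONSE)
print(counts)
```

In practice the response would come from the platform itself, and its documentation would define the real field names and any access limits.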

Using search engines to find data

If you are using a search engine to find the data you are looking for, you should familiarise yourself with the advanced search facility, where you can often specify the format of the file you are looking for. Searching specifically for spreadsheets (files ending in .xls), for example, is likely to get you to data more quickly. Similarly, official reports can often be found more effectively by searching for PDF files (.pdf), while PowerPoint presentations (.ppt) will sometimes contain useful tables of data. You can also include ‘XML’ or ‘RDF’ in your search terms if you think your data may be in those or other formats.

Advanced search also allows you to specify the type of website you are searching: sites belonging to government, charities (for example those ending in .org), educational establishments, the NHS (.nhs) and the Ministry of Defence (.mod) are just some that will be particularly relevant. You can also specify an individual site, for instance that of a local council. A basic familiarity with these search techniques, for example limiting your search to spreadsheets on one type of website, can improve your results.
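The same restrictions can usually be typed straight into the search box. The sketch below builds such a query using the `filetype:` and `site:` operators that Google supports; other search engines may use different syntax, so check their help pages. The function name and the example domain (`gov.uk`) are illustrative assumptions rather than anything prescribed above.

```python
def build_query(terms, filetype=None, site=None):
    """Combine search terms with optional filetype: and site: operators."""
    parts = [terms]
    if filetype:
        # Accept ".xls" or "xls"; the operator itself takes no leading dot.
        parts.append("filetype:" + filetype.lstrip("."))
    if site:
        parts.append("site:" + site)
    return " ".join(parts)

# Example: spreadsheets about juvenile re-offending on an assumed domain.
query = build_query("juvenile reoffending statistics", filetype=".xls", site="gov.uk")
print(query)
```

Pasting the printed string into the search box should restrict the results to spreadsheets hosted on sites under that domain.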

Live data

Another type of data to think about is live data that is not stored anywhere yet but, rather, will be produced at a particular time. A good example of this would be how newspapers are increasingly using Twitter commentary to provide context to a particular debate. Part of the Guardian’s coverage of Tony Blair’s appearance at the Chilcot Inquiry into the Iraq War, for example, used the data of thousands of Twitter updates (‘tweets’) to provide a ‘sentiment analysis’ timeline of how people reacted to particular parts of his evidence as it went on. Similar timelines have been produced for political debates and speeches to measure public reaction.

Preparation is key to live data projects – where will you get the data from, and how will you filter it? How can you visualise it most clearly? And how do you prevent it being ‘gamed’ (users intentionally skewing the results for fun or commercial or political reasons)?
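To make those questions concrete, here is a minimal sketch of a sentiment timeline. It assumes updates arrive as (user, timestamp, text) tuples, scores them with a deliberately naive word count, groups them by minute, and counts each user only once per minute as a crude guard against gaming. The word lists, sample updates and function names are all invented for illustration; this is not how the Guardian's timeline was built.

```python
from collections import defaultdict
from datetime import datetime

# Tiny illustrative word lists; a real project would need a far better
# sentiment method than counting words.
POSITIVE = {"good", "honest", "convincing"}
NEGATIVE = {"bad", "evasive", "unconvincing"}

def score(text):
    """Naive sentiment: positive words minus negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_timeline(updates):
    """Average sentiment per minute, counting each user once per minute."""
    buckets = defaultdict(dict)
    for user, timestamp, text in updates:
        minute = datetime.fromisoformat(timestamp).strftime("%H:%M")
        # Keep only a user's first update in each minute, a crude guard
        # against one account flooding the results.
        buckets[minute].setdefault(user, score(text))
    return {m: sum(s.values()) / len(s) for m, s in sorted(buckets.items())}

# Invented sample data: (user, time, text).
updates = [
    ("anna", "2010-01-01T10:00:05", "a convincing answer"),
    ("ben", "2010-01-01T10:00:40", "evasive and unconvincing"),
    ("ben", "2010-01-01T10:00:50", "bad bad bad"),
    ("cara", "2010-01-01T10:01:10", "honest reply"),
]
timeline = sentiment_timeline(updates)
print(timeline)
```

Even this toy version shows where the decisions lie: which updates to include, how to score them, and what counts as one person's voice.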

Legal considerations

Whatever data you are acquiring, you will need to consider whether you have permission to republish that data. Data may be covered by copyright, or may raise issues of data protection or privacy: even apparently anonymous information can sometimes be traced back to individual users (Barbaro & Zeller, 2006). And while government information is paid for by public money, it is, strictly speaking, often covered by Crown Copyright. Organisations like Ordnance Survey and Royal Mail, meanwhile, have been notoriously protective of geographical information and postcodes (see Brooke, 2010).

Books and FOI

Of course, there is also a rich range of data available in books that the data journalist should familiarise themselves with – from books of facts and statistics to almanacs, from the Civil Service Year Book (also online) to volumes like Who’s Who (also online; your library may have a subscription).

Particularly useful is the data held by public bodies which can be accessed through a well-worded Freedom of Information (FOI) request. Heather Brooke’s book Your Right To Know (2007) is a key reference work in this area, and the online tool WhatDoTheyKnow is particularly useful in allowing you to submit FOI requests easily, as well as allowing you to find similar FOI requests and the responses to them.

When requesting data through an FOI request, it is always useful to specify the format that you wish the information to be supplied in – typically a spreadsheet in electronic format. A PDF or Word document, for example, will mean extra work at the next stage: interrogation.
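A short sketch of why the format matters: data supplied as a spreadsheet (here saved as CSV) can be loaded and totalled in a few lines, whereas the same table inside a PDF or Word document would first have to be extracted. The column names and figures below are invented for illustration.

```python
import csv
import io

# Invented sample standing in for a spreadsheet supplied as CSV;
# the column names and figures are made up for illustration.
SAMPLE_CSV = """department,year,spend
Highways,2009,1200
Libraries,2009,300
Highways,2010,1500
"""

def total_by_column(csv_text, key, value):
    """Sum a numeric column, grouped by the values in another column."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key]] = totals.get(row[key], 0) + int(row[value])
    return totals

totals = total_by_column(SAMPLE_CSV, "department", "spend")
print(totals)
```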

UPDATE: Tim Davies lists a couple of further avenues along these lines. One provides a route for requesting that data is opened up by the team. It’s not backed by the legal framework of FOI, but may play a role in data requests under the currently debated ‘Right to Data’ legislation. The other provides a useful tool for asking non-public bodies to share their data as open data, or to clarify the licensing.

Once again – this is a draft: I’d really appreciate any additions or comments you can make – particularly around sources of data and legal considerations. Part 2 – on interrogating data – can be found here.


30 thoughts on “Data journalism pt1: Finding data (draft – comments invited)”

  1. Jeremy Bante

    So you’ve got plenty on finding other sources of already-gathered data, but surely data journalism must include gathering data on your own. As a reflection of the state of contemporary data journalism practice, this document might be forgiven for excluding it; but I would hope that any book used for instruction or direction would try to improve the field. Otherwise, data journalists are relying on data providers to do most of the investigating for them.

  3. ojb

    Thanks for reminding me – had meant to include at least some information on survey tools and basic research methodology. This is such a complex chapter it’s easy to overlook things (also covered in the next section on interrogating data is stuff on statistics, etc.)

  8. Neil

    Some really good coverage of data sources, and approaches to finding/obtaining it.

    Though, it’s always good to step back and think hard about the question you really want to answer.

    Always work from the question, not the data, as the data you have might not fully answer it etc. [i.e don’t go hunting for ‘interesting’ data unless you really know what you are after]

    You talk about advanced searching (i.e type:PDF ), but don’t mention how to achieve or where to lookup the syntax.

    1. Paul Bradshaw

      Thanks – yes, I should probably say to look for an ‘advanced search’ link. I’ve avoided talking about syntax but perhaps should put a line in explaining the option.
