This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.
“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)
Once you have the data, you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals.
The first stage in this process, then, is making sure the data is in the right format to be interrogated. Quite often this will be a spreadsheet or CSV (comma-separated values) file. If your information is in a PDF, you will not be able to do a great deal with it other than re-type the values into a new spreadsheet (checking carefully that you have not introduced any errors). A Word or PowerPoint document is likely to require the same work.
If the information is already online you can sometimes ‘scrape’ it – that is, automatically copy the relevant information into a separate document. How easy this is to do depends on how structured the information is. A table in a Wikipedia entry, for example, can be ‘scraped’ into a Google spreadsheet relatively easily (Tony Hirst gives instructions on how to do this at http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/). An online CSV file, and certain other structured data, can be scraped with Yahoo! Pipes (see below for more on using Yahoo! Pipes). Most other scraping, however, will involve programming – the online tool ScraperWiki provides one environment to help you do this.
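To give a flavour of the Wikipedia approach: in a Google spreadsheet you can type a formula along the following lines into a cell (the URL and table number here are purely illustrative – you will need the address of your own page, and finding the right table number can take a little trial and error):

=ImportHtml("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 1)

This tells the spreadsheet to fetch that page and copy its first HTML table into your sheet.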
Insert: Cleaning up data
Whether you have been given data, had to scrape it, or copied it manually, you will probably need to clean it up. All sorts of things can ‘dirty’ your data, from misspellings and variations in spelling, to odd punctuation, mixtures of numbers and letters, unnecessary columns or rows, and more. Computers, for example, will see ‘New Town’, ‘Newtown’ and ‘newtown’ as three separate towns when they may be one.
This can cause problems later on when analysing your data – for example, calculations not working or results not being accurate.
Some tips for cleaning your data include:
- Use a spellchecker to check for misspellings. You will probably have to add some words to the computer’s dictionary.
- Use ‘find and replace’ (normally in the Edit menu) to remove double-spaces and other common punctuation errors. Alternatively, if you are in Excel you can create a new column and use the =TRIM() function, which copies the contents of the cell referred to in the brackets with any leading, trailing or repeated spaces removed (see the example formulas after this list).
- Remove duplicate entries – if you are using Excel there are a few ways to do this under the Data tab; search for ‘duplicates’ in Help.
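To sketch how those tips might look in practice (the cell references here are illustrative – adjust them to your own sheet), you could add a ‘helper’ column alongside your data containing formulas such as:

- =TRIM(A2) – copies A2 with leading, trailing and repeated spaces stripped out
- =LOWER(TRIM(A2)) – does the same and also converts the text to lower case, so that ‘Newtown’ and ‘newtown’ match
- =COUNTIF(A:A,A2) – counts how many times the value in A2 appears in column A; anything above 1 is a possible duplicate worth checking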
For more tips on Excel specifically see this guide: http://office.microsoft.com/en-us/excel/HA102218401033.aspx
For cleaning up very large sets of data you might want to use a data cleaning tool like Google Refine.
Spotting the story
Once your data is cleaned you can start to look for anything newsworthy in it. There are some obvious places to start: if you are dealing with numbers, for example, you can work out the ‘average’ (the average bonus paid to council employees, say). Similarly, you might look for the term which appears most often (e.g. the most common reason given for arresting terrorist suspects).
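In a spreadsheet these are one-line formulas (the ranges below are illustrative):

- =AVERAGE(B2:B500) gives the mean of the numbers in that range
- =MEDIAN(B2:B500) gives the middle value, which can be more representative when a few very large figures distort the mean
- =MODE(B2:B500) gives the most frequently occurring number; for text, such as reasons for arrest, a =COUNTIF() tally of each term is one way to find the most common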
However, Kaiser Fung, a statistician whose blog Junk Charts is essential reading in this field, notes the dangers of lazily reaching for the average when you want to make an editorial point:
“Averaging stamps out diversity, reducing anything to its simplest terms. In so doing, we run the risk of oversimplifying, of forgetting the variations around the average. Hitching one’s attention to these variations rather than the average is a sure sign of maturity in statistical thinking. One can, in fact, define statistics as the study of the nature of variability. How much do things change? How large are these variations? What causes them?” (Fung, 2010, p4)
So while averages and modes (the mode is the number or term which appears most often) can be interesting discoveries, they should most often be treated as a starting point for more illuminating investigation – which normally involves leaving your computer to make phone calls and speak to sources.
If you are looking at data over time, you can see what has increased or decreased over that period – or disappeared altogether.
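A quick way to do this in a spreadsheet (again, a sketch – the cells will depend on your own layout): if B2 holds the earlier figure and C2 the later one, putting =(C2-B2)/B2 in a new column and formatting it as a percentage gives you the change for each row; sorting on that column brings the biggest risers and fallers to the top.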
But you will need to gather further data to provide context to your figures. If, for example, more council staff are receiving bonuses, is that simply because more staff have been employed? How much is spent on wages, and how do your figures compare? If you are comparing one city with another, understand how their populations differ – not just in aggregate, but in relevant details such as age, ethnicity, life expectancy, etc. You will need to know where to access basic statistics like these – the National Statistics website is often a good place to start.
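As a simple illustration (with hypothetical cells again): if B2 holds a council’s total bonus spending and C2 its number of staff, =B2/C2 gives spending per member of staff – a fairer basis for comparing councils of different sizes than the raw totals.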
Sometimes a change in the way data is gathered or categorised can produce a dramatic change in the data itself. In one example, designer Adrian Short obtained information (via an FOI request) on parking tickets from Transport for London which showed that the number of tickets issued for a particular offence plummeted from around 8,000 to 8 in the space of one month (Arthur, 2009). Had people suddenly stopped committing that parking offence, or was there another explanation? A quick phone call to Transport for London revealed that traffic wardens had been issued with new handsets around the same time. Guardian journalist Charles Arthur hypothesised:
“Could it be that s46 [another offence which had a steep rise at the same time] is the default on the screen to issue a new ticket, and that wardens don’t bother to change it? Whatever it is, there’s a serious problem for TfL if those aren’t all s46 offences which have been ticketed since August 2006. Because if the ticket isn’t written out to the correct offence, then the fine isn’t payable. Theoretically, TfL might have to pay back millions in traffic fines for people who have been ticketed for s46 offences when they were actually committing s25 or s30 offences.”
This particular story came about at least in part because that information was easy to visualise.
The next section covers visualisation. In the meantime, once again I’d really appreciate any additions or comments – particularly around ways of spotting stories in data, and mistakes to avoid.