This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.
UPDATE: It has now been published in The Online Journalism Handbook.
“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)
Once you have the data you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals.
The first stage in this process, then, is making sure the data is in the right format to be interrogated. Quite often this will be a spreadsheet or CSV (comma-separated values) file. If your information is in a PDF you will not be able to do a great deal with it other than re-type the values into a new spreadsheet (making sure to check you have not made any errors). A Word or Powerpoint document is likely to require the same work.
If the information is already online you can sometimes ‘scrape’ it – that is, automatically copy the relevant information into a separate document. How easy this is to do depends on how structured the information is. A table in a Wikipedia entry, for example, can be ‘scraped’ into a Google spreadsheet relatively easily (Tony Hirst gives instructions on how to do this at http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/) and an online CSV file and certain other structured data can be scraped with Yahoo! Pipes (see below for more on using Yahoo! Pipes) but most scraping will involve programming (the online tool ScraperWiki provides one environment to help you do this).
Insert: Cleaning up data
Whether you have been given data, had to scrape it, or copied it manually, you will probably need to clean it up. All sorts of things can ‘dirty’ your data, from misspellings and variations in spelling, to odd punctuation, mixtures of numbers and letters, unnecessary columns or rows, and more. Computers, for example, will see ‘New Town’, ‘Newtown’ and ‘newtown’ as three separate towns when they may be one.
This can cause problems later on when analysing your data – for example, calculations not working or results not being accurate.
Some tips for cleaning your data include:
- Use a spellchecker to check for misspellings. You will probably have to add some words to the computer’s dictionary.
- Use ‘find and replace’ (normally in the Edit menu) to remove double-spaces and other common punctuation errors. Alternatively, if you are in Excel you can create a new column and use the =TRIM() function, which will copy the contents of the cell in the brackets and remove any spaces.
- Remove duplicate entries – if you are using Excel there are a few ways to do this under the Data tab – search for duplicates in Help.
For more tips on Excel specifically see this guide: http://office.microsoft.com/en-us/excel/HA102218401033.aspx
For cleaning up very large sets of data you might want to use a data cleaning tool like Google Refine.
Spotting the story
Once your data is cleaned you can start to look for anything newsworthy in it. There are some obvious places to start: if you are dealing with numbers, for example, you can work out what the ‘average’ is (the average bonus paid to council employees, for example). Similarly, you might look for the term which appears most often (e.g. the most common reason given for arresting terrorist suspects).
However, Kaiser Fung, a statistician whose blog Junk Charts is essential reading on the field, notes the dangers in lazily reaching for the average when you want to make an editorial point:
“Averaging stamps out diversity, reducing anything to its simplest terms. In so doing, we run the risk of oversimplifying, of forgetting the variations around the average. Hitching one’s attention to these variations rather than the average is a sure sign of maturity in statistical thinking. One can, in fact, define statistics as the study of the nature of variability. How much do things change? How large are these variations? What causes them?” (Fung, 2010, p4)
So while averages and modes (a mode is the number or term which appears most often) can be interesting discoveries, they should most often be used as a starting point for more illuminating investigation – normally involving leaving your computer to make phonecalls and speak to sources.
If you are looking at data over time, you can look to see what has increased over that period, or decreased – or disappeared.
But you will need to gather further data to provide context to your figures. If, for example, more council staff are receiving bonuses, is that simply because more staff have been employed? How much is spent on wages, and how do your figures compare? If you are comparing one city with another, understand how their populations differ – not just in aggregate, but in relevant details such as age, ethnicity, life expectancy, etc. You will need to know where to access basic statistics like these – the National Statistics website is often a good place to start.
Sometimes a change in the way data is gathered or categorised can produce a dramatic change in the data itself. In one example, designer Adrian Short obtained information (via an FOI request) on parking tickets from Transport for London that showed the numbers of tickets issued against a particular offence plummeted from around 8,000 to 8 in the space of one month (Arthur, 2009). Had people suddenly stopped committing that parking offence, or was there another explanation? A quick phonecall to Transport for London revealed that traffic wardens were issued with new handsets around the same time. Guardian journalist Charles Arthur hypothesised:
“Could it be that s46 [another offence which had a steep rise at the same time] is the default on the screen to issue a new ticket, and that wardens don’t bother to change it? Whatever it is, there’s a serious problem for TfL if those aren’t all s46 offences which have been ticketed since August 2006. Because if the ticket isn’t written out to the correct offence, then the fine isn’t payable. Theoretically, TfL might have to pay back millions in traffic fines for people who have been ticketed for s46 offences when they were actually committing s25 or s30 offences.”
This particular story came about at least in part because that information was easy to visualise.
The next section covers visualisation. In the meantime, once again I’d really appreciate any additions or comments – particularly around ways of spotting stories in data, and mistakes to avoid.
Pingback: links for 2010-04-26 « Onlinejournalismtest's Blog
“Use a spellchecker to check for misspellings. You will probably have to add some words to the computer’s dictionary.” — I guess these sort of things is why it still pays off to team up with a developer. E.g. in this case levenshtein distances are often enough to allow a computer to clean up dirty data by grouping together slight variations on a spelling, and in other cases simple regular expressions can go where simple find and replace can’t.
Thanks for link to ScraperWiki, looks like a pretty cool service.
Remember this is mainly aimed at journalism students who won’t have the luxury of teaming up with a developer. That said, the end of the chapter does say that being a member of online communities of support will be as important as reading books.
Yes, I’m afraid my remark was more of a loose thought than an actual suggestion for the book chapter. 🙂
I’m happy I don’t have to write it, because you have to strike a delicate balance between whetting students’ appetites (by diving in and doing stuff) and giving them softer skills that pay off in a newsroom (like how to talk to a coder, knowing where to find help if they’re on their own, what kinds of things are and are not possible with data).
I look forward to the final version!
It’s a very difficult chapter. Trying to cover statistics, information design, advanced search, methodology, programming, and a dozen other elements, all of which warrant books in their own right! The summary points them in a lot of other directions for further reading.
When discussing how to spot a story, you could consider giving an example of the differences between mean and median. It is not unusual for a single, extreme outlier to skew the mean of a distribution to the point where it is no longer meaningful.
Means can also be abused by averaging over time: for example, an average increase of 20 cement trucks/day on a road may not seem like a lot until you look at the median increase and see that they all use the road between 7-9 AM. You get a very different picture when you think about sharing the road with 20 trucks spread out over the course of a day than you do imaging the congestion of 20 trucks on your road during rush hour.
Thanks – yes, was thinking of putting some more in about averages. How To Lie With Statistics has a good line on modes, means and medians – will stick in a reference.
Not all PDFs are the same. Sometimes you can successfully copy and paste their content (but you may need to choose the Select tool first). Other times the result can be jumbled and will require editing. Occasionally they are locked to prevent copying, but usually not (in my experience — perhaps yours is different).
Excel has a useful tool for cleaning up data that’s in rows but not in a recognized columnar format; paste into a column and menu Data > Text to Columns.
Word tables can be copied and pasted into Excel. PowerPoint tables probably came from one or the other and generally can be copied and pasted back to them, too.
Excel can read HTML tables, also. Excel 2007 can easily read them online (just paste a URL into the open dialog), but it’s a bit of a hassle in other versions to open them from an online source. In that case, it’s simplest to just save the web page to disk and have Excel open it.
In addition to double spaces, you might want to mention to watch out for extra spaces at the beginning and end of text; they are often easy to miss and may prevent matching in some programs.
That’s all hugely useful – thanks enormously. Will add your name to the acknowledgements page!
Pingback: Periodismo de datos | tejiendo redes
In relation to cleaning data, a method I have always used (i.e. to convert spaces to tabs), is to copy the text/table into a Word Processor.
In View, select All Characters. You are then shown the “invisible” formatting in a visible form.
Simply CTRL-C an example of the formatting you want to replace (typing the symbol does not work) and open a Replace box
Paste the formatting in CTRL-V (it will not be visible in this box) and in the “Replace with” box either leave it blank (to remove all formatting of this kind) or enter a space or tab (by pressing that key on the keyboard)
Replace all, and there you go!
Sounds complicated but is very useful once you get the hang of it!
Pingback: How to be a data journalist | SCOT NETWORK
Pingback: Data journalism pt3: visualising data – charts and graphs (comments wanted) | Online Journalism Blog
Pingback: Data journalism pt1: Finding data (draft – comments invited) | Online Journalism Blog
Pingback: Data journalism pt4: visualising data – tools and publishing (comments wanted) | Online Journalism Blog
Pingback: Data journalism pt1: Finding data (draft – comments invited) | Online Journalism Blog
Paul, why is The Online Journalism Handbook not for sale on Kindle? :O(
The publishers are apparently ‘negotiating’ with libraries, with an ebook version due to be published next year. I think it’s a wider issue, but it has made me think I will self-publish next.