Tag Archives: cleaning

How to: uncover Excel data only revealed by a drop-down menu

Sometimes an organisation will publish a spreadsheet where only a part of the full data is shown when you select from a drop-down menu. In order to get all the data, you’d have to manually select each option, and then copy the results into a new spreadsheet.

It’s not great.

In this post, I’ll explain some tricks for finding out exactly where the full data is hidden, and  how to extract it without getting Repetitive Strain Injury. Here goes…

The example

fire data dropdown

To get the data from this spreadsheet you have to select 51 different options from a dropdown menu

The spreadsheet I’m using here is pretty straightforward: it’s a list of the populations for each fire and rescue authority in the UK (XLS). These figures are essential for putting any story about fires into context (giving us a per capita figure rather than just whole numbers) — and yet the authority behind the spreadsheet has made it very difficult to extract those numbers. Continue reading


How to: clean a converted PDF using Open Refine

Our initial table

This spreadsheet sent in response to an FOI request appeared to have been converted from PDF format

In a guest post post for OJB, Ion Mates explains how he used OpenRefine to clean up a spreadsheet which had been converted from PDF format. An earlier version of this post was published on his blog.

Journalists rarely get their hands on nice, tidy data: public bodies don’t have an interest in providing information in a structured form. So it is increasingly part of a journalist’s job to get that information into the right state before extracting patterns and stories.

A few months ago I sent a Freedom of Information request asking for the locations of all litter bins in Birmingham. But instead of sending a spreadsheet downloaded directly from their database, the spreadsheet they sent appeared to have been converted from a multiple-page PDF.

This meant all sorts of problems, from rows containing page numbers and repeated header rows, to information split across multiple rows and even pages.

In this post I’ll be taking you through I used the free data cleaning tool OpenRefine (formerly Google Refine) to tackle all these problems and create a clean version of the data. Continue reading

How to: combine multiple rows in a dataset where text is split across them (Open Refine)

When you’ve converted data from a PDF to a spreadsheet it’s not uncommon for text to end up being split across multiple rows, like this: text split across rows In this post I’ll explain how you can use Open Refine to quickly clean the data up so that the text is put back together and you have a single row for each entry. Continue reading

How to: clean up spreadsheet headings that run across multiple rows using Open Refine

Something that infuriates me often with government datasets is the promiscuous heading. This is when a spreadsheet doesn’t just have its headings across one row, but instead splits them across two, three or more rows.

To make matters worse, there are often also extra rows before the headings explaining the spreadsheet more generally. Here’s just one offender from the ONS:

A spreadsheet with promiscuous headings

A spreadsheet with promiscuous headings

To clean this up in Excel takes several steps – but Open Refine (formerly Google Refine) does this much more quickly. In this post I’m going to walk through the five minute process there that can save you unnecessary effort in Excel. Continue reading

Web security for journalists – takeaway tips and review

Web security for journalists - book cover

Early in Alan Pearce‘s book on web security, Deep Web for Journalists, a series of statistics appears that tell a striking story about the spread of surveillance in just one country.

199 is the first: the number of data mining programs in the US in 2004 when 16 Federal agencies were “on the look-out for suspicious activity”.

Just six years later there were 1,200 government agencies working on domestic intelligence programs, and 1,900 private companies working on domestic intelligence programs in the same year.

As a result of this spread there are, notes Pearce, 4.8m people with security clearance “that allows them to access all kinds of personal information”. 1.4m have Top Secret clearance.

But the most sobering figure comes at the end: 1,600 – the number of names added to the FBI’s terrorism watchlist each day.

Predictive policing

This is the world of predictive policing that a modern journalist must operate in: where browsing protesters’ websites, making particular searches, or mentioning certain keywords in your emails or tweets can put you on a watchlist, or even a no-fly list. An environment where it is increasingly difficult to protect your sources – or indeed for sources to trust you.

Alan Pearce’s book attempts to map this world – and outline the myriad techniques to avoid compromising your sources. Continue reading

A sample dirty dataset for trying out Google Refine

I’ve created this spreadsheet of ‘dirty data‘ to demonstrate some typical problems that data cleaning tools and techniques can be used for:

  • Subheadings that are only used once (and you need them in each row where they apply)
  • Odd characters that stand for something else (e.g. a space or ampersand)
  • Different entries that mean the same thing, either because they are lacking pieces of information, or have been mistyped, or inconsistently formatted

It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore more simple techniques in spreadsheets like Find and replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and titlecase respectively).

Thanks to Eva Constantaras for suggesting the idea.

UPDATE: Peter Verweij has put together an introduction to some other cleaning techniques here.