When you’ve converted data from a PDF to a spreadsheet, it’s not uncommon for text to end up split across multiple rows. In this post I’ll explain how you can use OpenRefine to quickly clean up the data so that the text is put back together and you have a single row for each entry. Continue reading
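As a rough sketch of the same idea outside OpenRefine (whose actual operations here are "fill down" and joining rows), here is how the fix might look in Python with pandas. The column names and data are invented for illustration:

```python
import pandas as pd

# Hypothetical example: after PDF conversion, each entry's text is
# split across several rows, with the ID only on the first row.
df = pd.DataFrame({
    "id":   ["A1", None, None, "A2", None],
    "text": ["The quick", "brown fox", "jumps", "Over the", "lazy dog"],
})

# Fill the ID down (OpenRefine's "fill down"), then rejoin the fragments
df["id"] = df["id"].ffill()
merged = df.groupby("id", sort=False)["text"].apply(" ".join).reset_index()
# One row per entry: A1 -> "The quick brown fox jumps", A2 -> "Over the lazy dog"
```

The same two steps – fill the identifying column down, then concatenate the text within each group – are what the OpenRefine recipe automates.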
Something that infuriates me often with government datasets is the promiscuous heading. This is when a spreadsheet doesn’t just have its headings across one row, but instead splits them across two, three or more rows.
To make matters worse, there are often also extra rows before the headings explaining the spreadsheet more generally. The ONS is just one offender here.
Cleaning this up in Excel takes several steps – but OpenRefine (formerly Google Refine) does it much more quickly. In this post I’m going to walk through the five-minute process, which can save you unnecessary effort in Excel. Continue reading
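To make the problem concrete, here is a sketch in Python with pandas. The data is invented (a real ONS file would be loaded with `pd.read_excel`), but the layout is the one described above: an explanatory row, then headings split across two rows:

```python
import pandas as pd

# Invented example of the "promiscuous heading" layout
raw = pd.DataFrame([
    ["Table 1: an explanation of the whole spreadsheet", None, None],
    ["Area",  "Population", "Population"],
    [None,    "2011",       "2012"],
    ["Leeds", "751,485",    "757,655"],
    ["York",  "198,051",    "198,900"],
])

# Merge the two heading rows (rows 1 and 2) into single headings,
# skipping blanks, then keep only the data rows beneath them
headings = [
    " ".join(str(v) for v in (raw.iloc[1, i], raw.iloc[2, i]) if pd.notna(v))
    for i in range(raw.shape[1])
]
data = raw.iloc[3:].reset_index(drop=True)
data.columns = headings
# Headings become: "Area", "Population 2011", "Population 2012"
```

OpenRefine does the equivalent interactively, but the principle is the same: collapse the heading rows into one, and drop everything above them.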
Early in Alan Pearce’s book on web security, Deep Web for Journalists, a series of statistics tells a striking story about the spread of surveillance in just one country.
199 is the first: the number of data mining programs in the US in 2004 when 16 Federal agencies were “on the look-out for suspicious activity”.
Just six years later there were 1,200 government agencies and 1,900 private companies working on domestic intelligence programs.
As a result of this spread there are, notes Pearce, 4.8m people with security clearance “that allows them to access all kinds of personal information”. 1.4m have Top Secret clearance.
But the most sobering figure comes at the end: 1,600 – the number of names added to the FBI’s terrorism watchlist each day.
This is the world of predictive policing that a modern journalist must operate in: where browsing protesters’ websites, making particular searches, or mentioning certain keywords in your emails or tweets can put you on a watchlist, or even a no-fly list. An environment where it is increasingly difficult to protect your sources – or indeed for sources to trust you.
Alan Pearce’s book attempts to map this world – and outline the myriad techniques to avoid compromising your sources. Continue reading
I’ve created this spreadsheet of ‘dirty data’ to demonstrate some typical problems that data cleaning tools and techniques can help with:
- Subheadings that are only used once (and you need them in each row where they apply)
- Odd characters that stand for something else (e.g. a space or ampersand)
- Different entries that mean the same thing, either because they lack pieces of information, have been mistyped, or are inconsistently formatted
It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore simpler techniques in spreadsheets, such as Find and Replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, all lower case, and title case respectively).
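As a rough illustration of what those spreadsheet functions do, here is the same cleaning sketched with Python string methods (the entries are invented):

```python
# Invented examples of the messy entries described above
entries = ["  leeds city council ", "YORK council", "Bradford  Council"]

# TRIM: strip leading/trailing spaces and collapse repeated spaces
trimmed = [" ".join(e.split()) for e in entries]

# UPPER, LOWER and PROPER equivalents
upper  = [e.upper() for e in trimmed]
lower  = [e.lower() for e in trimmed]
proper = [e.title() for e in trimmed]   # PROPER = title case
```

Normalising case and whitespace like this is often enough to make "different entries that mean the same thing" match up again.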
Thanks to Eva Constantaras for suggesting the idea.
There have been quite a few tools springing up over the past few months that I’ve not had time to blog about, so here’s a roundup post on all of them – a bumper Something For The Weekend (let me know how you find these).
1. Junar – for scraping websites and sharing data
Junar presents a much easier way to scrape data from online tables with its ‘Collect Data’ tool – and the team behind it tell me they have plans to build functionality allowing users to scrape linked pages, as well as the ability to scrape PDFs. Continue reading
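Junar itself is a point-and-click tool, but if you want a sense of what scraping an online table involves under the hood, here is a minimal sketch using only Python’s standard library. The HTML is an invented example, and this is not a claim about how Junar works internally:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the rows of any <table> in a page as lists of cell text."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = ""

    def handle_data(self, data):
        if self.cell is not None:      # only collect text inside a cell
            self.cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = None

# Invented page fragment standing in for a real URL's response
html = """<table>
<tr><th>Council</th><th>Spend</th></tr>
<tr><td>Leeds</td><td>1200</td></tr>
<tr><td>York</td><td>800</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(html)
# scraper.rows now holds the heading row followed by the data rows
```

In practice you would fetch the page first (e.g. with `urllib.request`) and feed the response body to the parser – which is exactly the drudgery tools like Junar spare you.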