Tag Archives: cleaning data

Finding Stories in Spreadsheets – ebook now live!

Finding stories in spreadsheets book cover

Cover design by Matt Buck/Drawnalism

My latest ebook – Finding Stories in Spreadsheets – is now live on Leanpub.

As with Scraping for Journalists, I’m publishing the book week-by-week so the book can be updated based on reader feedback, user suggestions and topical developments.

Each week you can download a new chapter covering a different technique for finding stories, from calculating proportions and changes, to combining data, cleaning it up, testing it, and extracting specific details.

There’s also a downloadable spreadsheet at the end of each chapter with a series of exercises to practise that chapter’s technique and find particular stories.

Along the way I tackle some other considerations in telling the story, such as context and background, and the importance of being specific in the language that you use.

If there’s anything you’d like covered in the book let me know. You can also buy the book in a ‘bundle’ with its sister title Data Journalism Heist, which covers quick-turnaround techniques for finding stories in spreadsheets using pivot tables and advanced filters.


Cleaning data using Google Refine: a quick guide

I’ve been focusing so much on blogging the bells and whistles stuff that Google Refine does that I’ve never actually written about its most simple function: cleaning data. So, here’s what it does and how to do it:

  1. Download and install Google Refine if you haven’t already done so. It’s free.
  2. Run it – it uses your default browser.
  3. In the ‘Create a new project’ window click on ‘Choose file‘ and find a spreadsheet you’re working with. If you need a sample dataset with typical ‘dirty data’ problems I’ve created one you can download here.
  4. Give it a project name and click ‘Create project‘. The spreadsheet should now open in Google Refine in the browser.
  5. At the top of each column you’ll see a downward-pointing triangle/arrow. Click on this and a drop-down menu opens with options including Facet; Text filter; Edit cells; and so on.
  6. Click on Edit cells and a further menu appears.
  7. The second option on this menu is Common transforms. Click on this and a final menu appears (see image below).

You’ll see there are a range of useful functions here to clean up your data and make sure it is consistent. Here’s why:

Trim leading and trailing whitespace

Sometimes in the process of entering data, people put a space before or after a name. You won’t be able to see it, but when it comes to counting how many times something is mentioned, or combining two sets of data, you will hit problems, because as far as a computer or spreadsheet is concerned, ” Jones” is different to “Jones”.

Clicking this option will remove those white spaces.

Collapse consecutive whitespace

Likewise, sometimes a double space will be used instead of a single space – accidentally or through habit, leading to more inconsistent data. This command solves that problem.

Unescape HTML entities

At some point in the process of being collected or published, HTML may be added to data. Typically this represents punctuation of some sort. “"” for example, is the HTML code for quotation marks. (List of this and others here).

This command will convert that cumbersome code into the characters they actually represent.

To titlecase/To uppercase/To lowercase

Another common problem with data is inconsistent formatting – occasionally someone will LEAVE THE CAPS LOCK ON or forget to capitalise a name.

This converts all cells in that column to be consistently formatted, one way or another.

To number/To date/To text

Like the almost-invisible spaces in data entry, sometimes a piece of data can look to you like a number, but actually be formatted as text. And like the invisible spaces, this becomes problematic when you are trying to combine, match up, or make calculations on different datasets.

This command solves that by ensuring that all entries in a particular column are formatted the same way.

Now, I’ve not used that command much and would be a bit careful – especially with dates, where UK and US formatting is different, for example. If  you’ve had experiences or tips on those lines let me know.

Other transforms

In addition to the commands listed above under ‘common transforms’ there are others on the ‘Edit cells’ menu that are also useful for cleaning data:

Split / Join multi-valued cells…

These are useful for getting names and addresses into a format consistent with other data – for example if you want to split an address into street name, city, postcode; or join a surname and forename into a full name.

Cluster and edit…

A particularly powerful cleaning function in Google Refine, this looks at your column data and suggests ‘clusters’ where entries are similar. You can then ask it to change those similar entries so that they have the same value.

There is more than one algorithm (shown in 2 drop-down menus: Method and Keying function) used to cluster – try each one in turn, as some pick up clusters that others miss.

If you have any other tips on cleaning data with Google Refine, please add them.

Data journalism pt2: Interrogating data

This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.

UPDATE: It has now been published in The Online Journalism Handbook.

“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)

Once you have the data you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals. Continue reading