Cleaning data using Google Refine: a quick guide

I’ve been focusing so much on blogging the bells and whistles stuff that Google Refine does that I’ve never actually written about its most simple function: cleaning data. So, here’s what it does and how to do it:

  1. Download and install Google Refine if you haven’t already done so. It’s free.
  2. Run it – it uses your default browser.
  3. In the ‘Create a new project’ window click on ‘Choose file‘ and find a spreadsheet you’re working with. If you need a sample dataset with typical ‘dirty data’ problems I’ve created one you can download here.
  4. Give it a project name and click ‘Create project‘. The spreadsheet should now open in Google Refine in the browser.
  5. At the top of each column you’ll see a downward-pointing triangle/arrow. Click on this and a drop-down menu opens with options including Facet; Text filter; Edit cells; and so on.
  6. Click on Edit cells and a further menu appears.
  7. The second option on this menu is Common transforms. Click on this and a final menu appears (see image below).

You’ll see there are a range of useful functions here to clean up your data and make sure it is consistent. Here’s why:

Trim leading and trailing whitespace

Sometimes in the process of entering data, people put a space before or after a name. You won’t be able to see it, but when it comes to counting how many times something is mentioned, or combining two sets of data, you will hit problems, because as far as a computer or spreadsheet is concerned, ” Jones” is different to “Jones”.

Clicking this option will remove those white spaces.

Collapse consecutive whitespace

Likewise, sometimes a double space will be used instead of a single space – accidentally or through habit, leading to more inconsistent data. This command solves that problem.

Unescape HTML entities

At some point in the process of being collected or published, HTML may be added to data. Typically this represents punctuation of some sort. “"” for example, is the HTML code for quotation marks. (List of this and others here).

This command will convert that cumbersome code into the characters they actually represent.

To titlecase/To uppercase/To lowercase

Another common problem with data is inconsistent formatting – occasionally someone will LEAVE THE CAPS LOCK ON or forget to capitalise a name.

This converts all cells in that column to be consistently formatted, one way or another.

To number/To date/To text

Like the almost-invisible spaces in data entry, sometimes a piece of data can look to you like a number, but actually be formatted as text. And like the invisible spaces, this becomes problematic when you are trying to combine, match up, or make calculations on different datasets.

This command solves that by ensuring that all entries in a particular column are formatted the same way.

Now, I’ve not used that command much and would be a bit careful – especially with dates, where UK and US formatting is different, for example. If  you’ve had experiences or tips on those lines let me know.

Other transforms

In addition to the commands listed above under ‘common transforms’ there are others on the ‘Edit cells’ menu that are also useful for cleaning data:

Split / Join multi-valued cells…

These are useful for getting names and addresses into a format consistent with other data – for example if you want to split an address into street name, city, postcode; or join a surname and forename into a full name.

Cluster and edit…

A particularly powerful cleaning function in Google Refine, this looks at your column data and suggests ‘clusters’ where entries are similar. You can then ask it to change those similar entries so that they have the same value.

There is more than one algorithm (shown in 2 drop-down menus: Method and Keying function) used to cluster – try each one in turn, as some pick up clusters that others miss.

If you have any other tips on cleaning data with Google Refine, please add them.

18 thoughts on “Cleaning data using Google Refine: a quick guide

  1. Momoko Price

    Hi Paul,

    By the way, a while ago you posted an entry on filling out exact addresses using the Google Maps API and GRefine which referred to a Google Refine tutorial video on Data Augmentation. I’ve since noticed that this video’s been taken down. Do you know anywhere else it’s available? It was really informative.

    Reply
  2. Paul Bradshaw Post author

    It may be that the video is still there but the embed code no longer works – will try to find the post in question.

    Reply
  3. Momoko Price

    Yep — here’s the original post from March 2011:

    http://paulbradshaw.wpengine.com/2011/03/18/getting-full-addresses-for-school-data-in-an-foi-response/

    Note the data augmentation video embed — Google made it private all of a sudden, and I’ve never figured out why or how to get around this. It’s a shame, because the video (which explains how to reconcile with Freebase and pretty much keep adding data ad infinitum to your original set) was really great.

    Wondering if there might be a public substitute version out there …

    Reply
    1. Paul Bradshaw Post author

      How curious – you’re right. As you say, it was a very useful video – not aware of any duplicates sorry.

      Reply
  4. Pingback: The inverted pyramid of data journalism | Online Journalism Blog

  5. Pingback: Hands-outs and plans for data track at Global Investigative Journalism Conference in Kiev

    1. Paul Bradshaw Post author

      Try using Facet > text facet and then in the section on the left where it shows each value, click ‘edit’ next to one and change it to what you want it to be changed to. Then repeat for the other value if relevant.

      Reply
      1. Deepak Taunk

        I have a file of 30,000 instances ..can’t do it for all of them ..is there any other shorter way…thnks

      2. Paul Bradshaw Post author

        That’s what Refine does for you. If you create a facet, you only need to edit a value in that facet once (on the left hand column) and all instances of that value will be changed. E.g. if your facet says “uk;usa” – 15,000 – edit and you click edit it will change all 15,000 to whatever you change it to.

  6. Pingback: A sample dirty dataset for trying out Google Refine | Online Journalism Blog

  7. Pingback: TomMeagher.com » Blog Archive » More tips for using OpenRefine

  8. Pingback: Cleaning data using Google Refine: a quick guide | Online Journalism Blog

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.