A sample dirty dataset for trying out Google Refine

I’ve created this spreadsheet of ‘dirty data‘ to demonstrate some typical problems that data cleaning tools and techniques can be used for:

  • Subheadings that are only used once (and you need them in each row where they apply)
  • Odd characters that stand for something else (e.g. a space or ampersand)
  • Different entries that mean the same thing, either because they are lacking pieces of information, or have been mistyped, or inconsistently formatted

It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore more simple techniques in spreadsheets like Find and replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and titlecase respectively).

Thanks to Eva Constantaras for suggesting the idea.

About these ads

One thought on “A sample dirty dataset for trying out Google Refine

  1. Pingback: Useful Tools

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s