I’ve created this spreadsheet of ‘dirty data’ to demonstrate some typical problems that data cleaning tools and techniques can solve:
- Subheadings that are only used once (and you need them in each row where they apply)
- Odd characters that stand for something else (e.g. a space or ampersand)
- Different entries that mean the same thing, because they are missing pieces of information, have been mistyped, or are inconsistently formatted
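For readers who prefer code to a spreadsheet, here is a minimal Python sketch of how those three problems might be tackled. The rows, the column names (`region`, `name`) and the replacement table are invented for illustration and are not taken from the spreadsheet itself.

```python
# Hypothetical rows: each subheading ("region") appears once, with blanks below it.
rows = [
    {"region": "North", "name": "Smith &amp; Sons"},
    {"region": "", "name": "smith &amp; sons "},
    {"region": "South", "name": "Jones%20Ltd"},
    {"region": "", "name": "Jones Ltd"},
]

# Assumed examples of odd characters standing in for something else.
REPLACEMENTS = {"&amp;": "&", "%20": " "}


def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the blank cells below it."""
    filled = []
    last_seen = ""
    for row in rows:
        value = row[column].strip()
        if value:
            last_seen = value
        filled.append({**row, column: last_seen})
    return filled


def replace_odd_characters(text):
    """Swap each stand-in sequence for the character it represents."""
    for odd, real in REPLACEMENTS.items():
        text = text.replace(odd, real)
    return text


def normalise(text):
    """Build a comparison key: fix odd characters, collapse spaces, ignore case."""
    return " ".join(replace_odd_characters(text).split()).casefold()


cleaned = fill_down(rows, "region")
for row in cleaned:
    row["key"] = normalise(row["name"])
    print(row["region"], "|", row["key"])
```

Entries that share a `key` can then be reviewed together. This only catches differences in spacing, case and the listed stand-in characters; genuine misspellings need fuzzier matching of the kind a dedicated cleaning tool provides.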
The spreadsheet is best used alongside this post introducing basic features of Google Refine. But you can also use it to explore simpler techniques in spreadsheets, such as Find and replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and title case respectively).
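If it helps to see what those spreadsheet functions do, the sketch below gives rough Python equivalents. The helper names `trim` and `trim_all_whitespace` are my own, and `str.title()` is only an approximation of PROPER, so treat this as an illustration rather than an exact match for any particular spreadsheet program.

```python
def trim(text):
    """Roughly what TRIM does: drop leading/trailing spaces, collapse repeats."""
    return " ".join(part for part in text.split(" ") if part)


def trim_all_whitespace(text):
    """An alternative that also handles tabs and non-breaking spaces."""
    return " ".join(text.split())


messy = "  nEW   yORK "

print(repr(trim(messy)))          # spaces tidied, case untouched
print(trim(messy).upper())        # like UPPER
print(trim(messy).lower())        # like LOWER
print(trim(messy).title())        # approximately PROPER

# A non-breaking space (\xa0) survives trim() but not the alternative.
print(repr(trim("\xa0new york")))
print(repr(trim_all_whitespace("\xa0new\tyork")))
```

The second helper shows why an alternative to TRIM is sometimes needed: data copied from web pages often contains non-breaking spaces, which look like ordinary spaces but are a different character.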
Thanks to Eva Constantaras for suggesting the idea.
UPDATE: Peter Verweij has put together an introduction to some other cleaning techniques here.