How to: clean a converted PDF using Open Refine

Our initial table

This spreadsheet sent in response to an FOI request appeared to have been converted from PDF format

In a guest post post for OJB, Ion Mates explains how he used OpenRefine to clean up a spreadsheet which had been converted from PDF format. An earlier version of this post was published on his blog.

Journalists rarely get their hands on nice, tidy data: public bodies don’t have an interest in providing information in a structured form. So it is increasingly part of a journalist’s job to get that information into the right state before extracting patterns and stories.

A few months ago I sent a Freedom of Information request asking for the locations of all litter bins in Birmingham. But instead of sending a spreadsheet downloaded directly from their database, the spreadsheet they sent appeared to have been converted from a multiple-page PDF.

This meant all sorts of problems, from rows containing page numbers and repeated header rows, to information split across multiple rows and even pages.

In this post I’ll be taking you through I used the free data cleaning tool OpenRefine (formerly Google Refine) to tackle all these problems and create a clean version of the data.

Importing and preparing

Here again is the spreadsheet I had to start with:

Our initial table

The start of the original spreadsheet. Note that the title ‘Litterbins’ is not needed, and some locations are split across more than one row

A good first stage when cleaning data is to identify the problems. These included:

  • blank cells
  • empty columns.
  • headers and page numbers repeating
Page numbers and headers appearing regularly and must be removed

Row 53 is just the first of many to include a page number and the header row repeated

We also want each pair of coordinates to have a location description that occupies a single row.

OpenRefine is specifically designed for problems like this.

First, we import our file and we parse our first two rows as column headers. (You can find a more detailed explanation of this cleaning-during-import process on this post on Online Journalism Blog.)

We then delete empty columns and rename the other columns accordingly.

Now our table looks like this:

Our table in OpenRefine is ready to be cleaned

After some basic cleaning during the import process, we have a simpler dataset to clean

Removing recurring headers

We should first start removing header names that keep repeating.

One way to do this would be to find each ROAD_NAME string and delete it (something you could do with Find and Replace in Excel).

However, it is better practice to create a separate column with the cleaned data so that you can always return to the original in case of a mistake, or to check the results against it.

To create this new column left-click on the drop-down menu at the top of the column we have renamed as Road name. Then select Edit column>Add column based on this column…

In the window that appears we give a name to the new column we are about to create. Then we use an expression (in Google Refine Expression Language, or GREL) to grab the value in each cell of the column we first selected, and replace ROAD_NAME with nothing (“”):

value.replace("ROAD_NAME", "")

What this means is that the new column will be the same as the column that was selected, but where the string of characters ‘ROAD_NAME’ appears, that will be replaced with no characters.

In practice, this means that most cells will be copied across unchanged, but those cells with column headings will be copied across as blanks. We can be confident of this because no road name should have that string of characters in it – only column headings. If it was a more common string of characters we may have to be more cautious or add more steps.

We do the same for Location and Ward.

We can remove the recurring POINT_X from the last column by creating a new column in the same way, but there is another way too:

We can change the column itself by clicking on the drop-down menu at the top of that column and selecting Edit column>Transform, then typing a similar expression to before. The key difference is that this transforms the column itself rather than creating a new one based on it.

The table now has extra columns with the headers removed

The results after 3 new columns are created which remove the header cells, and the Point Y column is cleaned too.

Joining rows

Some of our cells have multiple rows and multiple values which we need to join. To correct this we first need a reliable column that always has only one cell for each entry and is never split between multiple rows. A column we can use as our index column.

In our case, it’s Point X or Y – unlike the addresses, these values are never split – so we move one of them at the beginning by clicking the drop-down menu at the top of the column and selecting Edit column>Move column to beginning

Now, for each cleaned column we can click on its drop-down menu and go to Edit cells>Join multi-valued cells… 

We’ll be asked for a delimiter (a character that is inserted between each cell being joined), which in our case is a blank space; so just hit your space-bar.

After that we delete the old columns.

Now each entry occupies a single row

Multi-valued entries have been joined. Now each entry occupies a single row

We’re almost done. As can be seen above, empty rows remain where the old values were deleted. This is why we move Point X to the beginning.

We can use that column to filter our data based on certain criteria – this is called faceting in OpenRefine – and then delete the data which meets it. In this case, we need to facet so that we can only see the blank rows – and then delete them.

To do that click on the drop-down meny on the Point X column and select Facet>Customized facets>Facet by blank. 

A facet initially works much like a spreadsheet pivot table. We’re telling OpenRefine to display the ‘count’ of particular values in a column.

In this case, we’ve faceted on blank cells so a window in the left will display a count of True entries (those which are blank), and a count of False entries (those which are not blank).

Faceting the first column for blank cells

Faceting on blank cells will give you a count of the number of rows where that cell is blank (true) and the number where it is not blank (false)

We can now select the True entries (by clicking on true in the facet on the left).

We are now only looking at all of Point X’s blank cells as well as the rows that go with them. You should be able to see this because above the data you should see something like ‘XXXX matching rows (YYYY total)‘ (where the XXXX is a number of matching rows and YYYY is the number of rows in the data as a whole when not faceted).

All we need to do is click on the drop-down menu on the All column (furthest left), and select Edit rows>Remove all matching rows.

Oh no! All of your data has disappeared! That’s because OpenRefine is still faceted to only show blank rows, of which there are now none. Just close the facet window on the left and now you can see all of your data. This is now nice and clean, with those empty rows removed.

Our data is now ready to be interrogated

Now cleaned, our data is now ready to be interrogated

OpenRefine is a powerful tool for doing this sort of work and contrary to popular belief, one needn’t learn all of it. Just use what you know and look up what you don’t.

A final clean with regex

The table looks good – but some cells from Location still have the page numbers on them. If you want to take it further, you can get rid of these so that the addresses are completely clean.

To do this, we’ll have to use something called Regex. This is short for ‘regular expression’ and is a very useful way of performing some sort of process wherever a particular pattern of characters occurs.

That’s the situation in our case: what we want to delete has the form of Page [number] of 192. We can, then, instruct OpenRefine to look for this regular expression. A list of commands for describing regular expressions is here.

So, we first make a duplicate of our column. Then we apply Edit cells>Transform on it with the following formula:

value.replace(/\sPage\s\d\d?\d?\sof\s192/, "")

And here’s the explanation:

  • Firstly, the whole regular expression is contained within the forward slash character `/` – this just tells OpenRefine that we are using regex
  • Then `\s` indicates a space
  • `Page` simply indicates the characters P, a, g, e in that order. This is followed by another space.
  • `\d` indicates a digit
  • `\d?` means an optional digit: in other words, there may be a digit here, or not. There are two of these. This is because our page number could be anything from 1 (one digit) to 99 (two digits) to 999 (three digits).
  • `of\s192` again indicates those two characters ‘of’, followed by a space, followed by the numbers 192.
Click to see an interactive clickable version of this regex

Click to see an interactive clickable version of this regex

With those characters replaced we have cleaned the final bit of ‘dirt’ from our dirty data, and we’re ready to use it.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.