Here’s an example of how APIs can be useful to journalists when they need to combine two sets of data.
I recently spoke to Lincoln investigative journalism student Sean McGrath who had obtained some information via FOI that he needed to combine with other data to answer a question (sorry to be so cryptic).
He had spent 3 days cleaning up the data and manually adding postcodes to it. This seemed a good example where using an API might cut down your work considerably, and so in this post I explain how you make a start on the same problem in less than an hour using Excel, Google Refine and the Google Maps API.
Step 1: Get the data in the right format to work with an API
APIs can do all sorts of things, but one of the things they do which is particularly useful for journalists is answer questions.
If we give the Google Maps API an address, for example, it will give us all sorts of information in return, such as latitude and longitude, postcode, and so on (I’ll explain how to do this later in the post). That’s what we’re going to use here – but we might use other APIs or datasets instead.
Sean’s spreadsheet had one column for school names and another for its town or city – but we needed those details to be together so we had a complete ‘address’. In order to do that we needed to open the spreadsheet in Excel and create a new column that combined the two.
I created the new column on the left (column A) and typed the following into cell A3:
=CONCATENATE(B3, ", ", I3)
This copies the value in cell B3, puts a comma and space after it (“, “), and then copies whatever is in I3. In other words, it combines the two cells to create a full address.
To copy the formula down the whole spreadsheet for all the other rows I used my favourite ever shortcut: hold down CTRL and clicked on the + in the bottom right corner of that cell.
Now it’s ready to use in Google Refine.
Step 2: Using Google Refine to ask Google Maps API a question
Now open the spreadsheet in Google Refine.
In Refine click on the arrow at the top of column A (the one you created using =CONCATENATE) and select Edit Column > Add Column by fetching URLs
A new window will appear with a code box. Type this:
"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")
That basically creates a URL by adding the address in column A (‘value’) to the Google Maps API URL. The URL itself is basically the spreadsheet ‘asking’ the Google Maps API to give it all the information it has about the address – it also asks it to provide that information in a format called JSON (note ‘json’ in the URL)
You can see all this being done in Google Refine’s own video:
Give your new column a name, and click OK. You’ll see that the new column contains a raft of code – JSON – about each school. This contains all that geographical information – but we still need to extract an address from that.*
Step 3: Using Google Refine to extract the address from the Google Maps data
Create a further column based on the one you’ve just created by clicking the arrow at the top and selecting Edit Column > Add Column based on this column
We need to write some more code. This took a bit of trial and error but here’s what I ended up with:
‘value’ is the value in the column we’re basing this new column on. parseJson looks through the JSON code. If you look in it you’ll see there’s a bit called ‘results’, and within that a bit called ‘formatted address’ which has what we need.
Now we have a new column with the full address – including postcode.
Step 4: Using Excel to split the address up again
We can now export this (Export is in the upper right corner of Google Refine) as a spreadsheet and open it up in Excel again.
To split that address column into its parts select it and then select Data > Text to columns to split that address into separate items, with postcodes in their own. (There are other ways you could do this, for example extracting the last 5 characters of each cell instead).
Alternatively, you could get the postcode from the JSON directly, with a different line of code at Step 3 (if you work this out let me know) – or you could extract the lat/long as detailed in the video and use the Postcodes API at http://www.uk-postcodes.com/api.php to get the postcode from that. As always there are various ways to crack the nut.
Note: I haven’t ‘learned’ JSON or GREL or any other language in this – just done a bit of searching (it took around an hour) to find the code that I needed and adapted it with educated guesswork.
*Problems 1 and 2: not all addresses return results from Google Maps because we haven’t given it enough detail. Also, there’s a 2500 limit on ‘free’ calls to their API – and we have 5000+ records, so almost 2000 are returned ‘LIMIT EXCEEDED’. A possible solution to the latter would be to split this into 2 spreadsheets and then merge the results later. A possible solution to the former may be to find – or create by scraping – another dataset that has more address information (for example this one).
FROM THE COMMENTS: Chip Oglesby suggests some other workarounds, including doing it all in Refine and using the Yahoo Maps API for half of the calls.
UPDATE: Tony Hirst follows on from this and finds other solutions to some of the problems outlined.