Getting full addresses for data from an FOI response (using APIs)

heatfullcolour11-960x1024

Here’s an example of how APIs can be useful to journalists when they need to combine two sets of data.

I recently spoke to Lincoln investigative journalism student Sean McGrath who had obtained some information via FOI that he needed to combine with other data to answer a question (sorry to be so cryptic).

He had spent 3 days cleaning up the data and manually adding postcodes to it. This seemed a good example where using an API might cut down your work considerably, and so in this post I explain how you make a start on the same problem in less than an hour using Excel, Google Refine and the Google Maps API.

Step 1: Get the data in the right format to work with an API

APIs can do all sorts of things, but one of the things they do which is particularly useful for journalists is answer questions.

If we give the Google Maps API an address, for example, it will give us all sorts of information in return, such as latitude and longitude, postcode, and so on (I’ll explain how to do this later in the post). That’s what we’re going to use here – but we might use other APIs or datasets instead.

Sean’s spreadsheet had one column for school names and another for its town or city – but we needed those details to be together so we had a complete ‘address’. In order to do that we needed to open the spreadsheet in Excel and create a new column that combined the two.

I created the new column on the left (column A) and typed the following into cell A3:

=CONCATENATE(B3, ", ", I3)

This copies the value in cell B3, puts a comma and space after it (“, “), and then copies whatever is in I3. In other words, it combines the two cells to create a full address.

To copy the formula down the whole spreadsheet for all the other rows I used my favourite ever shortcut: hold down CTRL and clicked on the + in the bottom right corner of that cell.

Now it’s ready to use in Google Refine.

Step 2: Using Google Refine to ask Google Maps API a question

Now open the spreadsheet in Google Refine.

In Refine click on the arrow at the top of column A (the one you created using =CONCATENATE) and select Edit Column > Add Column by fetching URLs

A new window will appear with a code box. Type this:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")

That basically creates a URL by adding the address in column A (‘value’) to the Google Maps API URL. The URL itself is basically the spreadsheet ‘asking’ the Google Maps API to give it all the information it has about the address – it also asks it to provide that information in a format called JSON (note ‘json’ in the URL)

You can see all this being done in Google Refine’s own video:

Give your new column a name, and click OK. You’ll see that the new column contains a raft of code – JSON – about each school. This contains all that geographical information – but we still need to extract an address from that.*

Step 3: Using Google Refine to extract the address from the Google Maps data

Create a further column based on the one you’ve just created by clicking the arrow at the top and selecting Edit Column > Add Column based on this column

We need to write some more code. This took a bit of trial and error but here’s what I ended up with:

value.parseJson().results[0].formatted_address

‘value’ is the value in the column we’re basing this new column on. parseJson looks through the JSON code. If you look in it you’ll see there’s a bit called ‘results’, and within that a bit called ‘formatted address’ which has what we need.

Now we have a new column with the full address – including postcode.

Step 4: Using Excel to split the address up again

We can now export this (Export is in the upper right corner of Google Refine) as a spreadsheet and open it up in Excel again.

To split that address column into its parts select it and then select Data > Text to columns to split that address into separate items, with postcodes in their own. (There are other ways you could do this, for example extracting the last 5 characters of each cell instead).

Alternatively, you could get the postcode from the JSON directly, with a different line of code at Step 3 (if you work this out let me know) – or you could extract the lat/long as detailed in the video and use the Postcodes API at http://www.uk-postcodes.com/api.php to get the postcode from that. As always there are various ways to crack the nut.

Note: I haven’t ‘learned’ JSON or GREL or any other language in this – just done a bit of searching (it took around an hour) to find the code that I needed and adapted it with educated guesswork.

*Problems 1 and 2: not all addresses return results from Google Maps because we haven’t given it enough detail. Also, there’s a 2500 limit on ‘free’ calls to their API – and we have 5000+ records, so almost 2000 are returned ‘LIMIT EXCEEDED’. A possible solution to the latter would be to split this into 2 spreadsheets and then merge the results later. A possible solution to the former may be to find – or create by scraping – another dataset that has more address information (for example this one).

FROM THE COMMENTS: Chip Oglesby suggests some other workarounds, including doing it all in Refine and using the Yahoo Maps API for half of the calls.

UPDATE: Tony Hirst follows on from this and finds other solutions to some of the problems outlined.

14 thoughts on “Getting full addresses for data from an FOI response (using APIs)

  1. Tony Hirst

    I think this is a great example of how reconciliation against Linked Data (if it ever works) could be really powerful.

    For example, the education datastore on data.gov.uk gives structured address information (including a postcode identified as such) for UK schools up until about a year ago ( http://education.data.gov.uk/doc/school/100866 ) – unfortunately the data is now stale? It’s all very well opening up data but if it’s not maintained, we may have to question whether any given datastore is really useful for operational activity? Keeping a dataset up-to-date does of course mean we have to take the sustainability of a data source into account when publishing it…

    The Google Refine reconciliation API provides a mechanism for identifying explicitly what a resource refers to by associating it with a unique identifier that has already been minted for that resource in a particular way. For an introduction (albeit one that is a bit techie!) see Leigh Dodds’ post here: http://www.ldodds.com/blog/2010/08/gridworks-reconciliation-api-implementation/

    By the by, the new Talis Kasabi service ( kasabi.com – now in private beta) offers reconciliation APIs against datastores published via Kasabi, though I’ve yet to see how easy it is to exploit these via Google Refine

    tony

    Reply
  2. Chip Oglesby

    It’s good that yall used Excel, Refine and the Maps API, but really all of this could have been done in Google Refine to begin with.
    I’ve been working something very similar the past few weeks and it’s really easy to clean up messy data.

    Instead of using concate in excel, you can use the value + “, ” + cells[‘column’].value

    For the json, you might want to try parseJson(value).results[0].formatted_address

    To split the columns up, you might want to try create a new column based on your current column: cells[‘column’].value.split(” “)[3,4]

    Also, you might consider using Google Maps for 2,500 calls and Yahoo Maps for the other 2,500 calls.

    Good luck!

    Reply
  3. Pingback: A First Quick Viz of UK University Fees « OUseful.Info, the blog…

  4. Pingback: Tech Tips: Making Sense of JSON Strings – Follow the Structure « OUseful.Info, the blog…

  5. Pingback: Getting full addresses for data from an FOI response (using APIs) | Online Journalism Blog

  6. Momoko Price

    Wonderfully helpful and enlightening post. Please keep technical-tip posts like this coming!

    The Google Refine data augmentation video kind of blew my mind. Now if I can just find the time to make good use of it!

    Momoko (@buzzdata)

    Reply
  7. Pingback: SFTW: How to grab useful political data with the They Work For You API | Online Journalism Blog

  8. Pingback: Hands-outs and plans for data track at Global Investigative Journalism Conference in Kiev

  9. Pingback: Geocoding Using the Google Maps Geocoder via OpenRefine « OUseful.Info, the blog…

  10. Pingback: SFTW: How to grab useful political data with the They Work For You API | Online Journalism Blog

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.