How to: convert XML or JSON into spreadsheets using Open Refine

Curly brackets pattern by Dan McCullough

One of the most useful applications of the data cleaning tool Open Refine (formerly Google Refine) is converting XML and JSON files into spreadsheets that you can interrogate in Excel.

Surprisingly, I’ve never blogged about it. Until now.

Converting XML files

XML is a highly structured format often used to publish data. You can read more about it in Data for journalists: understanding XML and RSS (although you can ignore the stuff about Yahoo! Pipes, which is now closed).

For this tutorial I’m going to use the XML published by the Food Standards Agency on its hygiene ratings API.

Go to that link, scroll down until you see the section with ‘Downloadable XML data files‘ (shown below), and click on it.

The Food Standards API has a series of links to XML files with the latest food hygiene data

If you click on one of those ‘English language’ links one of two things will happen: either your browser will show you an XML file (shown below), or (if you are using a browser like Internet Explorer) it will try to download it.

What the XML file looks like in a browser

If you’re using Internet Explorer then switch to a better browser like Chrome or Firefox. Copy the URL of the XML file (it should end in .xml).

Using Open Refine to convert XML

Now you have the URL for the XML it’s time to launch Open Refine (download it here).

It should open in your browser at 127.0.0.1:3333 – you can also copy and paste this address into any other browser and it should work as long as Open Refine is still running.

Click on Create Project and then, in the area to the right, select Web Addresses (URLs).

Paste the URL that you copied earlier (the Food Standards Agency XML data).

Click Next.

You will now be taken to a screen to ‘Configure Parsing Options‘. The bottom half allows you to specify how this data is going to be treated: it should automatically detect that it is XML and that option should be ticked. But if not you can tick it yourself.

The top half will show you the actual XML, asking you to ‘Click on the first XML element corresponding to the first record to load‘.

The ‘first record’ means the first row. Because this data has information on a number of different establishments, this means you need to click on the element which means ‘the first establishment’.

XML has a tree structure which is shown visually by each ‘branch’ being indented slightly from the ‘trunk’ or branch that it comes from. The main ‘trunk’ in this particular XML is right at the top: ‘FHRSEstablishment’.

From that trunk there are two branches: ‘Header’ and ‘EstablishmentCollection’.

From the branch ‘EstablishmentCollection’ are a series of branches all called ‘EstablishmentDetail’

And from the first ‘EstablishmentDetail’ branch there are about a dozen further branches all at the same level, starting with ‘FHRSID’ (scroll down and you’ll see the same is the case for each ‘EstablishmentDetail’ branch).

Shown as a bullet list, it looks like this:

FHRSEstablishment
- Header
  - ExtractDate
  - ItemCount
  - ReturnCode
- EstablishmentCollection
  - EstablishmentDetail
    - FHRSID
    - LocalAuthorityBusinessID
    - …and so on

The element corresponding to the first record to load is EstablishmentDetail. That’s because this branch refers to one establishment, i.e. one row. In contrast ‘EstablishmentCollection’ refers to all establishments (the ‘collection’) and ‘FHRSID’ only refers to each establishment’s ID.

So: hover over the bit that says <establishment-detail> and click.

Notice when you hover over EstablishmentDetail it also selects the branches within that, each of which will be a separate column

That should give you a preview showing a row for each establishment, and columns for each piece of data about those. If it doesn’t try again.

Once you get the right preview, give it a name in the box in the upper right corner, and click Create Project.

Once the project is created you don’t need to do anything else: click Export in the top right corner and choose CSV or XLS.

Now you can use it in Excel or another spreadsheet package.

Converting JSON using Open Refine

The process for converting JSON is much the same. If you need a file to practise with, try the UK Police API. This provides data in JSON format on crimes, police forces and other information.

A simple one to start with is their JSON data for all forces. You can read the documentation here, but the URL for the data itself is https://data.police.uk/api/forces

As above, create a new project based on a URL (if you had downloaded the JSON as a file you would create it from your computer instead) and paste that URL.

This time you will be shown the JSON and asked ‘Click on the first JSON { } node corresponding to the first record to load.‘

This is basically the same request as it gives you for XML, only phrased slightly different. This time it is asking for a ‘node’ rather than an ‘element’, but the principle is the same.

As you can see from the screenshot again, if you hover over the first curly bracket you will see all the data from that first ‘record’ highlighted: the ID of the force, and its name.

Once you think you have highlighted the first ‘record’ click to select it and you will get a preview as with the XML instructions above, and can continue to create the project then export it.

Online Journalism Blog

Comment, analysis and links covering online journalism and online news, citizen journalism, blogging, vlogging, photoblogging, podcasts, vodcasts, interactive storytelling, publishing, Computer Assisted Reporting, User Generated Content, searching and all things internet.

How to: convert XML or JSON into spreadsheets using Open Refine

Converting XML files

Using Open Refine to convert XML

Converting JSON using Open Refine

6 thoughts on “How to: convert XML or JSON into spreadsheets using Open Refine”

Leave a comment Cancel reply

Converting XML files

Using Open Refine to convert XML

Converting JSON using Open Refine

Share this:

Related

6 thoughts on “How to: convert XML or JSON into spreadsheets using Open Refine”

Leave a comment Cancel reply