How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)

Cablesearch logo

CableSearch is a neat project by the European Centre for Computer Assisted Research and VVOJ (the Dutch-Flemish association for investigative journalists) which aims to make it easier for journalists to interrogate the Wikileaks cables. Although it’s been around for some time, I’ve only just noticed the site’s API, so I thought I’d show how such an API can be useful as a way to draw on such data sources to complement data of your own.

Example question: “How many Swedish party leaders are mentioned in the cables?”

There’s no particular reason why I picked Sweden, but this is an exercise you could do with any list – MPs, cabinet members, organisational heads, etc.

First, you need to grab the list. I did so by using the =importHTML formula on this Wikipedia page. You would obviously need to check that. Alternatively, you could use =importXML on this official Swedish parliament page for a list of ministers.

(I’m not going to repeat these processes as you can read how to do these by clicking through to the links explaining them above)

Here are the results. As often happens with Wikipedia tables, the first row is shifted so the headings don’t quite match the columns below. As we only need a list of names we don’t have to correct that. (For the =importXML scrape, you’ll also encounter a problem with accented characters, but this will still be quicker to correct than if we were manually copying the list across)

Now download that spreadsheet as a CSV file, and open up Google Refine.

Testing with the API

I’ve previously explained how to use Google Refine with the APIs of Google Maps, UK-Postcodes, and They Work For You (UK politics).

The CableSearch API page is pretty straightforward if you’ve followed any of those – but it’s key that you test what results Google Refine provides against what you get from a manual search (and make sure you have a test that provides unusual results – in this case, anything less than 10 results).

In particular, testing reveals that your search term needs to first be formatted in a particular way to avoid you getting the wrong results.

Formatting your data

So in our data we have a list of names – but if we just run them through CableSearch we will get results where those names do not appear together. In other words, a search for John Jones will bring back results where anyone called John and anyone called Jones is mentioned.

The normal solution is to put quotation marks around the search term, to ensure that only results containing that exact phrase are returned, i.e. “John Jones”.

With an API where we are constructing a URL, however, that space can cause problems because a URL cannot contain a space. We need to replace it with a code for a space: %20 (if you do a search for anything containing a space, you will notice that %20 will sometimes appear in the URL for the results in its place; at other times a + sign will replace the space)

So, here’s how to reformat the text accordingly:

  1. Click on the arrow at the top of your column of names, and select Edit Column > Add column based on this column…
  2. In the window that appears type the following code: ‘”‘+value.split(” “).join(“%20”)+'”‘
  3. Give the column a name and click OK.

The start and end may be difficult to see, so here it is with spaces in between:

‘ ” ‘

You’ll see that it’s a single inverted comma followed by double inverted commas and a further single inverted comma. That adds double inverted commas at the start and end of our new data.

The rest of the code splits the original data wherever there is a space (” “) and joins the resulting fragments together with “%20”.

And so John Jones becomes “John%20Jones” – which will work in the API (one cell has 2 names, however, which you will need to clean up).

Grabbing from the API

Now that we have properly formatted text we can ask the CableSearch API for the information it has on each name. Here’s how:

  1. Click on the arrow at the top of the newly created column of formatted names, and select Edit Column > Add column by fetching URLs
  2. In the window that appears type the following code:”+value
  3. Give the column a name and click OK.

It will now go and fetch data for each name, which may take a few minutes (or more, depending how many names you have).

When it’s finished you should have a column of cells containing JSON data. It will be very hard to look at (more on how to read JSON here) but that’s OK because we’re going to create a final column to extract the piece of data we want.

Extracting from the JSON

The process should be familiar by now:

  1. Click on the arrow at the top of the newly created column of formatted names, and select Edit Column > Add column based on this column…
  2. In the window that appears type the following code: value.parseJson().info.items
  3. Give the column a name and click OK.

This will create a new column which just tells you how many results there are for each name. Where it says ’10’ there are probably more (that’s the maximum value – sadly the API doesn’t return any information on total records, although the API page details one way you can continue to cycle through pages of results beyond the first 10).

This enables you to take a list of names and quickly find out which ones are mentioned in the cables at all, and which ones have been mentioned just a few times – saving you lots of searches, and time, and allowing you to narrow the focus of your work.

A more powerful API would allow you to narrow your focus further: by date range, for example, or source, urgency or classification. The broader point is: this is why APIs are useful. Knowing how to use them (and which ones there are) simply gives you another way to do a job better.


5 thoughts on “How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)

  1. Udo

    That’s a great article detailing the API. I just want to point out that you can use any of the commands from the Cablesearch site UI in the API. For example, “origin:Beijing” would restrict your results to cables from Beijing and so on.

    1. Paul Bradshaw Post author

      Thanks – was hoping that was the case (which was part of the reason I blogged this). Looking forward to the cheat sheet.

  2. Henk van Ess

    Btw.. we were a little lazy with documenting all the stuff.. you can already use commands like “origin:” or “date:” and more! to refine things, Udo will probably post a full cheat sheet

  3. Pingback: Hands-outs and plans for data track at Global Investigative Journalism Conference in Kiev

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s