SFTW: Scraping data with Google Refine

Paul Bradshaw — Fri, 13 Jan 2012 08:27:12 +0000

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpages: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need. This will reduce the time that the scraper will take in going through them. For example, if you’re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.

Now to combine the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

"http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID="+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0’) UPDATE: in the comments Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”

Click OK if you’re happy, and you should have a new column with a URL for each school ID.
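For comparison, the same URL-building step can be sketched in a few lines of Python. This is only an illustration and not part of the Refine workflow: the function name build_url and the two sample IDs are made up, and the ‘.0’ handling assumes the ID was read in as a number.

```python
# Base URL taken from the GREL expression above.
BASE_URL = (
    "http://www.ltscotland.org.uk/scottishschoolsonline/schools/"
    "freemealentitlement.asp?iSchoolID="
)


def build_url(school_id):
    """Join the base URL and one school ID (hypothetical helper)."""
    text = str(school_id)
    # An ID parsed as a number may arrive as e.g. 7654321.0; drop the ".0".
    if text.endswith(".0"):
        text = text[:-2]
    return BASE_URL + text


# Hypothetical IDs: one stored as text, one stored as a number.
ids = ["1234567", 7654321.0]
urls = [build_url(i) for i in ids]
```

Stripping a trailing ‘.0’ rather than taking the first 7 characters means the sketch does not depend on the ID length.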

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.
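If you are curious what this fetching step amounts to, here is a rough Python sketch using only the standard library. It is an assumption-laden illustration, not what Refine does internally: the function names, the one-second pause and the choice to store None for failed pages are all mine.

```python
import time
import urllib.request


def fetch_html(url, timeout=30):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_html, delay_seconds=1.0):
    """Fetch every URL in turn, pausing between requests.

    Returns a dict of url -> HTML, with None where the fetch failed.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError:
            # Network errors (urllib's URLError is an OSError) are recorded, not fatal.
            results[url] = None
        time.sleep(delay_seconds)
    return results
```

Passing the fetch function in as a parameter makes it easy to try the loop on a handful of URLs, or with a stand-in function, before leaving it to run over thousands of rows.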

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, view the page source, and search for a key phrase or figure that you want to extract. Around that data you want to find a HTML tag like <table class="destinations">. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select("table.destinations")[0].select("tr").toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select("table.destinations")

find a table with a class (.) of “destinations” (in the source HTML this reads <table class="destinations">). If it was <div id="statistics"> then you would write .select("div#statistics") – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select("tr")

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select("table.destinations")[0].select("td")[0].toString()

Which would give you this:

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select("table.destinations")[0].select("td")[0].toString().substring(5,10)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.
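To see the same selection logic outside Refine, here is a minimal Python sketch using the standard library’s HTML parser. It mimics ‘first table with class destinations, then its td cells’; the class name FirstTableCells and the sample HTML are hypothetical, and real pages may need a more forgiving parser.

```python
from html.parser import HTMLParser


class FirstTableCells(HTMLParser):
    """Collect the text of each <td> in the first <table> with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # table nesting depth inside the target (0 = outside)
        self.done = False    # True once the target table has closed
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth:
                self.depth += 1
            elif self.css_class in classes:
                self.depth = 1
        elif tag == "td" and self.depth:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.depth and not self.done:
            self.cells[-1] += data


# Hypothetical page fragment for illustration only.
SAMPLE = """
<table class="other"><tr><td>ignore</td></tr></table>
<table class="destinations">
<tr><th>Percentage of pupils</th><td>25.5%</td><td>16.3%</td><td>22.6%</td></tr>
</table>
"""

parser = FirstTableCells("destinations")
parser.feed(SAMPLE)
first_cell = parser.cells[0].strip()  # counterpart of .select("td")[0]
```

Because the parser collects only the text inside each cell, there are no tags left to strip with a substring afterwards.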

Then click Export in the upper right corner and save as a CSV or Excel file.

More on how this data was used on Help Me Investigate Education.

Scraping data from a list of webpages using Google Docs

Paul Bradshaw — Fri, 14 Oct 2011 20:56:40 +0000

Quite often when you’re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one – or even scrape the data individually – would take time. Here I explain a way to use Google Docs to grab the data for you.

Some basic principles

Although Google Docs is a pretty clumsy tool to use to scrape webpages, the method used is much the same as if you were writing a scraper in a programming language like Python or Ruby. For that reason, I think this is a good quick way to introduce the basics of certain types of scrapers.

Here’s how it works:

Firstly, you need a list of links to the pages containing data.

Quite often that list might be on a webpage which links to them all, but if not you should look at whether the links have any common structure, for example “http://www.country.com/data/australia” or “http://www.country.com/data/country2”. If they do, then you can generate a list by filling in the part of the URL that changes each time (in this case, the country name or number), assuming you have a list to fill it from (i.e. a list of countries, codes or simple addition).

Second, you need the destination pages to have some consistent structure to them. In other words, they should look the same (although looking the same doesn’t mean they have the same structure – more on this below).

The scraper then cycles through each link in your list, grabs particular bits of data from each linked page (because it is always in the same place), and saves them all in one place.

Scraping with Google Docs using =importXML – a case study

If you’ve not used =importXML before it’s worth catching up on my previous 2 posts How to scrape webpages and ask questions with Google Docs and =importXML and Asking questions of a webpage – and finding out when those answers change.

This takes things a little bit further.

In this case I’m going to scrape some data for a story about local history – the data for which is helpfully published by the Durham Mining Museum. Their homepage has a list of local mining disasters, with the date and cause of the disaster, the name and county of the colliery, the number of deaths, and links to the names and to a page about each colliery.

However, there is not enough geographical information here to map the data. That, instead, is provided on each colliery’s individual page.

So we need to go through this list of webpages, grab the location information, and pull it all together into a single list.

Finding the structure in the HTML

To do this we need to isolate which part of the homepage contains the list. If you right-click on the page to ‘view source’ and search for ‘Haig’ (the first colliery listed) we can see it’s in a table whose opening tag has a style attribute beginning font-size:10pt.

We can use =importXML to grab the contents of the table like so:

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]")

But we only want the links, so how do we grab just those instead of the whole table contents?

The answer is to add more detail to our request. If we look at the HTML that contains the link, it looks like this:

<td><a href="http://www.dmm.org.uk/colliery/h029.htm">Haig Pit</a></td>

So it’s within a <td> tag – but all the data in this table is, not surprisingly, contained within <td> tags. The key is to identify which <td> tag we want – and in this case, it’s always the fourth one in each row.

So we can add "//td[4]" (‘look for the fourth <td> tag’) to our function like so:

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]//td[4]")

Now we should have a list of the collieries – but we want the actual URL of the page that is linked to with that text. That is contained within the value of the href attribute – or, put in plain language: it comes after the bit that says href=".

So we just need to add one more bit to our function: "//@href":

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]//td[4]//@href")

So, reading from the far right inwards, this is what it says: “Grab the value of href, within the fourth <td> tag on every row, of the table that has a style value of font-size:10pt”

Note: if there was only one link in every row, we wouldn’t need to include //td[4] to specify the link we needed.
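The same query can be sketched in Python to show how close the spreadsheet formula is to a ‘real’ scraper. This is a hedged illustration: the standard library’s ElementTree only understands a subset of XPath (no starts-with, no //@href), so the style check and the href lookup are done in plain Python, and the sample markup, including every URL other than the Haig Pit one, is hypothetical and assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the list page.
SAMPLE = """<html><body>
<table style="font-size:10pt; border:0">
<tr><td>date</td><td><a href="http://www.dmm.org.uk/names/example1.htm">names</a></td>
<td>cause</td><td><a href="http://www.dmm.org.uk/colliery/h029.htm">Haig Pit</a></td></tr>
<tr><td>date</td><td><a href="http://www.dmm.org.uk/names/example2.htm">names</a></td>
<td>cause</td><td><a href="http://www.dmm.org.uk/colliery/example.htm">Example Pit</a></td></tr>
</table>
</body></html>"""


def colliery_links(markup):
    """Return the href of each link in the fourth <td> of every row."""
    root = ET.fromstring(markup)
    links = []
    for table in root.iter("table"):
        # Plain-Python version of starts-with(@style, 'font-size:10pt').
        if not table.get("style", "").startswith("font-size:10pt"):
            continue
        # td[4] selects the fourth <td> child of each <tr>.
        for cell in table.findall(".//tr/td[4]"):
            for anchor in cell.iter("a"):
                href = anchor.get("href")
                if href:
                    links.append(href)
    return links


links = colliery_links(SAMPLE)
```

The second cell of each sample row also contains a link, which is why the td[4] step matters: without it the ‘names’ links would be collected too.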

Scraping data from each link in a list

Now we have a list – but we still need to scrape some information from each link in that list.

Firstly, we need to identify the location of information that we need on the linked pages. Taking the first page, view source and search for ‘Sheet 89’, which are the first two words of the ‘Map Ref’ line.

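The HTML code around that information looks like this:

(Sheet 89) NX965176, 54° 32' 35" N, 3° 36' 0" W

Looking a little further up, the table that contains this cell has a width attribute of 95%, and the map reference sits in the second cell of its second row.

So if we needed to scrape this information, we would write a function like this:

=importXML("http://www.dmm.org.uk/colliery/h029.htm", "//table[starts-with(@width, '95%')]//tr[2]//td[2]")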

…And we’d have to write it for every URL.

But because we have a list of URLs, we can do this much quicker by using cell references instead of the full URL.

So. Let’s assume that your formula was in cell C2 (as it is in this example), and the results have formed a column of links going from C2 down to C11. Now we can write a formula that looks at each URL in turn and performs a scrape on it.

In D2 then, we type the following:

=importXML(C2, "//table[starts-with(@width, '95%')]//tr[2]//td[2]")

If you copy the cell all the way down the column, it will change the function so that it is performed on each neighbouring cell.

In fact, we could simplify things even further by putting the second part of the function in cell D1 – without the quotation marks – like so:

//table[starts-with(@width, '95%')]//tr[2]//td[2]

And then in D2 change the formula to this:

=ImportXML(C2,$D$1)

(The dollar signs keep the D1 reference the same even when the formula is copied down, while C2 will change in each cell)

Now it works – we have the data from each of 8 different pages. Almost.

Troubleshooting with =IF

The problem is that the structure of those pages is not as consistent as we thought: the scraper is producing extra cells of data for some, which knocks out the data that should be appearing there from other cells.

So I’ve used an IF formula to clean that up as follows:

In cell E2 I type the following:

=if(D2="", ImportXML(C2,$D$1), D2)

Which says ‘If D2 is empty, then run the importXML formula again and put the results here, but if it’s not empty then copy the values across’

That formula is copied down the column.

But there’s still one empty column even now, so the same formula is used again in column F:

=if(E2="", ImportXML(C2,$D$1), E2)
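The underlying clean-up idea, taking the first non-empty value from a run of cells, can be sketched in Python. Note that this swaps in a simpler technique than the formula above: it does not re-run any import, it only picks the first cell in a row that holds something. The function name and sample rows are hypothetical.

```python
def first_non_empty(cells):
    """Return the first cell that is not empty or whitespace, else ''."""
    for cell in cells:
        if cell is not None and str(cell).strip():
            return cell
    return ""


# Hypothetical rows where the wanted value has landed in different columns.
rows = [
    ["(Sheet 89) NX965176", ""],
    ["", "hypothetical map reference"],
    ["", ""],
]
cleaned = [first_non_empty(row) for row in rows]
```

A row that is empty throughout stays empty, which makes it easy to spot the pages that still need checking by hand.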

A hack, but an instructive one

As I said earlier, this isn’t the best way to write a scraper, but it is a useful way to start to understand how they work, and a quick method if you don’t have huge numbers of pages to scrape. With hundreds of pages, it’s more likely you will miss problems – so watch out for inconsistent structure and data that doesn’t line up.

How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)

Paul Bradshaw — Fri, 09 Sep 2011 12:33:25 +0000

CableSearch is a neat project by the European Centre for Computer Assisted Research and VVOJ (the Dutch-Flemish association for investigative journalists) which aims to make it easier for journalists to interrogate the Wikileaks cables. Although it’s been around for some time, I’ve only just noticed the site’s API, so I thought I’d show how such an API can be useful as a way to draw on such data sources to complement data of your own.

Example question: “How many Swedish party leaders are mentioned in the cables?”

There’s no particular reason why I picked Sweden, but this is an exercise you could do with any list – MPs, cabinet members, organisational heads, etc.

First, you need to grab the list. I did so by using the =importHTML formula on this Wikipedia page. You would obviously need to check that. Alternatively, you could use =importXML on this official Swedish parliament page for a list of ministers.

(I’m not going to repeat these processes as you can read how to do these by clicking through to the links explaining them above)

Here are the results. As often happens with Wikipedia tables, the first row is shifted so the headings don’t quite match the columns below. As we only need a list of names we don’t have to correct that. (For the =importXML scrape, you’ll also encounter a problem with accented characters, but this will still be quicker to correct than if we were manually copying the list across)

Now download that spreadsheet as a CSV file, and open up Google Refine.

Testing with the API

I’ve previously explained how to use Google Refine with the APIs of Google Maps, UK-Postcodes, and They Work For You (UK politics).

The CableSearch API page is pretty straightforward if you’ve followed any of those – but it’s key that you test what results Google Refine provides against what you get from a manual search (and make sure you have a test that provides unusual results – in this case, anything less than 10 results).

In particular, testing reveals that your search term needs to first be formatted in a particular way to avoid you getting the wrong results.

Formatting your data

So in our data we have a list of names – but if we just run them through CableSearch we will get results where those names do not appear together. In other words, a search for John Jones will bring back results where anyone called John and anyone called Jones is mentioned.

The normal solution is to put quotation marks around the search term, to ensure that only results containing that exact phrase are returned, i.e. “John Jones”.

With an API where we are constructing a URL, however, that space can cause problems because a URL cannot contain a space. We need to replace it with a code for a space: %20 (if you do a search for anything containing a space, you will notice that %20 will sometimes appear in the URL for the results in its place; at other times a + sign will replace the space)

So, here’s how to reformat the text accordingly:

Click on the arrow at the top of your column of names, and select Edit Column > Add column based on this column…
In the window that appears type the following code: '"'+value.split(" ").join("%20")+'"'
Give the column a name and click OK.

The start and end may be difficult to see, so here it is with spaces in between:

' " '

You’ll see that it’s a single inverted comma followed by double inverted commas and a further single inverted comma. That adds double inverted commas at the start and end of our new data.

The rest of the code splits the original data wherever there is a space (" ") and joins the resulting fragments together with "%20".

And so John Jones becomes “John%20Jones” – which will work in the API (one cell has 2 names, however, which you will need to clean up).
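The split-and-join step reads much the same in Python, which may help if the GREL looks cryptic. This is a sketch only; the function name format_search_term is made up for illustration.

```python
def format_search_term(name):
    """Wrap a name in double quotes and replace each space with %20."""
    # Split on spaces, rejoin with %20, then add the surrounding quotes.
    return '"' + "%20".join(name.split(" ")) + '"'


formatted = format_search_term("John Jones")
```

As in the GREL version, only spaces are handled here; accented characters or other punctuation in a name would need further encoding.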

Grabbing from the API

Now that we have properly formatted text we can ask the CableSearch API for the information it has on each name. Here’s how:

Click on the arrow at the top of the newly created column of formatted names, and select Edit Column > Add column by fetching URLs
In the window that appears type the following code: "http://cablesearch.org/cable/api/search?q="+value
Give the column a name and click OK.

It will now go and fetch data for each name, which may take a few minutes (or more, depending how many names you have).

When it’s finished you should have a column of cells containing JSON data. It will be very hard to look at (more on how to read JSON here) but that’s OK because we’re going to create a final column to extract the piece of data we want.

Extracting from the JSON

The process should be familiar by now:

Click on the arrow at the top of the newly created column of JSON data, and select Edit Column > Add column based on this column…
In the window that appears type the following code: value.parseJson().info.items
Give the column a name and click OK.

This will create a new column which just tells you how many results there are for each name. Where it says ‘10’ there are probably more (that’s the maximum value – sadly the API doesn’t return any information on total records, although the API page details one way you can continue to cycle through pages of results beyond the first 10).
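As a rough Python counterpart to the last two steps, the sketch below builds the API URL and pulls the count out of a response. The only parts taken from this post are the base URL and the info.items path; the sample response and both function names are hypothetical, and the real JSON will contain much more.

```python
import json

# Base URL from the 'Add column by fetching URLs' step above.
API_BASE = "http://cablesearch.org/cable/api/search?q="


def api_url(formatted_term):
    """Append an already formatted search term to the API base URL."""
    return API_BASE + formatted_term


def result_count(json_text):
    """Counterpart of value.parseJson().info.items."""
    return json.loads(json_text)["info"]["items"]


# Hypothetical response, reduced to the one field used here.
SAMPLE_RESPONSE = '{"info": {"items": 3}}'
count = result_count(SAMPLE_RESPONSE)
```

A count of 10 should be read as ‘10 or more’, for the reason given above.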

This enables you to take a list of names and quickly find out which ones are mentioned in the cables at all, and which ones have been mentioned just a few times – saving you lots of searches, and time, and allowing you to narrow the focus of your work.

A more powerful API would allow you to narrow your focus further: by date range, for example, or source, urgency or classification. The broader point is: this is why APIs are useful. Knowing how to use them (and which ones there are) simply gives you another way to do a job better.

SFTW: 9 data journalism tools

Paul Bradshaw — Fri, 19 Aug 2011 10:26:18 +0000

There have been quite a few tools springing up over the past few months that I’ve not had time to blog about, so here’s a roundup post on all of them – a bumper Something For The Weekend (let me know how you find these).

1. Junar – for scraping websites and sharing data

Junar presents a much easier way to scrape data from online tables with its ‘Collect Data’ tool – and the team behind it tell me they have plans to build functionality allowing users to scrape linked pages, as well as the ability to scrape PDFs.

2. BuzzData – for sharing data

BuzzData is a platform for sharing data – essentially a social network where you can follow other data journalists or datasets, tag and license your data, and – importantly – add visualisations, articles and attachments. When someone else builds on your data, it tells you, which is nice.

3. DataMarket – for finding data

DataMarket is exactly what it says on the tin: a market for data from organisations including the UN, BP, Eurostat, the IMF, USGS, and various other acronyms. You can access the data for free, or pay for extra functionality such as exporting to Excel.

4. Google News Scraper – for grabbing data on news coverage

This scraper will allow you to gather data on coverage of a particular issue, event or person. It only gathers the teaser text but the country data may be useful if you want to map coverage, while the URLs can provide a starting point for further scraping experiments.

5. Metadata extraction tool – a first step for searching document dumps?

This is aimed at file preservation activities, but it has a few possible applications for journalists. Firstly, it has a Windows interface for exploring the metadata of a bunch of files, making it possible to sort in different ways to more quickly look for information you’re seeking. Secondly, the generation of an XML file will give some structure which could allow you to, for example, plot your documents on a timeline, spotting patterns or outliers.

6. Roambi – data visualisation on your iPhone

Sadly, it’s only your iPhone, not anyone else’s, so this is more if you’re on the move but want to go through some private data visualisations which might hide a story.

7. Data Wrangler – web-based data cleaning tool

This looks pretty powerful, if not pretty full stop.

8. Impure – visual programming language

From the About page:

“Impure is a visual programming language aimed to gather, process and visualize information. With impure is possible to obtain information from very different sources; from user owned data to diverse feeds in internet, including social media data, real time or historical financial information, images, news, search queries and many more. Impure is a tool to be in touch with data around internet, to deeply understand it. Within a modular logic interface you can quickly link information to operators, controls and visualization methods, bringing all the power of the comprehension of information and knowledge to the not programmers that want to work with information in a professional way.”

9. Zanran – PDF/spreadsheet/table search engine

This looks a very useful tool for narrowing down searches to PDFs, spreadsheets, and tables within webpages (the advanced search allows further narrowing by filetype, date, server location and site). Clever stuff behind it – particularly in the way it looks at images and decides if they’re charts. The site says they plan to add Word documents and PowerPoint presentations soon.

SFTW: How to scrape webpages and ask questions with Google Docs and =importXML

Paul Bradshaw — Fri, 29 Jul 2011 20:52:32 +0000

Here’s another Something for the Weekend post. Last week I wrote a post on how to use the =importFeed formula in Google Docs spreadsheets to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns. Another formula which performs a similar function more powerfully is =importXML.

There are at least 2 distinct journalistic uses for =importXML:

You have found information that is only available in XML format and need to put it into a standard spreadsheet to interrogate it or combine it with other data.
You want to extract some information from a webpage – perhaps on a regular basis – and put that in a structured format (a spreadsheet) so you can more easily ask questions of it.

The first task is the easiest, so I’ll explain how to do that in this post. I’ll use a separate post to explain the latter.

Converting an XML feed into a table

If you have some information in XML format it helps if you have some understanding of how XML is structured. A backgrounder on how to understand XML is covered in this post explaining XML for journalists.

It also helps if you are using a browser which is good at displaying XML pages: Chrome, for example, not only staggers and indents different pieces of information, but also allows you to expand or collapse parts of that, and colours elements, values and attributes (which we’ll come on to below) differently.

Say, for example, you wanted a spreadsheet of UK council data, including latitude, longitude, CIPFA code, and so on – and you found the data, but it was in XML format at a page like this: http://openlylocal.com/councils/all.xml

To pull that into a neatly structured spreadsheet in Google Docs, type the following into the cell where you want the import to begin (try typing in cell A2, leaving the first row free for you to add column headers):

=ImportXML("http://openlylocal.com/councils/all.xml", "//council")

The formula (or, more accurately, function) needs two pieces of information, which are contained in the parentheses and separated by a comma: a web address (URL), and a query. Or, put another way:

=importXML("theURLinQuotationMarks", "theBitWithinTheURLthatYouWant")

The URL is relatively easy – it is the address of the XML file you are reading (it should end in .xml). The query needs some further explanation.

The query tells Google Docs which bit of the XML you want to pull out. It uses a language called XPath – but don’t worry, you will only need to note down a few queries for most purposes.

Here’s an example of part of that XML file shown in the Chrome browser:

The indentation and triangles indicate the way the data is structured. So, the top-level tag contains at least one item called <council> (if you scrolled down, or clicked on the triangle to collapse <council>, you would see there are a few hundred).

And each <council> contains an <address> and many other pieces of information.

If you wanted to grab every <council> from this XML file, then, you use the query "//council" as shown above. Think of the // as a replacement for the < in a tag – you are saying: ‘grab the contents of every item that begins <council’.

You’ll notice that in your spreadsheet where you have typed the formula above, it gathers the contents (called a value) of each tag within <council>, each tag’s value going into its own column – giving you dozens of columns.

You can continue this logic to look for tags within tags. For example, if you wanted to grab the value from within each <name> tag, you could use:

=ImportXML("http://openlylocal.com/councils/all.xml", "//council//name")

You would then only have one column, containing the names of all the councils – if that’s all you wanted. You could of course adapt the formula again in cell B2 to pull another piece of information. However, you may end up with a mismatch of data where that information is missing – so it’s always better to grab all the XML once, then clean it up on a copy.
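To make the mismatch problem concrete, here is a small Python sketch that reads council records one at a time, so a missing value leaves a blank in that row instead of shifting the rows below it. The sample XML is a made-up stand-in for the real file, and the ‘lat’ tag and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the councils XML; the second record lacks <lat>.
SAMPLE = """<councils>
  <council><name>Example Borough Council</name><lat>51.5</lat></council>
  <council><name>Sample District Council</name></council>
</councils>"""


def council_rows(xml_text, fields):
    """Return one row per <council>, with '' where a field is missing."""
    root = ET.fromstring(xml_text)
    rows = []
    for council in root.iter("council"):
        # findtext returns the default when the tag is absent.
        rows.append([council.findtext(field, default="") for field in fields])
    return rows


rows = council_rows(SAMPLE, ["name", "lat"])
```

Pulling ‘name’ and ‘lat’ as two separate lists would give two names but only one latitude, with nothing to say which council it belongs to; working record by record keeps them aligned.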

If the XML is more complex then you can ask more complex questions – which I’ll cover in the second part of this post. You can also put the URL and/or query in other cells to simplify matters, e.g.

=ImportXML(A1, B1)

Where cell A1 contains http://openlylocal.com/councils/all.xml and B1 contains //council (note the lack of quotation marks). You then only need to change the contents of A1 or B1 to change the results, rather than having to edit the formula directly.

If you’ve any other examples, ideas or corrections, let me know. Meanwhile, I’ve published an example spreadsheet demonstrating all the above techniques here.

SFTW: How to grab useful political data with the They Work For You API

Paul Bradshaw — Fri, 22 Jul 2011 21:04:15 +0000

It’s been over 2 years since I stopped doing the ‘Something for the Weekend’ series. I thought I would revive it with a tutorial on They Work For You and Google Refine…

If you want to add political context to a spreadsheet – say you need to know what political parties a list of constituencies voted for, or the MPs for those constituencies – the They Work For You API can save you hours of fiddling – if you know how to use it.

An API is – for the purposes of journalists – a way of asking questions for reams of data. For example, you can use an API to ask “What constituency is each of these postcodes in?” or “When did these politicians enter office?” or even “Can you show me an image of these people?”

The They Work For You API will give answers to a range of UK political questions on subjects including Lords, MLAs (Members of the Legislative Assembly in Northern Ireland), MPs, MSPs (Members of the Scottish Parliament), select committees, debates, written answers, statements and constituencies.

When you combine that API with Google Refine you can fill a whole spreadsheet with additional political data, allowing you to answer questions you might otherwise not be able to.

I’ve written before on how to use Google Refine to pull data into a spreadsheet from the Google Maps API and the UK Postcodes API, but this post takes things a bit further because the They Work For You API requires something called a ‘key’. This is quite common with APIs so knowing how to use them is – well – key. If you need extra help, try those tutorials first.

The They Work For You API key

Unlike the previous APIs I’ve written about, the They Work For You API requires you to register for a ‘key’ to use it. If you don’t understand how this works the instructions on the TWFY website can be a little confusing. So here’s how it works:

The key is a password of sorts, used when you ask the API a question.

As your ‘question’ takes the form of a web address (URL) then that key needs to be included at a particular part of that URL.

You’ll see how that works when we get to asking the URL questions. But first, go to http://www.theyworkforyou.com/api/key to get a key.

Got it? OK, now copy it into a text document – or just keep this window open. You’ll need to paste it later.

Using the TWFY key

The API has a number of pre-set questions, called ‘functions’. These are listed in the right hand column, and include getMPs, getLord, getDebates and so on. If you click on any of these you will be given information on how they work, and you can also test the function with the ‘Explorer’.

To demonstrate how to use these functions, click on getConstituency.

If you use the ‘Explorer’ to test it (in this case with ‘Edinburgh South’) you will be shown a bunch of results at a URL like this:

http://www.theyworkforyou.com/api/docs/getConstituency?name=edinburgh+south&postcode=&output=js#output

Now you could manually use the Explorer to get information for each of the cells in a spreadsheet, but it’s much, much quicker to use the API to automate the process instead.

On that front the Explorer can be a little misleading. Because although it shows you the information you might get from the API, this is not the URL that you will need.

The URL you really need is shown above the results, and below the word ‘Output’ like so:

http://www.theyworkforyou.com/api/getConstituency?name=edinburgh+south&output=js

If you copy and paste that URL into your browser you will get the following warning:

{
error: "No API key provided. Please see http://www.theyworkforyou.com/api/key for more information."
}

So now we need that key.

Using your key

Assuming you still have your API key copied somewhere, or still open in another window, you can find instructions on how to use it at http://www.theyworkforyou.com/api/

Here you are told to use the key as part of the following structure:

http://www.theyworkforyou.com/api/function?key=key&output=output&other_variables

The important bit is where it says key=key&

That is where you need to add your own key, so that that part of the URL looks something like

key=aTh0jklerJaHui7&

(where that random assortment of characters is your key, copied earlier, followed by the & sign)

Going back for a moment to the URL that wasn’t working without a key, we can see that it can be split into two parts:

http://www.theyworkforyou.com/api/getConstituency?

and

name=edinburgh+south&output=js

Adding in the key in the middle makes up a third part, like so:

http://www.theyworkforyou.com/api/getConstituency?

and

key=key&

and

name=edinburgh+south&output=js

So, you now need to edit the output URL to include your API key. It should then look something like this:

http://www.theyworkforyou.com/api/getConstituency?key=AHdajHUShajshaJ&name=edinburgh+south&output=js

UPDATE: Matthew Somerville points out that the key can be used anywhere after the ? so you can tag it on the end if that’s easier.
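To make the three-part structure concrete, here is a rough Python sketch that assembles the same URL from its pieces. The helper name build_api_url is made up for this illustration, and the key is the placeholder used above, not a working one.

```python
from urllib.parse import urlencode

BASE = "http://www.theyworkforyou.com/api/"

def build_api_url(function, key, **params):
    """Join the function name, the API key and any other parameters into one URL."""
    query = {"key": key, **params}
    # urlencode turns spaces into plus signs and joins the pairs with &
    return BASE + function + "?" + urlencode(query)

url = build_api_url("getConstituency", "AHdajHUShajshaJ",
                    name="edinburgh south", output="js")
print(url)
```

Because the parameters are just joined with & signs, the key could equally be placed last, as the update above notes.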

The URL broken down further

Just to clarify, these are the parts:

http://www.theyworkforyou.com/

(The website hosting the API)

api/

(The API)

getConstituency?

(The function – or question being asked)

key=AHdajHUShajshaJ

(Our API key – or password)

&name=edinburgh+south

(and the constituency name that we are asking the API for information on)

&output=js

(and the format we want the answer in – JSON, in this case)

You should now get a page of JSON code giving data for the question. If your browser doesn’t display it particularly well, try Chrome or Firefox.
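The same breakdown can be done in code. This is a small Python sketch, using only the standard library, that takes the example URL apart into the pieces listed above (the key is still a placeholder):

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://www.theyworkforyou.com/api/getConstituency"
       "?key=AHdajHUShajshaJ&name=edinburgh+south&output=js")

parts = urlsplit(url)

host = parts.netloc                        # the website hosting the API
function = parts.path.rsplit("/", 1)[-1]   # the function, or question being asked
# parse_qs returns a list per parameter; keep the first value of each
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(host, function, params)
```

Note that the plus sign in the name comes back out as a space once the URL is decoded.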

Using with Google Refine to get a bunch of results

Great. But we could get one result by using the ‘Explorer’, so why did we need to do all that? Because we can now use Google Refine to automate the process of asking the same question hundreds of times.

To demonstrate this, here’s a spreadsheet with 4 constituencies. Open it, and select File > Download as… > CSV

Open Google Refine (download here) and create a new project with that spreadsheet. Create a new column from the one you have by clicking on the arrow at the top of the column and selecting Edit Column > Add Column by fetching URLs

In the window that appears, adapt the following piece of Google Refine Expression Language (GREL), replacing the example key (the characters between key= and &name=) with your own API key:

"http://www.theyworkforyou.com/api/getConstituency?key=Gr7jUUlKdhB3fsihFnHzab&name="+value+"&output=js"

This generates a URL in each cell based on the value of the original column: the start and end of the URL are in quotation marks; the value is inserted in the middle where it says +value+

(NOTE: Avoid copying and pasting as quotation marks may cause you problems. Instead try typing it in yourself – this also helps you remember things.)

Give the column a name and click OK. It will now run – this test example only has 4 rows so you can see the results quickly.

You’ll see that only one row has actually worked – Tatton. The others have failed. Why? Because they have more than one word.

Take another look at that URL that the API returned earlier with the test of Edinburgh South:

http://www.theyworkforyou.com/api/getConstituency?key=AHdajHUShajshaJ&name=edinburgh+south&output=js

When a constituency has two words the space between them is represented by a plus sign – so we need to format our data in the same way for it to work.

Formatting data for the API

You could use Find and Replace in Excel to replace all spaces in that column with a plus sign but you will still hit problems with unusual constituency names. But this is how to do it in Google Refine:

Click on the arrow at the top of the constituency column and select Edit Column > Add column based on this column…

~~In the window that appears type the following GREL:~~

value.split(" ").join("+")

To explain:

‘value’ is the value in each cell.

‘.split(" ")’ splits each value where there is a space (" ").

~~‘.join(“+”) then joins the resulting items together, with a plus sign.~~

~~Give it a name and click OK. You’ll see a new column with plus signs replacing the spaces.~~ [see comment from Matthew Somerville for explanation]

Create a new column from the one you have by clicking on the arrow at the top of the column and selecting Edit Column > Add Column by fetching URLs

In the window that appears, adapt the following piece of Google Refine Expression Language (GREL), replacing the example key (the characters between &key= and &output=js) with your own API key:

"http://www.theyworkforyou.com/api/getConstituency?name=" + escape(value, "url") + "&key=Gr7jUUlKdhB3fsihFnHzab&output=js"

The key part here is between the + signs. Whereas before we simply inserted the value of each cell, here we escape that value at the same time so that it will work in a URL.

This will change Edinburgh South to “Edinburgh+South” but also Normanton, Pontefract and Castleford to “Normanton%2C+Pontefract+and+Castleford” and any other unforeseen characters in similar ways.
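To see why escaping beats simply swapping spaces for plus signs, here is a short Python comparison. Python's quote_plus stands in for GREL's escape(value, "url") here; that the two behave alike for these names is an assumption based on the results described above, not something taken from the Refine documentation.

```python
from urllib.parse import quote_plus

constituencies = ["Tatton", "Edinburgh South", "Normanton, Pontefract and Castleford"]

results = {}
for name in constituencies:
    # The split-and-join approach: only spaces are changed
    naive = "+".join(name.split(" "))
    # Proper URL escaping: spaces, commas and other awkward characters are all handled
    escaped = quote_plus(name)
    results[name] = (naive, escaped)
    print(name, "|", naive, "|", escaped)
```

The two approaches agree on simple names, but only the escaped version deals with the comma in the third one.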

Give this new column a name, click OK and watch your new column populate itself with the JSON from each URL.

Creating new columns from the JSON

Now we can populate new columns with data taken from that JSON as follows:

Click on the arrow at the top of the new JSON column and select Edit Column > Add column based on this column…

Type this GREL:

value.parseJson().bbc_constituency_id

(This looks in the JSON in each cell and pulls out the value that follows bbc_constituency_id.) Then click OK.

Repeat the process for further columns as follows:

value.parseJson().guardian_election_results

value.parseJson().pa_id

value.parseJson().guardian_id
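For comparison, this is roughly what those parseJson() expressions are doing, sketched in Python. The field names are the ones used above; the values in the sample cell are invented for the example and do not come from the API.

```python
import json

# Hypothetical cell contents: real field names, made-up values
cell = ('{"bbc_constituency_id": "123", '
        '"guardian_election_results": "http://example.com/results", '
        '"pa_id": "456", "guardian_id": "789"}')

record = json.loads(cell)  # the equivalent of value.parseJson()

fields = ["bbc_constituency_id", "guardian_election_results", "pa_id", "guardian_id"]
# One new "column" per field; .get() leaves a gap (None) if a field is missing
new_columns = {field: record.get(field) for field in fields}

print(new_columns)
```

Using .get() rather than a direct lookup means a row with a missing field produces an empty value instead of an error, which is closer to how a blank cell would look in Refine.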

Going further

That’s just a demonstration of how to use a small part of the They Work For You API – there are lots of other functions that you can use to get other information. Have a play with those.

Meanwhile, what about those IDs? Well, the Guardian ID will allow you to play with The Guardian’s API – which gives lots more information on each constituency. For an example see http://www.guardian.co.uk/politics/api/constituency/664/json

Based on that URL you can repeat the process above to grab more data.
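As a sketch of that next step, the Guardian URLs can be generated from a column of IDs in the same way as before. The pattern below is copied from the example URL above; the ID 665 is made up to show a second row.

```python
GUARDIAN_URL = "http://www.guardian.co.uk/politics/api/constituency/{guardian_id}/json"

def guardian_url(guardian_id):
    """Slot a Guardian constituency ID into the URL pattern from the example."""
    return GUARDIAN_URL.format(guardian_id=guardian_id)

# 664 is the ID from the example URL; 665 is a hypothetical second row
urls = [guardian_url(gid) for gid in ["664", "665"]]
print(urls)
```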

Is this useful? Anything you can add? Or other data problems?

Quote Twitter conversations with QuoteURL (Something for the Weekend #15)

Paul Bradshaw — Thu, 19 Mar 2009 21:41:28 +0000

Following on from the previous Something for the Weekend, Twickie, which allows you to collect responses to a question posted on Twitter, this tool allows you to present a conversation – with impressive control.

QuoteURL allows you to drag and drop (or copy and paste) Twitter tweet URLs to reconstruct a conversation.

Here’s one I prepared earlier:

dirkthecow great article on the “daily me” or how the Internet is narrowing our viewpoints, from Nicholas Kristof in the NYT http://bit.ly/lsH3 19 Mar 2009 from TweetDeck

paulbradshaw @dirkthecow I disagree. He has no evidence the Internet will make that worse. 19 Mar 2009 from m.slandr.net

dirkthecow @paulbradshaw agree he gives no evidence,but my own personal experience mirrors what he says,I follow and read stuff from people ‘like me’ 19 Mar 2009 from TweetDeck in reply to paulbradshaw

paulbradshaw @dirkthecow but we do that offline too -he’s suggesting the net makes it worse. Evidence doesn’t back that up. 19 Mar 2009 from Tweetie in reply to dirkthecow

dirkthecow @paulbradshaw but isn’t it so that when we all watched the evening news we were forced to hear different opinions, now we filter them more? 19 Mar 2009 from Tweetie in reply to paulbradshaw

paulbradshaw @dirkthecow one stat sticks out in my mind: 46% of people come across news stories while searching for something else 19 Mar 2009 from Tweetie in reply to dirkthecow

dirkthecow @paulbradshaw another interesting piece is this one on ‘homophily’ by Oliver Burkeman in The Guardian http://bit.ly/5qPXJ 19 Mar 2009 from TweetDeck in reply to paulbradshaw

— this quote was brought to you by quoteurl

The tool allows you to include up to 4 tweets without registering, 10 if you have registered, and more if you pay for an account (not available at the moment).

There are a few weaknesses to the service – when I tried it you couldn’t actually see any more than 4 tweets – although they were clearly being stored: you have to work ‘blind’ so to speak.

And it seems you cannot have more than 1 consecutive tweet by the same person.

What is nice is that you can have more than 2 people involved in the conversation, and tweets seem to be arranged chronologically, so you can drag them in in any order.

If you have a play with it, let me know how you get on – or any uses you can see.

Something for the weekend – Online Journalism Blog

SFTW: Scraping data with Google Refine

Assembling the ingredients

Bringing your data into Google Refine

Grabbing the HTML for each page

Extracting data from the raw HTML with parseHTML

Scraping data from a list of webpages using Google Docs

Some basic principles

Scraping with Google Docs using =importXML – a case study

Finding the structure in the HTML

Scraping data from each link in a list

Troubleshooting with =IF

A hack, but an instructive one

How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)

Example question: “How many Swedish party leaders are mentioned in the cables?”

Testing with the API

Formatting your data

Grabbing from the API

Extracting from the JSON

SFTW: 9 data journalism tools

1. Junar – for scraping websites and sharing data

2. BuzzData – for sharing data

3. DataMarket – for finding data

4. Google News Scraper – for grabbing data on news coverage

5. Metadata extraction tool – a first step for searching document dumps?

6. Roambi – data visualisation on your iPhone

7. Data Wrangler – web-based data cleaning tool

8. Impure – visual programming language

9. Zanran – PDF/spreadsheet/table search engine

SFTW: How to scrape webpages and ask questions with Google Docs and =importXML

Converting an XML feed into a table

SFTW: How to grab useful political data with the They Work For You API

The They Work For You API key

Using the TWFY key

Using your key

The URL broken down further

Using with Google Refine to get a bunch of results

Formatting data for the API

Creating new columns from the JSON

Going further

Quote Twitter conversations with QuoteURL (Something for the Weekend #15)