Category Archives: data journalism

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpages: they have some sort of structure in their URLs, where part of the URL refers to the name or code of an entity:

  1. http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521
  2. http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629
  3. http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.
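
In other words, each address is just the base URL with a school ID stuck on the end. As a minimal sketch in Python (using the three example IDs above):

    # Build the full URLs by appending each school ID to the base URL.
    BASE = "http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID="

    school_ids = ["5237521", "5237629", "5237823"]  # the example IDs above

    for school_id in school_ids:
        print(BASE + school_id)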

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).
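
To give a flavour of what that coding route involves, here is a rough sketch of that kind of scraper in Python – not the Scraperwiki demo itself, but the same idea using the requests and BeautifulSoup libraries, and assuming the figures sit in a <table class="destinations"> (the tag we’ll be looking for later in this post):

    # A rough sketch of a scraper for these pages (not the Scraperwiki demo itself).
    # Assumes each page holds its figures in a <table class="destinations">.
    import requests
    from bs4 import BeautifulSoup

    BASE = "http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID="

    def scrape_school(school_id):
        html = requests.get(BASE + school_id).text
        soup = BeautifulSoup(html, "html.parser")
        table = soup.select_one("table.destinations")
        if table is None:
            return None
        # Return the text of every cell in the table, row by row.
        return [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")]

    print(scrape_school("5237521"))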

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need, which will reduce the time the scraper takes to work through them. For example, if you’re only interested in one local authority, or one type of school, sort the spreadsheet so that those rows sit together, then delete the rest.

Now to combine the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

"http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=" + value

(Type the quotation marks yourself – if you copy them from a webpage you may end up with curly ‘smart’ quotes, which Refine won’t accept.)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0’.) UPDATE: in the comments Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”
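
If it helps to see why the substring fix works, here is the same idea sketched in Python (only an analogy – in Refine itself the fix is the expression above, or Thad’s import setting):

    # An analogy for the ".0" problem: an ID parsed as a number gains a trailing ".0"
    # when turned back into text, and taking the first seven characters recovers it.
    school_id = 5237521                 # parsed as a number rather than text
    as_text = str(float(school_id))     # "5237521.0" – this would ruin the URL
    print(as_text[:7])                  # "5237521" – the first seven characters only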

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple of thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.
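
For a sense of what Refine is doing during this step – and why it takes so long – here is a rough Python equivalent, assuming a list of URLs and a short pause between requests to be polite to the server. It works through the pages one at a time, which is why a couple of thousand rows adds up to hours:

    # A rough sketch of the fetch step: one request per URL, in sequence, with a pause.
    import time
    import requests

    def fetch_all(urls, delay_seconds=2):
        pages = {}
        for url in urls:
            pages[url] = requests.get(url).text  # the raw HTML, like Refine's new column
            time.sleep(delay_seconds)            # wait between requests
        return pages

At two seconds a page, 2,000 pages is already over an hour of waiting before you add any download time.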

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, open one of the webpages containing the data, view its source (right-click and select View Source or similar), and search for a key phrase or figure that you want to extract. Around that data you want to find an HTML tag like <table class="destinations"> or <div id="statistics">. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select("table.destinations")[0].select("tr").toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of "destinations" (in the source HTML this reads <table class="destinations">). If it was <div id="statistics"> then you would write .select("div#statistics") – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and more manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).
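
If you do go down the export route, here is a minimal sketch of that clean-up in Python rather than in a spreadsheet, using the example fragment above: it strips the tags and splits each row into separate values.

    # A minimal sketch: turn the extracted <tr>...</tr> fragment into rows of plain values.
    from bs4 import BeautifulSoup

    fragment = ("<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th>"
                " <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th>"
                " <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>")

    soup = BeautifulSoup(fragment, "html.parser")
    for row in soup.find_all("tr"):
        print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])
    # ['', 'Abbotswell School', 'Aberdeen City', 'Scotland']
    # ['Percentage of pupils', '25.5%', '16.3%', '22.6%']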

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select("table.destinations")[0].select("td")[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long – here, characters 4 to 8 of <td>25.5%</td>):

value.parseHtml().select("table.destinations")[0].select("td")[0].toString().substring(4,9)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

More on how this data was used on Help Me Investigate Education.

The test of data journalism: checking the claims of lobbyists via government

Day 341 - Pull The Wool Over My Eyes - image by Simon James

While the public image of data journalism tends to revolve around big data dumps and headline-grabbing leaks, there is a more important day-to-day application of data skills: scrutinising the claims regularly made in support of spending public money.

I’m blogging about this now because I recently came across a particularly good illustration of politicians being dazzled by numbers from lobbyists (that journalists should be checking) in this article by Simon Jenkins, from which I’ll quote at length:

“This government, so draconian towards spending in public, is proving as casual towards dodgy money in private as were Tony Blair and Gordon Brown. Earlier this month the Olympics boss, Lord Coe, moseyed into Downing Street and said that his opening and closing ceremonies were looking a bit mean at £40m. Could he double it to £81m for more tinsel? Rather than scream and kick him downstairs, David Cameron said: my dear chap, but of course. I wonder what the prime minister would have said if his lordship had been asking for a care home, a library or a clinic.

“Much of the trouble comes down to the inexperience of ingenue ministers, and their susceptibility to the pestilence of lobbying now infecting Westminster. On this occasion the hapless Olympics minister, Hugh Robertson, claimed that the extra £41m was “worth £2-5bn in advertising revenue alone”, a rate of return so fanciful as to suggest a lobbyist’s lunch beyond all imagining. Robertson also claimed to need another £271m for games security (not to mention 10,000 troops, warships and surface-to-air missiles), despite it being “not in response to any specific security threat”. It was just money.

“This was merely the climax of naivety. In their first month in office, ministers were told – and believed – that it would be “more expensive” to cancel two new aircraft carriers than to build them. Ministers were told it would cost £2bn to cancel Labour’s crazy NHS computer rather than dump it in the nearest skip. Chris Huhne, darling of the renewables industry, wants to give it £8bn a year to rescue the planet, one of the quickest ways of transferring money from poor consumer to rich landowner yet found. The chancellor, George Osborne, was told by lobbyists he could save £3bn a year by giving away commercial planning permissions. All this was statistical rubbish.

“If local government behaved as credulously as Whitehall it would be summoned before the audit commission and subject to surcharge.”

And if you want to keep an eye on such claims, try a Google News search like this one.

20 free ebooks on journalism (for your Xmas Kindle)

For some reason there are two versions of this post on the site – please check the more up-to-date version here.

2011: the UK hyper-local year in review

In this guest post, Damian Radcliffe highlights some topline developments in the hyper-local space during 2011. He also asks for your suggestions of great hyper-local content from 2011. His more detailed slides looking at the previous year are cross-posted at the bottom of this article.

2011 was a busy year across the hyper-local sphere, with a flurry of activity online as well as on more traditional platforms such as TV, radio and newspapers.

The Government’s plans for local TV have developed considerably, following the Shott Review just over a year ago. We now have a clearer indication of the areas which will be first on the list for these new services and how Ofcom might award these licences. What we don’t know is who will apply for these licences, or what their business models will be. But this should become clear in the second half of the year.

Whilst the Leveson Inquiry hasn’t directly been looking at local media, it has been a part of the debate. Claire Enders outlined some of the challenges facing the regional and local press in a presentation showing declining revenue, jobs and advertising over the past five years. Her research suggests that the impact of “the move to digital” has been greater at a local level than at the nationals.

Across the board, funding remains a challenge for many. But new models are emerging, with Daily Deals starting to form part of the revenue mix alongside money from foundations and franchising.

And on the content front, we saw Jeremy Hunt cite a number of hyper-local examples at the Oxford Media Convention, as well as record coverage for regional press and many hyper-local outlets as a result of the summer riots.

I’ve included more on all of these stories in my personal retrospective for the past year.

One area where I’d really welcome feedback is examples of hyper-local content you produced – or read – in 2011. I’m conscious that a lot of great material may not necessarily reach a wider audience, so do post your suggestions below and hopefully we can begin to redress that.

2 guest posts: 2012 predictions and “Social media and the evolution of the fourth estate”

Memeburn logo

I’ve written a couple of guest posts for Nieman Journalism Lab and the tech news site Memeburn. The Nieman post is part of a series looking forward to 2012. I’m never a fan of futurology so I’ve cheated a little and talked about developments already in progress: new interface conventions in news websites; the rise of collaboration; and the skilling up of journalists in data.

Memeburn asked me a few months ago to write about social media’s impact on journalism’s role as the Fourth Estate, and it took me until this month to find the time to do so. Here’s the salient passage:

“But the power of the former audience is a power that needs to be held to account too, and the rise of liveblogging is teaching reporters how to do that: reacting not just to events on the ground, but the reporting of those events by the people taking part: demonstrators and police, parents and politicians all publishing their own version of events — leaving journalists to go beyond documenting what is happening, and instead confirming or debunking the rumours surrounding that.

“So the role of journalist is moving away from that of gatekeeper and — as Axel Bruns argues — towards that of gatewatcher: amplifying the voices that need to be heard, factchecking the MPs whose blogs are 70% fiction or the Facebook users scaremongering about paedophiles.

“But while we are still adapting to this power shift, we should also recognise that that power is still being fiercely fought-over. Old laws are being used in new ways; new laws are being proposed to reaffirm previous relationships. Some of these may benefit journalists — but ultimately not journalism, nor its fourth estate role. The journalists most keenly aware of this — Heather Brooke in her pursuit of freedom of information; Charles Arthur in his campaign to ‘Free Our Data’ — recognise that journalists’ biggest role as part of the fourth estate may well be to ensure that everyone has access to information that is of public interest, that we are free to discuss it and what it means, and that — in the words of Eric S. Raymond — “Given enough eyeballs, all bugs are shallow”.”

Comments, as always, very welcome.

Tools or Tales?

Christmas gifts image by Michael Wyszomierski

This month’s Carnival of Journalism asks what journalists want for Christmas from programmers, and vice versa. Here’s my take.

Programmers and developers have already given journalists enough presents to last a century of Christmases. Programmers created content management systems and blogging platforms; they wrapped up networks of contacts in social networks, and parcelled up fast-moving updates on Twitter and SMS. They tied media in ribbons of metadata, making it easier to verify. They digitised content, making it possible to mix it with other content.

But I think it’s time for journalists to start giving back.

All of these gifts have made it easier for journalists to report stories. But that’s only part of publishing.

Technology’s place in journalism

Traditionally, journalism’s technology came after the story: sub-editors or designers laid the story out in the way they judged to be the most effective; printers gave it physical form; and distributors made sure it reached people.

Each stage in that process considers the next person. The inverted pyramid, for example, helps subs trim copy to fit available space. Subs talk to printers. Printers work with distributors. Processes are designed to reduce friction. The journalist’s work – whether they realise it or not – is a compromise reached over decades between different parties. An exchange of gifts, if you like.

But when it comes to publishing online, there’s been very little Christmas spirit.

Stories as a vehicle

Stories help us connect with current issues; they act as a vehicle for information that allows us to participate in society, whether that’s politically, socially, or economically.

The job of a journalist is to find stories in current events.

But those stories do not have to be told in one particular way. And if we were to try to tell them in some different ways (adding important metadata; publishing raw data; linking to supporting material; flagging false information), we could be giving a gift much desired by developers.

Here are some things that they could do with that gift – it is, if you like, my own fantasy Christmas list:

They’re just ideas – and will remain so as long as journalists assume they’re only writing for newspapers, and newspaper readers.

The newspaper is a tool: a way for groups of people to exchange information. In the 19th century those groups might have been political activists, or merchants who needed to know the latest trading conditions.

The web is a tool too – a different tool. We can use it to ask information to come to us, or to seek out supplementary information; we can use it to draw connections; and we can act on what we find in the same space. Stories need to adapt to the possibilities of the new tool they sit in.

This year, put a developer on your Christmas list. It’s the gift that keeps on giving.

New UK open data moves: following the money and other curiosities

Tim Davies has done a wonderful job of combing through the fine print of the UK government’s Autumn statement open data measures (PDF), highlighting the dynamics that appear to be driving it, and the data conspicuous by its absence.

Here are the passages most relevant for journalists. Firstly, following the money and accountability:

“The [Data Strategy Board] body seeking public data will be reliant upon the profitability of the PDG [Public Data Group] in order to have the funding it needs to secure the release of data that, if properly released in free forms, would likely undermine the current trading revenue model of the PDG. That doesn’t look like the foundation for very independent and effective governance or regulation to open up core reference data!

“Furthermore, whilst the proposed terms for the DSB [Data Strategy Board] terms state that “Data users from outside the public sector, including representatives of commercial re-users and the Open Data community, will represent at least 30% of the members of DSB”, there are also challenges ahead to ensure data users from civil society interests are represented on the board”

Secondly, the emphasis on clinical data and issues surrounding privacy and the sale of personal data:

“The first measures in the Cabinet Office’s paper are explicitly not about open data as public data, but are about the restricted sharing of personal medical records with life-science research firms – with the intent of developing this sector of the economy. With a small nod to “identifying specified datasets for open publication and linkage”, the proposals are more centrally concerned with supporting the development of a Clinical Practice Research Datalink (CPRD) which will contain interlinked ‘unidentifiable, individual level’ health records, by which I interpret the ability to identify a particular individual with some set of data points recorded on them in primary and secondary care data, without the identity of the person being revealed.

“The place of this in open data measures raises a number of questions, such as whether the right constituencies have been consulted on these measures and why such a significant shift in how the NHS may be handing citizens personal data is included in proposals unlikely to be heavily scrutinised by patient groups? In the past, open data policies have been very clear that ‘personal data’ is out of scope – and the confusion here raises risks to public confidence in the open data agenda. Leaving this issue aside for the moment, we also need to critically explore the evidence that the release of detailed health data will “reinforce the UK’s position as a global centre for research and analytics and boost UK life sciences”. In theory, if life science data is released digitally and online, then the firms that can exploit it are not only UK firms – but the return on the release of UK citizens personal data could be gained anywhere in the world where the research skills to work with it exist.”

UPDATE: More on that in The Guardian.

Thirdly, it looks like this data will allow journalists to scrutinise welfare and credit (so plenty of material for the tabloids and mid-market press), but not data that scrutinises corporations or governments:

“When we look at the other administrative datasets proposed for release in the Measures the politicisation of open data release is evident: Fit Note Data; Universal Credit Data; and Welfare Data (again discussed for ‘linking’ implying we’re not just talking about aggregate statistics) are all proposed for increased release, with specific proposals to “increase their value to industry”. By contrast, no mention of releasing more details on the tax share paid by corporations, where the UK issues arms export licenses, or which organisations are responsible for the most employment law violations. Although the stated aims of the Measures include increasing “transparency and accountability” it would not be unreasonable to read the detail of the measures as very one-sided on this point: and emphasising industry exploitation of data far more than good governance and citizen rights with respect to data.

“The blurring of the line between ‘personal data’ and ‘open data’, and the state’s assumption of the right to share personal data for industrial gain should give cause for concern, and highlights the need to build a stronger constituency scrutinising government open data action.”

It’s nice to see a data initiative being greeted with a critical eye rather than Three Cheers for the Numbers.

UPDATE: On a similar note, Access Info Europe highlights problems with the Open Government Partnership, which “must significantly improve its internal access to information policy to meet the standards it is advancing”. Specifically:

“The policy should be reformed to incorporate basic open data principles such as that information will be made available in a machine-readable, electronic format free of restrictions on reuse.”

“A key problem is the lack of detail in the policy, which has the result of leaving important matters to the discretion of the OGP. Other key problems include:
» The failure of the policy to recognise the fundamental human right to information;
» The significantly overbroad and discretionary regime of exceptions;
» The failure of the draft Policy to put in place a system of protections and sanctions.”

Maps “in the public interest” now exempt from Google Maps API charge

If you thought you couldn’t use the Google Maps API any more as a journalist, this update to the Google Geo Developers Blog should make you reconsider. From Nieman Journalism Lab:

“Certain web apps will be given blanket exemptions from charging. Here’s Google: “Maps API applications developed by non-profit organisations, applications deemed by Google to be in the public interest, and applications based in countries where we do not support Google Checkout transactions or offer Maps API Premier are exempt from these usage limits.” So nonprofit news orgs look to be in the clear, and Google could declare other news org maps apps to be “in the public interest” and free to run. (It also notes that nonprofits could be eligible for a free Maps API Premier license, which comes with extra goodies around advertising and more.)”