Liberating Data from the Guardian… Has it Really Come to This?

When the data is the story, should a news organisation make it available? When the Telegraph started trawling through MPs’ expenses data it had bought from a source, industry commentators started asking questions around whether it was the Telegraph’s duty to release that data (e.g. Has Telegraph failed by keeping expenses process and data to itself?).

Today, the Guardian released its University guide 2011: University league table, as a table:

Guardian university tables, sort of

Yes, this is data, sort of (though the javascript applied to the table means that it’s hard to just select and copy the data from the page – unless you turn javascript off, of course:

Data grab

but it’s not like the data that the Guardian are republishing it in their datastore, as they did with these league tables…:

Guardian datastore

…which was actually a republication of data from the THES… 😉

I’ve been wondering for some time when this sort of apparent duplicity was going to occur… the Guardian datastore has been doing a great job of making data available (as evidenced by its award from the Royal Statistical Society last week, which noted: “there was commendable openness with data, providing it in easily accessible ways”) but when the data is “commercially valuable” data to the Guardian, presumably in terms of being able to attract eyeballs to Guardian Education web pages, there seems to be some delay in getting the data onto the datastore… (at least, it isn’t there yet/wasn’t published contemporaneously the original story…)

I have to admit I’m a bit wary about writing this post – I don’t want to throw any spanners in the works as far as harming the work being done by the Datastore team – but I can’t not…

So what do we learn from this about the economics of data in a news environment?

– data has creation costs;
– there may be a return to be had from maintaining limited, priviliged or exclusive access to the data as data OR as information, where information is interpreted, contextualised or visualised data, or is valuable in the short term (as for example, in the case of financial news). By withholding access to data, publishers maintain the ability to generate views or analysis of the data that they can create stories, or attractive website content, around. (Just by the by, I noticed that an interactive Many Eyes widget was embedded in a Guardian Datablog post today🙂
– if you’ve incurred the creation cost, maybe you have a right to a limited period of exclusivity with respect to profiting from that content. This is what intellectual property rights try to guarantee, at least until the Mickey Mouse lawyers get upset about losing their exclusive right to profit from the content.

I think (I think) what the Guardian doing is not so different to what the Telegraph did. A cost was incurred, and now there is a (hopefully limited) period in which some sort of return is attempting to be generated. But there’s a problem, I think, with the way it looks, especially given the way the Guardian has been championing open data access. Maybe the data should have been posted to the datablog, but with access permissions denied until a stated date, so that at least people could see the data was going to be made available.

What this has also thrown up, for me at least, is the question as to what sort of “contract” the datablog might have, implied or otherwise, with third parties who develop visualisations based on data in the Guardian Datastore, particularly if those visualisations are embeddable and capable of generating traffic (i.e. eyeballs, = ad impressions, = income…).

It also gets me wondering; does there need to be a separate datastore? Or is the ideal case where the stories themselves are linking out to datasets directly? (I suppose that would make it hard to locate the data? On second thoughts, the directory datastore approach is much better…)

Related: Time for Or a local

PS I toyed with the idea of republishing all the data from the Guardian Education pages in a spreadsheet somewhere, and then taking my chances with the lawyers in the court of public opinion, but instead, here’s a howto:

Scraping data from the Grauniad

So just create a Google spreadsheet (you don’t even need an account: just go to, double click on cell A1 and enter:


and then you’ll be presented with the data, in a handy spreadsheet form, from:

For the subject pages – e.g. Agriculture, Forestry and Food, paste in something like:


You can probably see the pattern… 😉

(You might want to select all the previously filled cells and clear them first so you don’t get the data sets messed up. If you’ve got your own spreadsheet, you could always create a new sheet for each table. (It is also possible to automate the scraping of all the tables using Google Apps script: Screenscraping With Google Spreadsheets App Script and the =importHTML Formula gives an example how…))

An alternative route to the data is via YQL:

Scraping HTML table data in YQL

Enjoy…;-) And if you do grab the data and produce some interesting visualisations, feel free to post a link back here… 😉 To give you some ideas, here are a few examples of education data related visualisations I’ve played around with previously.

PPS it’ll be interested to see if this post gets picked up by the Datablog, or popped into the Guardian Technology newsbucket… 😉 Heh heh…

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.