Category Archives: data journalism

Tell the government what you want from the Public Data Corporation

Public Data Corporation consultation

If you're excited about the prospect of open data but frustrated by its execution (or just one of those people who complain that data doesn't change anything), the government are inviting comments on what shape the Public Data Corporation should take.

It’s a refreshingly simple execution: a WordPress blog with each question as a separate blog post – presumably it cost a lot less than £300,000. But of course the questions are theirs, and they are:

1.      Which public sector datasets do you currently make use of?

2.      How easy is it to find out what datasets are held by public sector organisations?

3.      How do you, or would you, decide whether a dataset has value for you or for your organisation? What affects how valuable they are, for example timeliness, granularity, format?

4.      Which datasets are of most value to you or your organisation? Why?

5.      What methods of access to datasets would most benefit you or your organisation?

6.      What gets in the way of you or your organisation accessing datasets or data products?

7.      What are the most exciting applications of datasets or data products you are aware of – here or internationally? We are, again, particularly interested in the following areas: registration activities, environmental science, critical infrastructure and the built environment.

8.      Are there any datasets or products you’d like to see generated? How would you or your organisation use them, and what social or economic benefits do you think they would deliver?

9.      From your perspective, what would success look like for the Public Data Corporation?

10.  Have we got the name for this organisation right?  Do you have any suggestions on naming that might better convey our aims?

It’s a shame that there isn’t any space for more open discussion – and that so many of the questions resemble market research. But still, the more journalists who pile in, the more justifiably we can moan later. So go ahead.

Post your responses here.

3 things that BBC Online has given to online journalism

It’s now 3 weeks since the BBC announced 360 online staff were to lose their jobs as part of a 25% cut to the online budget. It’s a sad but unsurprising part of a number of cuts which John Naughton summarises as: “It’s not television”, a sign that “The past has won” in the internal battle between those who saw consumers as passive vessels for TV content, and those who credited them with some creativity.

Dee Harvey likewise poses the question: “In the same way that openness is written into the design of the Internet, could it be that closedness is written into the very concept of the BBC?”

If it is, I don’t think it can remain that way for ever. Those who have been part of the BBC’s work online will feel rightly proud of what has been achieved since the corporation went online in 1997. Here are just 3 ways that the corporation has helped to define online journalism as we know it – please add others that spring to mind:

1. Web writing style

The BBC’s way of writing for the web has always been a template for good web writing, not least because of the BBC’s experience with having to meet similar challenges with Ceefax – the two shared a content management system and journalists writing for the website would see the first few pars of their content cross-published on Ceefax too.

Even now it is difficult to find an online publisher who writes better for the web.

2. Editors’ blogs

Thanks to the likes of Robin Hamman, Martin Belam, Jem Stone and Tom Coates – to name just a few – when the BBC did begin to adopt blogs (it was not an early adopter) it did so with a spirit that other news organisations lacked.

In particular, the Editors’ Blogs demonstrated a desire for transparency that many other news organisations have yet to repeat, while the likes of Robert Peston, Kevin Anderson and Rory Cellan-Jones have played a key role in showing skeptical journalists how engaging with the former audience on blogs can form a key part of the newsgathering process.

Unfortunately, many of those innovators later left the BBC, and the earlier experimentation was replaced with due process.

3. Backstage

While so many sing and dance about the APIs of The Guardian and The New York Times, Ian Forrester’s BBC Backstage project was well ahead of the game when it opened up the corporation’s API and started hosting hack days and meetups way back in 2005.

Backstage closed at the end of last year, just as the rest of the UK’s media were starting to catch up. You can read an e-book on its history here.

What else?

I’m sure you can add others – the iPlayer and their on-demand team; Special Reports; the UGC hub (the biggest in the world as far as I know); and even their continually evolving approach to linking (still not ideal, but at least they think about it) are just some that spring to mind. What parts of BBC Online have influenced or inspired you?

3 new resources for data journalists

There have been a raft of new sites for data launched in the past couple of months which I haven’t had time to blog about, so here’s a quick round-up:

  • Tim Davies’ Open Data Cookbook aims to collect “step by step recipes for practical ways to use open data” – a useful complement to GetTheData. The recipes are currently aimed at the more technically minded but you know what to do to address that…
  • Is It Open Data? aims to “make it easy for people to make enquiries of data holders, about the openness of the data they hold — and to record publicly the results of those efforts.”
  • And for those wishing to publish open data, The Open Data Manual provides information on what open data is, why you should publish open data, and how to do it. If you come up against an organisation that does not know how to publish their data in an open format, or needs convincing of why they should do so, this is a good place to point them to (or learn the arguments from).

If you’ve seen any other useful resources of late, please post a link in the comments.

Why journalists should be lobbying over police.uk’s crime data

UK police crime maps

Conrad Quilty-Harper writes about the new crime data from the UK police force – and in the process adds another straw to the groaning camel’s back of the government’s so-called transparency agenda:

“It’s useless to residents wanting to find out what was going on at the house around the corner at 3am last night, and it’s useless to individuals who want to build mobile phone applications on top of the data (perhaps to get a chunk of that £6 billion industry open data is supposed to create).

“The site’s limitations are as follows:

  • No IDs for crimes: what if I want to check whether real life crimes have made it onto the map? Sorry.
  • Six crime categories: including “other crimes”, everything from drug dealing to bank robberies in one handy, impossible to understand category.
  • No live data: you mean I have to wait until the end of the next month to see this month’s criminality?!
  • No dates or times: funny how without dates and times I can’t tell which police manager was in charge.
  • Case status: the police know how many crimes go solved or unsolved, why not tell us this?”

This is why people are so concerned about the Public Data Corporation. This is why we need to be monitoring exactly what spending data councils release, and in what format. And this is why we need to continue to press for the expansion of FOI laws. This is what we should be doing. Are we?

UPDATE: Will Perrin has FOI’d all correspondence relating to ICO advice on the crime maps. Jonathan Raper has a list of further flaws including:

  • Some data such as sexual offences and murder is removed – even though it would be easy to discover and locate from other police reports.
  • Data covers reported crimes rather than convictions, so some of it may turn out not to be crime.
  • The levels of policing are not provided, so that two areas with the “same” crime levels may in fact have “radically different” experiences of crime and policing.

Charles Arthur notes that: “Police forces have indicated that whenever a new set of data is uploaded – probably each month – the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it.”
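Charles Arthur’s point suggests an obvious mitigation: grab each monthly release as it appears and keep it under a dated filename, so comparisons remain possible once the site replaces the data. Here’s a minimal sketch – the filename scheme is my own invention, and it assumes you’ve already fetched the month’s release as raw CSV bytes (police.uk offered no bulk download at the time, so how you obtain the bytes is left open):

```python
from pathlib import Path

def archive_snapshot(csv_bytes: bytes, month: str, archive_dir: str = "crime-archive") -> Path:
    """Save one month's crime data under a dated filename so the snapshot
    survives after the source site removes it from public view.

    `month` is expected in YYYY-MM form (a hypothetical convention).
    """
    target = Path(archive_dir)
    target.mkdir(parents=True, exist_ok=True)
    dest = target / f"crime-{month}.csv"
    if not dest.exists():  # never overwrite an earlier snapshot
        dest.write_bytes(csv_bytes)
    return dest
```

Run monthly (by hand or from a scheduler) this builds exactly the archive Arthur says “outside developers” would need.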

Louise Kidney says:

“What we’ve actually got with http://www.police.uk is neither one nor the other. Ruth looks like a crime overlord cos of all the crimes happening in her garden and we haven’t got exact point data, but we haven’t got first part of postcode data either e.g. BB5 crimes or NW1 crimes. Instead, we’ve got this weird halfway house thing where it’s not accurate, but its inaccuracy almost renders it useless because we don’t have any idea if every force uses the same parameters when picking these points, we don’t know how they pick their points, we don’t know what we don’t know in terms of whether one house in particular is causing a considerable issue with anti-social behaviour for example, allowing me to go to my local Council and demand they do something about it.”

Adrian Short argues that “What we’re looking at here isn’t a value-neutral scientific exercise in helping people to live their daily lives a little more easily, it’s an explicitly political attempt to shape the terms of a debate around the most fundamental changes in British policing in our lifetimes.”

He adds:

“It’s derived data that’s already been classified, rounded and lumped together in various ways, with a bit of location anonymising thrown in for good measure. I haven’t had a detailed look at it yet but I would caution against trying to use it for anything serious. A whole set of decisions have already transformed the raw source data (individual crime reports) into this derived dataset and you can’t undo them. You’ll just have to work within those decisions and stay extremely conscious that everything you produce with it will be prefixed, “as far as we can tell”.

“£300K for this? There ought to be a law against it.”

UPDATE 2: One frustrated developer has launched CrimeSearch.co.uk to provide “helpful information about crime and policing in your area, without costing 300k of tax payers’ money”

Getting Started With Local Council Spending Data

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB large, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and that “clumped” transaction dates are used. Also note that each row, column and cell can be commented upon.

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe be worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council website were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this may be something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)
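To show how trivial that cross-council aggregation is, here’s a sketch that merges several councils’ spending CSVs into one table, adding a council-name column to keep rows attributable. The “Supplier”/“Amount” headings are hypothetical – real releases vary, which is exactly why common column headings would matter:

```python
import csv
import io

def combine_council_spending(csv_texts, council_names):
    """Merge several councils' spending CSVs into one list of rows,
    tagging each row with a 'Council' column so data from different
    councils can be compared in a single table.

    Assumes every file shares the same column headings (here the
    hypothetical 'Supplier' and 'Amount').
    """
    rows = []
    for text, council in zip(csv_texts, council_names):
        for row in csv.DictReader(io.StringIO(text)):
            row["Council"] = council
            rows.append(row)
    return rows
```

The combined rows could then be written back out as one CSV and uploaded to a single Fusion Table (or any other tool) for cross-council comparison.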

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog, teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and released the OnTheWight Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
– compare the offerings with each other in more detail, e.g. the columns each view provides;
– compare the offerings with the guidance on release of council spending data;
– see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
– explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard

Investigations tool DocumentCloud goes public (PS: documents drive traffic)

The rather lovely DocumentCloud – a tool that allows journalists to share, annotate, connect and organise documents – has finally emerged from its closet and made itself available to public searches.

This means that anyone can now search the powerful database (some tips here) of newsworthy documents. If you want to add your own, however, you still need approval.

If you do end up on this list you’ll find it’s quite a powerful tool, with quick conversion of PDFs into text files, analytic tools and semantic tagging (so you can connect all documents with a particular person, or organisation) among its best features. The site is open source and has an API too.

I asked Program Director Amanda B Hickman what she’s learned on the project so far. Her response suggests that documents have a particular appeal for online readers:

“If we’ve learned anything, it is that people really love documents. It is pretty clear that when there’s something interesting going on in the news, plenty of people want to dig a little deeper. When Arizona Republic posted an annotated version of that state’s new immigration law, it got more traffic than their weekly entertainment round up. WNYC told us that the page listing the indictments in last week’s mob roundup was still getting more traffic than any other single news story even a week later.

“These were big news documents, to be sure, but it still seems pretty clear that people do want to dig deeper and explore the documents behind the news, which is great for us and great for news.”

Where do I get that data? New Q&A site launched

Get the Data

Well here’s another gap in the data journalism process ever-so-slightly plugged: Tony Hirst blogs about a new Q&A site that Rufus Pollock has built. Get the Data allows you to “ask your data related questions, including, but not limited to, the following:

  • “where to find data relating to a particular issue;
  • “how to query Linked Data sources to get just the data set you require;
  • “what tools to use to explore a data set in a visual way;
  • “how to cleanse data or get it into a format you can work with using third party visualisation or analysis tools.”

As Tony explains (the site came out of a conversation between him and Rufus):

“In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of.”

He also invites people to help populate the site:

“If you publish data via some sort of API or queryable interface, why not consider posting self-answered questions using examples from your FAQ?

“If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.”

Off you go then.

Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers

Where can I find a list of hospitals in the UK along with their location data? Or historical weather data for the UK? Or how do I find the county from a postcode, or a book title from its ISBN? And is there any way you can give me RDF Linked Data in a format I can actually use?!

With increasing amounts of data available, it can still be hard to:

– find the data you want;
– query a datasource to return just the data you want;
– get the data from a datasource in a particular format;
– convert data from one format to another (Excel to RDF, for example, or CSV to JSON);
– get data into a representation that means it can be easily visualised using a pre-existing tool.
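The format-conversion step in particular is often simpler than people expect. As an illustration of the CSV-to-JSON case from the list above, here’s a minimal sketch using only the Python standard library (the sample column names are made up):

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Convert CSV text into a JSON array of objects,
    one object per data row, keyed by the header line."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)
```

Excel-to-RDF is a harder problem (you need to decide on a vocabulary, not just a syntax), but syntactic conversions like this one are a few lines in most languages.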

In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of. Or maybe you know the data is in Linked Data store on data.gov.uk, but you just can’t figure how to get it out?

This is where GetTheData.org comes in. Get The Data arose out of a conversation between myself and Rufus Pollock at the end of last year, which resulted in Rufus setting up the site now known as getTheData.org.

getTheData.org

The idea behind the site is to field questions and answers relating to the practicalities of working with public open data: from discovering data sets, to combining data from different sources in appropriate ways, getting data into formats you can happily work with, or that will play nicely with visualisation or analysis tools you already have, and so on.

At the moment, the site is in its startup/bootstrapping phase, although there is already some handy information up there. What we need now are your questions and answers…

So, if you publish data via some sort of API or queryable interface, why not consider posting self-answered questions using examples from your FAQ?

If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

If you’re looking for data as part of a research project, but can’t find it or can’t get it in an appropriate form that lets you link it to another data set, post a question to GetTheData.

If you want to do some graphical analysis on a data set, but don’t know what tool to use, or how to get the data in the right format for a particular tool, that’d be a good question to ask too.

Which is to say: if you want to GetTheData, but can’t do so for whatever reason, just ask… GetTheData.org

A portal for European government data: PublicData.eu plans

The Open Knowledge Foundation have published a blog post with notes on a site they’re developing to gather together data from across Europe. The post notes the growth of data catalogues at both a national level (mentioning the Digitalisér.dk data portal run by the Danish National IT and Telecom Agency) and “countless city level initiatives across Europe as well – from Helsinki to Munich, Paris to Zaragoza”, with many more initiatives “in the pipeline with plans to launch in the next 6 to 12 months.”

PublicData.eu will, it says:

“Provide a single point of access to open, freely reusable datasets from numerous national, regional and local public bodies throughout Europe.

“[It] will harvest and federate this information to enable users to search, query, process, cache and perform other automated tasks on the data from a single place. This helps to solve the “discoverability problem” of finding interesting data across many different government websites, at many different levels of government, and across the many governments in Europe.”

What is perhaps even more interesting for journalists is that the site plans to:

“Capture (proposed) edits, annotations, comments and uploads from the broader community of public data users.”

That might include anything from cleaner versions of data, to instances where developers match datasets together, or where users add annotations that add context to a particular piece of information.

Finally there’s a general indication that the site hopes to further lower the bar for data and collaborative journalism by:

“Providing basic data analysis and visualisation tools together with more in-depth resources for those looking to dig deeper into the data. Users will be able to personalise their data browsing experience by being able to save links and create notes and comments on datasets.”

More in the post itself. Worth keeping an eye on.