Tagged: opendata

Accessing and Visualising Sentencing Data for Local Courts

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

- the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
- the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

head recordlevel.csv

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i reads: “ignoring case”; grep is a command that identifies rows contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv” )

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Isle fo wight sentencing data

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

Visualising data in google fusion tables

We can also introduce filters to allow use to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Data exploration in Google FUsion tables

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?

R/RStudio

If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the reportlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (i apply the regular expression to the contents of the court column:

recordlevel <- read.csv("~/data/recordlevel.csv")
iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

age=table(iw$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

R - bar plot

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw$AGE, iw$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R barplot - offences on IW

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R - stacked bar chart

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

require(ggplot2)
#You may need to install ggplot2 as a library if it isn't already installed
ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + opts(axis.text.x=theme_text(angle=-90))+xlab('Offence Type')

GGPlot2 in R

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

ggplot facet barplot

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

ggplot with stacked factor

One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below ;-)

PS HEre’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:

yeroon.net/ggplot2

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male'])
iwpFemale=data.frame(iwp['Female'])

We can then plot these percentages using constructions of the form:
ggplot(iwp2)+geom_bar(aes(x=Male.x,y=Male.Freq))
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)


Two New Cabinet Office Open Data Consultations: Data Policy and Making Open Data Real

Via the Guardian Datablog, I see that the Cabinet Office has just opened up a couple of consultations around open data:

- Consultation on Data Policy for a Public Data Corporation [homepage] [Consultation]

Here are the consultation questions (also available via SurveyMonkey: PDC consultation):

Chapter 4 – Charging for PDC information

  1. How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability? Please provide evidence to support your answer where possible.
  2. Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and re-use? Who would these benefit and how? Please provide evidence to support your answer where possible.
  3. What do you think the impacts of the three options would be for you and/or other groups outlined above? Please provide evidence to support your answer where possible.
  4. A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?
  5. Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible.

Chapter 5 – Licensing

  1. To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC?
  2. To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.
  3. What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments
  4. Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?

Chapter 6 – Regulatory oversight

  1. To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?
  2. Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?
  3. What would be an appropriate timescale for reviewing a PDC or its constituent parts public task(s)?

And the second consultation (which is probably worth reading in the context of the http://www.cabinetoffice.gov.uk/resource-library/open-public-services-white-paper [white paper PDF, feedback website]?)

- Making Open Data Real: A Public Consultation [homepage] [Consultation]

  1. Glossary of key terms [link]
  2. An enhanced right to data: how do we establish stronger rights for individuals, businesses and other actors to obtain, use and re-use data from public service providers? [link]
  3. Setting transparency standards: what would standards that enforce this right to data among public authorities look like? [link]
  4. Corporate and personal responsibility: how would public service providers be held to account for delivering open data through a clear governance and leadership framework at political, organisational and individual level? [link]
  5. Meaningful Open Data: how should we ensure collection and publication of the most useful data, through an approach enabling public service providers to understand the value of the data they hold and helps the public at large know what data is collected? [link]
  6. Government sets the example: in what ways could we make the internal workings of government and the public sector as open as possible? [link]
  7. Innovation with Open Data: to what extent is there a role for government to stimulate enterprise and market making in the use of open data? [link]

I haven’t had chance to read through the consultation docs yet, but I’ll try and comment somewhere, as well as responding…

The way the consultations are presented

As to the way the consultations are presented themselves, two approaches have been taken:

- the PDC consultation embeds documenents at chapter level hosted on Scribd in a preview widget, with questions made available via a Word document or via SurveyMonkey. There doesn’t appear to be an opportunity to comment on the BIS site that is hosting the PDC consultation, even though it’s a WordPress platform running the Commentariat2 theme. To my mind, the way this consultation has be published, it’s not really of the web, and, to use a technical term, feels a little bit horrible to me… Maybe they don’t want flame wars on the bis.gov.uk domain about “Charging for PDC information”?!;-)

- the Making it Real consultation is hosted on the data.gov.uk site, with HTML text split at “chapter” (section) level, and commenting at that level via a single bottom of the page comment box. Where documents take close reading, I think this makes commenting difficult: if you want to refer to specific, detailed points in the consultation document, I’d say it makes sense to be able to see comment at the point of reference. That is, the comment box needs to be where you can see the actual bit of text you are commenting on (which is one reason why we often annotate documents with marginalia, rather than on a separate piece of paper). Where the comment box is fixed at the bottom of the page, you need two windows open to have side by side commenting and viewing of the actual text you are commenting on.

If we hadn’t decided that things had moved on enough in the way consultations were being handled to close WriteToReply (WriteToReply is closing. Come get your data if you want it), I think there’s a good chance we would have hosted both these consultations… Maybe our thinking that WriteToReply had nudged things far enough was a bit hasty? (The digress.it theme is out there, but as yet hasn’t been trialled on a departmental basis, I don’t think, even though we did try to respond to the commissioned accessibility audit. (Are Scribd docs accessible?) Digress.it is running on the JISCPress site though.

(I’m suddenly fired up again by the thought that consultation docs could be so much more “of the web” as well as easier to engage with… Hmmm, when’s the closing date for these consultations? Maybe there is time for one last WriteToReply outing…?)

PS How did I miss out on subscribing to the Government Digital Service? e.g. Neil Williams on A vision for online consultation and policy engagement


Getting Started With Local Council Spending Data

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB large, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and “clumped” transaction dates use? Also note that each row, column and cell is commentable upon

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this maybe something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and released the OnTheWIght Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
- compare the offerings with each other in more detail, e.g. the columns each view provides;
- compare the offerings with the guidance on release of council spending data;
- see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
- explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard


Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,0000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put some of query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, a reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and use these feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datstrore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why its interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)


So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]) , a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is a still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context free way, (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbiests or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or "discovered" by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived. ("According to the Guardian/Times/Telegraph/FT, etc etc etc". To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lays in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries, in a bookmarkable form (e.g. using the COI website data explorer (see also this for more hints), please feel free to share the links in the comments below :-)


An open letter to Tim Berners-Lee about open government

Following the tone set so succinctly by Glyn Moody, I thought I would add my own thoughts on what Sir Tim should say to the government when he bends their ear on transparency.

Firstly, I would second everything that Glyn says.

But I’m going to be cynical and strategic, and urge Sir Tim to emphasise the importance of open data on a couple of areas that are close to the government’s hearts.

1. Stimulating growth in the economy.

You could compare a genuinely significant release of public data to an economic stimulus.

Like cutting VAT, only cheaper.

At minimal cost you could have a new raw material that startups and established media organisations alike could create new value out of. Some of those would create commercial implications far exceeding any revenue generated within government (as research recently suggested in relation to the comparably valuable Ordnance Survey data).

Repeat after me: jobs and money, jobs and money.

2. Efficiencies and passing on costs in the public sector

Samuel Butler’s Erewhon puts it particularly well:

You will sooner gain your end by “appealing to men’s pockets, in which they have generally something of their own, than to their heads, which contain for the most part little but borrowed or stolen property”

Public sector spending is going to drop whichever party is in power. Let’s play to that.

By opening up public data the government will effectively be able to pass on some development costs to willing volunteers who mash up the data in their own ways. The difference is that people will do this to their own agendas and for their own benefit.

But more importantly, the results of this experimentation – if supported and encouraged – should produce work that makes it more efficient to interact with public data and therefore public bodies. If I can use a slider to find out which schools are within 3 miles, that saves 20 minutes of someone answering a phonecall in the local education department. If I can have a Facebook app which tells other users how much money alcohol abuse is costing my local hospital, it might save the NHS a bob or two. You get the picture. 

Oh yes, and it’s important for democracy, civic engagement and digital literacy

The limited data that’s available in the UK is an embarrassment. Imagine what MySociety could do with what’s available in the US.

Likewise, for all the talk of transparency, the recent announcement that Cabinet Papers and information relating to the Royal Family would be exempt from the Freedom of Information act is a backward step. Heather Brooke’s concerns proved right.

The cynic in me sees the appointment of Berners-Lee as an action intended to generate the illusion of movement – “We’re working on it”. But the Freedom of Information act is possibly the most positive contribution the Labour government has made to this country’s political health since it came to power, and not to follow through on promises made would be an enormous political mistake.

So I will add one request to my advice above: I would stress that any discussion of transparency acknowledges the importance of requiring any organisation using public funds to make their data public too. So much public work is outsourced to the private sector that it is particularly difficult to see whether public money is spent responsibly.

More at Podnosh, BBC, Emma Mulqueeny, Simon Dickson and Amused Cynicism.