Tag Archives: data

Why open data matters – a (very bad) example from Universal Jobmatch

Open Data stickers image by Jonathan Gray

I come upon examples of bad practice in publishing government data on a regular basis, but the Universal Jobmatch tool is an example so bad I just had to write about it. In fact, it’s worse than the old-fashioned data service that preceded it.

That older service was the Office for National Statistics’ labour market service NOMIS, which published data on Jobcentre vacancies and claimants until late 2012, when Jobcentre Plus was given responsibility for publishing the data using their Universal Jobmatch tool.

Despite a number of concerns, more than a year on, Universal Jobmatch‘s reports section has ignored at least half of the public data principles first drafted by the Government’s Public Sector Transparency Board in 2010, and published in 2012. Continue reading →

Data in Scotland on Help Me Investigate

1 Reply

If you’re interested in doing data journalism in or about Scotland there’s a post over on the Help Me Investigate blog rounding up a number of useful sources (I collected them as part of some two-day data journalism training sessions I have been delivering within a Scottish newspaper publisher). Continue reading →

Searching for a Map of Designated Public Places…

A discussion, earlier, about whether it was now illegal to drink in public…

…I thought not, think not, at least, not generally… My understanding was, that local authorities can set up controlled, alcohol free zones and create some sort of civil offence for being caught drinking alcohol there. (As it is, councils can set up regions where public consumption of alcohol may be prohibited and this prohibition may be enforced by the police.) So surely there must be an #opendata powered ‘no drinking here’ map around somewhere? The sort of thing that might result from a newspaper hack day, something that could provide a handy layer on a pub map? I couldn’t find one, though…

I did a websearch, turned up The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007, which does indeed appear to be the bit of legislation that regulates drinking alcohol in public, along with a link to a corresponding guidance note: Home Office circular 013 / 2007:

16. The provisions of the CJPA [Criminal Justice and Police Act 2001, Chapter 2 Provisions for combatting alcohol-related disorder] should not lead to a comprehensive ban on drinking in the open air.
…
17. It is the case that where there have been no problems of nuisance or annoyance to the public or disorder having been associated with drinking in that place, then a designation order … would not be appropriate. However, experience to date on introducing DPPOs has found that introducing an Order can lead to nuisance or annoyance to the public or disorder associated with public drinking being displaced into immediately adjacent areas that have not been designated for this purpose. … It might therefore be appropriate for a local authority to designate a public area beyond that which is experiencing the immediate problems caused by anti-social drinking if police evidence suggests that the existing problem is likely to be displaced once the DPPO was in place. In which case the designated area could include the area to which the existing problems might be displaced.

Creepy, creep, creep…

This, I thought, was interesting too, in the guidance note:

37. To ensure that the public have full access to information about designation orders made under section 13 of the Act and for monitoring arrangements, Regulation 9 requires all local authorities to send a copy of any designation order to the Secretary of State as soon as reasonably practicable after it has been made.

38. The Home Office will continue to maintain a list of all areas designated under the 2001 Act on the Home Office website: www.crimereduction.gov.uk/alcoholorders01.htm [I’m not convinced that URL works any more…?]

39. In addition, local authorities may wish to consider publicising designation orders made on their own websites, in addition to the publicity requirements of the accompanying Regulations, to help to ensure full public accessibility to this information.

So I’m thinking: this sort of thing could be a great candidate for a guidance note from the Home Office to local councils recommending ways of releasing information about the extent of designation orders as open geodata. (Related? Update from ONS on data interoperability (“Overcoming the incompatibility of statistical and geographic information systems”).)

I couldn’t immediately find a search on data.gov.uk that would turn up related datasets (though presumably the Home Office is aggregating this data, even if it’s just in a filing cabinet or mail folder somewhere*), but a quick websearch for Designated Public Places site:gov.uk intitle:council turned up a wide selection of local council websites along with their myriad ways of interpreting how to release the data. I’m not sure if any of them release the data as geodata, though? Maybe this would be an appropriate test of the scope of the Protection of Freedoms Act Part 6 regulations on the right to request data as data (I need to check them again…)?

* The Home Office did release a table of designated public places in response to an FOI request about designated public place orders, although not as data… But it got me wondering: if I scheduled a monthly FOI request to the Home Office requesting the data on a monthly basis, would they soon stop fulfilling the requests as timewasting? How about if we got a rota going?! Is there any notion of a longitudinal/persistent FOI request, that just keeps on giving (could I request the list of designated public places the Home Office has been informed about over the last year, along with a monthly update of requests in the previous month (or previous month but one, or whatever is reasonable…) over the next 18 months, or two years, or for the life of the regulation, or until such a time as the data is published as open data on a regular basis?

As for the report to government that a local authority must make on passing a designation order – 9. A copy of any order shall be sent to the Secretary of State as soon as reasonably practicable after it has been made. – it seems that how the area denoted as a public space is described is moot: “5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“. Hmmm, two things jump out there…

Firstly, a local authority shall cause to be published in a newspaper circulating in its area [my emphasis; how is a newspaper circulating in its area defined? Do all areas of England have a non-national newspaper circulating in that area? Does this implicitly denote some “official channel” responsibility on local newspapers for the communication of local government notices?]. Hmmm…..

Secondly, the area identified specifically or by description. On commencement, the order must also be made public by “identifying the place which has been identified in the order”, again “in a newspaper circulating in its area”. But I wonder – is there an opportunity there to require something along the lines of and published using an appropriate open data standard in a open public data repository, and maybe further require that this open public data copy is the one that is used as part of the submission informing the Home Office about the regulation? And if we go overboard, how about we further require that each enacted and proposed order is published as such along with a machine readable geodata description and that a single aggregate files containing all that Local Authority’s currently and planned Designated Public Spaces are also published (so one URL for all current spaces, one for all planned ones). Just by the by, does anyone know of any local councils publishing boundary date/shapefiles that mark out their Designated Public Spaces? Please let me know via the comments, if so…

A couple of other, very loosely (alcohol) related, things I found along the way:

Local Alcohol Profiles for England: the aim appears to have been the collation of, and a way of exploring, a “national alcohol dataset”, that maps alcohol related health indicators on a PCT (Primary Care Trust) and LA (local authority) basis. What this immediately got me wondering was: did they produce any tooling, recipes or infrastructure that would it make a few clicks easy to pull together a national tobacco dataset and associated website, for example? And then I found the Local Tobacco Control Profiles for England toolkit on the London Health Observatory website, along with a load of other public health observatories and it made me remember – again – just how many data sensemaking websites there already are out there…
UK Alcohol Strategy – maybe some leads into other datasets/data stories?

PS I wonder if any of the London Boroughs or councils hosting regional events have recently declared any new Designated Public Spaces #becauseOfTheOlympics.

Inter-Council Payments and the Google Fusion Tables Network Graph

One of the great things about aggregating local spending data from different councils in the same place – such as on OpenlyLocal – is that you can start to explore structural relations in the way different public bodies of a similar type spend money with each other.

On the local spend with corporates scraper on Scraperwiki, which I set up to scrape how different councils spent money with particular suppliers, I realised I could also use the scraper to search for how councils spent money with other councils, by searching for suppliers containing phrases such as “district council” or “town council”. (We could also generate views to to see how councils wre spending money with different police authorities, for example.)

(The OpenlyLocal API doesn’t seem to work with the search, so I scraped the search results HTML pages instead. Results are paged, with 30 results per page, and what seems like a maximum of 1500 (50 pages) of results possible.)

The publicmesh table on the scraper captures spend going to a range of councils (not parish councils) from other councils. I also uploaded the data to Google Fusion tables (public mesh spending data), and then started to explore it using the new network graph view (via the Experiment menu). So for example, we can get a quick view over how the various county councils make payments to each other:

Hovering over a node highlights the other nodes its connected to (though it would be good if the text labels from the connected nodes were highlighted and labels for unconnected nodes were greyed out?)

(I think a Graphviz visualisation would actually be better, eg using Canviz, because it can clearly show edges from A to B as well as B to A…)

As with many exploratory visualisations, this view helps us identify some more specific questions we might want to ask of the data, rather than presenting a “finished product”.

As well as the experimental network graph view, I also noticed there’s a new Experimental View for Google Fusion Tables. As well as the normal tabular view, we also get a record view, and (where geo data is identified?) a map view:

What I’d quite like to see is a merging of map and network graph views…

One thing I noticed whilst playing with Google Fusion Tables is that getting different aggregate views is rather clunky and relies on column order in the table. So for example, here’s an aggregated view of how different county councils supply other councils:

In order to aggregate by supplied council, we need to reorder the columns (the aggregate view aggregates columns as thet appear from left to right in the table view). From the Edit column, Modify Table:

(In my browser, I then had to reload the page for the updated schema to be reflected in the view). Then we can get the count aggregation:

It would be so much easier if the aggregation view allowed you to order the columns there…

PS no time to blog this properly right now, but there are a couple of new javascript libraries that are worth mentioning in the datawrangling context.

In part coming out of the Guardian stable, Misoproject is “an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content”. The initial dataset library provides a set of routines for: loading data into the browser from a variety of sources (CSV, Google spreadsheets, JSON), including regular polling; creating and managing data tables and views of those tables within the browser, including column operations such as grouping, statistical operations (min, max, mean, moving average etc); playing nicely with a variety of client side graphics libraries (eg d3.js, Highcharts, Rickshaw and other JQuery graphics plugins).

Recline.js is a library from Max Ogden and the Open Knowledge Foundation that if its name is anything to go by is positioning itself as an alternative (or complement?) to Google Refine. To my mind though, it’s more akin to a Google Fusion Tables style user interface (“classic” version) wherever you need it, via a Javascript library. The data explorer allows you to import and preview CSV, Excel, Google Spreadsheet and ElasticSearch data from a URL, as well as via file upload (so for example, you can try it with the public spend mesh data CSV from Scraperwiki). Data can be sorted, filtered and viewed by facet, and there’s a set of integrated graphical tools for previewing and displaying data too. Refine.js views can also be shared and embedded, which makes this an ideal tool for data publishers to embed in their sites as a way of facilitating engagement with data on-site, as I expect we’ll see on the Data Hub before too long.

More reviews of these two libraries later…

PPS These are also worth a look in respect of generating visualisations based on data stored in Google spreadsheets: DataWrapper and Freedive (like my old Guardian Datastore explorer, but done properly… Wizard led UI that helps you create your own searchable and embeddable database view direct from a Google Spreadsheet).

Working With Excel Spreadsheet Files Without Using Excel…

One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV😉 The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?

There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…

(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)

Firstly, Google Refine. Google Refine is a cross-platform, browser based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconcilation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):

Once loaded, you get a preview view..

You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.

If I then create a project around this dataset, I can trivially export it again using a format of my own preference:

So that’s one way of using Google Refine as a simple file converter service that allows you to preview and to a certain extent shape the data in XLS spreadsheet, as well as converting it to other file types.

The second approach I want to mention is to use a really handy Python software library (xlrd – Excel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.

import scraperwiki
import xlrd

#cribbing https://scraperwiki.com/docs/python/python_excel_guide/
def cellval(cell):
    if cell.ctype == xlrd.XL_CELL_EMPTY:    return None
    return cell.value

def dropper(table):
    if table!='':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

def reGrabber():
    #dropper('GPpracticeLookup')
    url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls'
    xlbin = scraperwiki.scrape(url)
    book = xlrd.open_workbook(file_contents=xlbin)

    sheet = book.sheet_by_index(0)        

    keys = sheet.row_values(8)           
    keys[1] = keys[1].replace('.', '')
    print keys

    for rownumber in range(9, sheet.nrows):           
        # create dictionary of the row values
        values = [ cellval(c) for c in sheet.row(rownumber) ]
        data = dict(zip(keys, values))
        #print data
        scraperwiki.sqlite.save(table_name='GPpracticeLookup',unique_keys=['Practice Code'], data=data)

#Uncomment the next line if you want to regrab the data from the original spreadsheet
reGrabber()

You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.

(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)

So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools other there you can appropriate to help you get the data out of the file and into a form you can work with:-)

PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile onto Max OS/X?

PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…

Exploring GP Practice Level Prescribing Data

Some posts I get a little bit twitchy about writing. Accessing and Visualising Sentencing Data for Local Courts was one, and this is another: exploring practice level prescription data (get the data).

One of the reasons it feels “dangerous” is that the rationale behind the post is to demonstrate some of the mechanics of engaging with the data at a context free level, devoid of any real consideration about what the data represents, whilst using a data set that does have meaning, the interpretation of which can be used as the basis of making judgements about various geographical areas, for example.

The datasets that are the focus of this post relate to GP practice level prescription data. One datafile lists GP practices (I’ve uploaded this to Google Fusion tables), and includes practice name, identifier, and address. I geocoded the Google Fusion tables version of the data according to practice postcode, so we can see on a map how the practices are distributed:

(There are a few errors in the geocoding that could probably be fixed by editing the correspond data rows, and adding something like “, UK” to the postcode. (I’ve often thought it would be handy if you could force Google Fusion Table’s geocoder to only return points within a particular territory…))

The prescription data includes data at the level of item counts by drug name or prescription item per month for each practice. Trivially, we might do something like take the count of methadone prescriptions for each practice, and plot a map sizing points at the location of each practice by the number of methadone prescriptions by that practice. All well and good if we bear in mind the fact the the data hasn’t been normalised by the size of the practice, doesn’t take into account the area over which the patients are distributed, doesn’t take into account the demographics of the practices constituency (or recognise that a particular practice may host a special clinic, or the sample month may have included an event that drew in a large transient population with a particular condition, or whatever). A good example to illustrate this taken from another context might be “murder density” in London. It wouldn’t surprise me if somewhere like Russell Square came out as a hot spot – not because there are lots of murders there, but because a bomb went off on a single occasion killing multiple people… Another example of “crime hot spots” might well be courts or police stations, places that end up being used as default/placeholder locations if the actual location of crime isn’t known. And so on.

The analyst responsible for creating quick and dirty sketch maps will hopefully be mindful of the factors that haven’t been addressed in the construction of a sketch, and will consequently treat with suspicion any result unless they’ve satisfied themselves that various factors have been taken into account, or discount particular results that are not the current focus of the question they are asking themselves of the data in a particular way.

So when it comes to producing a post like this looking at demonstrating some practical skills, care needs to be taken not to produce charts or maps that appear to say one thing when indeed they say nothing… So bear that in mind: this post isn’t about how to generate statistically meaningful charts and tables; it’s about mechanics of getting rows of data out of big files and into a form we can start to try to make sense of them

Another reason I’m a little twitchy about this post relates to describing certain skills in an open and searchable/publicly discoverable forum. (This is one reason why folk often demonstrate core skills on “safe” datasets or randomly generated data files.) In the post Googling Nasties and Oopses on University and Public Sector Websites, a commenter asked: “is it really ethical to post that information?” in the context of an example showing how to search for confidential spreadsheet information using a web search engine. I could imagine a similar charge being leveled at a post that describes certain sorts of data wrangling skills. Maybe some areas of knowledge should be limited to the priesthood..?

To mitigate against any risks of revealing things best left undiscovered, I could draw on the NHS Information Centre’s Evaluation and impact assessment – proposal to publish practice-level prescribing data[PDF] as well as the risks acknowledged by the recent National Audit Office report on Implementing transparency (risks to privacy, of fraud, and other possible unintended consequences). But I won’t, for now…. (dangerrrrrroussssssssss…;-)

(Academically speaking, it might be interesting to go through the NHS Info Centre’s risk assessment and see just how far we can go in making those risks real using the released data set as a “white hat data hacker”, for example! I will go through the risk assessment properly in another post.)

So… let the journey into the data begin, and the reason why I felt the need to have a play with this data set:

Note: Due to the large file size (over 500MB) standard spreadsheet applications will not be able to handle the volumes of data contained in the monthly datasets. Data users will need to analyse the information using specialist data-handling software.

Hmmm… that’s not very accessible is it?!

However, if you’ve read my previous posts on Playing With Large (ish) CSV Files or Postcards from a Text Processing Excursion, or maybe even the aforementioned local sentencing data post, you may have some ideas about how to actually work with this file…

So fear not – if you fancy playing along, you should already be set up tooling wise if you’re on a Mac or a Linux computer. (If you’re on a Windows machine, I cant really help – you’ll probably need to install something like gnuwin or Cygwin – if any Windows users could add support in the comments, please do:-)

Download the data (all 500MB+ of it – it’s published unzipped/uncompressed (a zipped version comes in at a bit less than 100MB)) and launch a terminal.

I downloaded the December 2011 files as nhsPracticesDec2011.csv and nhsPrescribingDataDec2011.CSV so those are the filenames I’ll be using.

To look at the first few lines of each file we can use the head command:

head nhsPrescribingDataDec2011.CSV head nhsPracticesDec2011.csv

Inspection of the practices data suggests that counties for each practice are specified, so I can generate a subset of the practices file listing just practices on the ISLE OF WIGHT by issuing a grep (search) command and sending (>) the result to a new file:

grep WIGHT nhsPracticesDec2011.CSV > wightPracDec2011.csv

The file wightPracDec2011.csv should now contain details of practices (one per row) based on the Isle of Wight. We can inspect the first few lines of the file using the head command, or use more to scroll through the data one page at a time (hit space bar to move on a page, ESCape to exit).

head wightPracDec2011.csv more wightPracDec2011.csv

Hmmm.. there’s a rogue practice in there from the Wirral – let’s refine the grep a little:

grep 'OF WIGHT' nhsPracticesDec2011.CSV > wightPracDec2011.csv more wightPracDec2011.csv

From looking at the data file itslef, along with the prescribing data release notes/glossary, we can see that each practice has a unique identifier. From previewing the head of the prescription data itself, as well as from the documentation, we know that the large prescription data file contains identifiers for each practice too. So based on the previous steps, can you figure out how to pull out the rows from the prescriptions file that relate to drugs issued by the Ventnor medical centre, which has code J84003? Like this, maybe?

grep J84003 nhsPrescribingDataDec2011.CSV > wightPrescDec2011_J84003.csv head wightPrescDec2011_J84003.csv

(It may take a minute or two, so be patient…)

We can check how many rows there actually are as follows:

wc -l wightPrescDec2011_J84003.csv

I was thinking it would be nice to be able to get prescription data from all the Isle of Wight practices, so how might we go about that. From reviewing my previous text mining posts, I noticed that I could pull out data from a file by column:

cut -f 2 -d ',' wightPracDec2011.csv

This lists column two of the file wightPracDec2011.csv where columns are comma delimited.

We can send this list of codes to the grep command to pull out records from the large prescriptions file for each of the codes we grabbed using the cut command (I asked on Twitter for how to do this, and got a reply back that seemed to do the trick pretty much by return of tweet from @smelendez):

cut -d ',' -f 2 wightPracDec2011.csv | grep nhsPrescribingDataDec2011.CSV -f - > iwPrescDec2011.csv more iwPrescDec2011.csv

We can sort the result by column – for example, in alphabetic order by column 5 (-k 5), the drugs column:

sort -t ',' -k 5 iwPrescDec2011.csv | head

Or we can sort by decreasing (-r) total ingredient cost:

sort -t ',' -k 7 -r iwPrescDec2011.csv | head

Or in decreasing order of the largest number of items:

sort -t ',' -k 6 -r iwPrescDec2011.csv | head

One problem with looking at those results is that we can’t obviously recognise the practice. (That might be a good thing, especially if we looked at item counts in increasing order… Whilst we don’t know how many patients were in receipt of one or more items of drug x if 500 or so items were prescribed in the reporting period across several practices, if there is only one item of a particular drug prescribed for one practice, then we’re down to one patient in receipt of that item across the island, which may be enough to identify them…) I leave it as an exercise for the reader to work out how you might reconcile the practice codes with practice names (Merging Datasets with Common Columns in Google Refine might be one way? Merging Two Different Datasets Containing a Common Column With R and R-Studio another..?).

Using the iwPrescDec2011.csv file, we can now search to see how many items of a particular drug are prescribed across island practices using searches of the form:

grep Aspirin iwPrescDec2011.csv grep 'Peppermint Oil' iwPrescDec2011.csv

And this is where we now start to need taking a little care… Scanning through that data by eye, a bit of quick mental arithmetic (divide column 7 by column 6) suggests that the unit price for peppermint oil is different across practices. So is there a good reason for this? I would guess that the practices may well be describing different volumes of peppermint oil as single prescription items, which makes a quick item cost calculation largely meaningless? I guess we need to check the data glossary/documentation to confirm (or deny) this?

Okay – enough for now… maybe I’ll see how we can do a little more digging around this data in another post…

PS Just been doing a bit of doing around other GP practice level datasets – you can find a range of them on the NHS Indicator Portal. As well as administrative links up to PCT and Stategic Health Authority names, you can get data such as the size and demographic make up of each practice’s registration list, data relating to deprivation measures, models for incidence of various health conditions, practice address and phone number, the number of nursing home patients, the number of GPs per practice, the uptake of various IT initiatives(?!), patient experience data, impact on NHS services data… (Apparently a lot of this ata is available in a ‘user friendly’ format on NHS Choices website, ~~but I couldn’t find it offhand…~~ as part of the GP comparison service. Are there any third party sites around built on top of this data also?)

Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

If you click the highlighted selection, you should see the data that will be used to create your project:

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:

The throttle control tells Refine how often to make each call. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with (if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…)?

Having imported the data, you should find a new column:

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exprted the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

import simplejson,csv

writer=csv.writer(open('fbliketest.csv','wb+'),quoting=csv.QUOTE_ALL)

fn='my-fb-friends-likes.txt'

data = simplejson.load(open(fn,'r'))
id=0
for d in data['rows']:
	id=id+1
	#'interests' is the column name containing the Likes data
	interests=simplejson.loads(d['interests'])
	for i in interests['data']:
		print str(id),i['name'],i['category']
		writer.writerow([str(id),i['name'].encode('ascii','ignore')])

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

Mapping the New Year Honours List – Where Did the Honours Go?

When I get a chance, I’ll post a (not totally unsympathetic) response to Milo Yiannopoulos’post The pitiful cult of ‘data journalism’, but in the meantime, here’s a view over some data that was released a couple of days ago – a map of where the New Year Honours went [link]

[Hmm… so WordPress.com doesn’t seem to want to let me embed a Google Fusion Table map iframe, and Google Maps (which are embeddable) just shows an empty folder when I try to view the Fusion Table KML… (the Fusion Table export KML doesn’t seem to include lat/lng data either? Maybe I need to explore some hosting elsewhere this year…]

Note that I wouldn’t make the claim that this represents an example of data journalism. It’s a sketch map showing which parts of the country various recipients of honours this time round presumably live. Just by posting the map, I’m not reporting any particular story. Instead, I’m trying to find a way of looking at the day to see whether or not there may be any interesting stories that are suggested by viewing the data in this way.

There was a small element of work involved in generating the map view, though… Working backwards, when I used Google Fusion tables to geocode the locations of the honoured, some of the points were incorrectly located:

(It would be nice to be able to force a locale to the geocoder, maybe telling it to use maps.google.co.uk as the base, rather than (presumably) maps.google.com?)

The approach I took to tidying these was rather clunky, first going into the table view and filtering on the mispositioned locations:

Then correcting them:

What would be really handy would be if Google Fusion Tables let you see a tabular view of data within a particular map view – so for example, if I could zoom in to the US map and then get a tabular view of the records displayed on that particular local map view… (If it does already support this and I just missed it, please let me know via the comments..;-)

So how did I get the data into Google Fusion Tables? The original data was posted as a PDF on the DirectGov website (New Year Honours List 2012 – in detail)…:

…so I used Scraperwiki to preview and read through the PDF and extract the honours list data (my scraper is a little clunky and doesnlt pull out 100% of the data, missing the occasional name and contribution details when it’s split over several lines; but I think it does a reasonable enough job for now, particularly as I am currently more interested in focussing on the possible high level process for extracting and manipulating the data, rather than the correctness of it…!;-)

Here’s the scraper (feel free to improve upon it….:-): Scraperwiki: New Year Honours 2012

I then did a little bit of tweaking in Google Refine, normalising some of the facets and crudely attempting to separate out each person’s role and the contribution for which the award was made.

For example, in the case of Dr Glenis Carole Basiro DAVEY, given column data of the form “The Open University, Science Faculty and Health Education and Training Programme, Africa. For services to Higher and Health Education.“, we can use the following expressions to generate new sub-columns:

– value.match(/.*(For .*)/)[0] to pull out things like “For services to Higher and Health Education.”
– value.match(/(.*)For .*/)[0] to pull out things like “The Open University, Science Faculty and Health Education and Training Programme, Africa.”

I also ran each person’s record through Reuters Open Calais service using Google Refine’s ability to augment data with data from a URL (“Add column by fetching URLs”), pulling the data back as JSON. Here’s the URL format I used (polling once every 500ms in order to stay with the max. 4 calls per limit threshold mandated by the API.)

"http://api.opencalais.com/enlighten/rest/?licenseID=<strong>MY_LICENSE_KEY</strong>&content=" + escape(value,'url') + "&paramsXML=%3Cc%3Aparams%20xmlns%3Ac%3D%22http%3A%2F%2Fs.opencalais.com%2F1%2Fpred%2F%22%20xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%3E%20%20%3Cc%3AprocessingDirectives%20c%3AcontentType%3D%22TEXT%2FRAW%22%20c%3AoutputFormat%3D%22Application%2FJSON%22%20%20%3E%20%20%3C%2Fc%3AprocessingDirectives%3E%20%20%3Cc%3AuserDirectives%3E%20%20%3C%2Fc%3AuserDirectives%3E%20%20%3Cc%3AexternalMetadata%3E%20%20%3C%2Fc%3AexternalMetadata%3E%20%20%3C%2Fc%3Aparams%3E"

Unpicking this a little:

– licenseID is set to my license key value
– content is the URL escaped version of the text I wanted to process (in this case, I created a new column from the name column that also pulled in data from a second column (the contribution column). The GREL formula I used to join the columns took the form: value+', '+cells["contribution"].value)
– paramsXML is the URL encoded version of the following parameters, which set the content encoding for the result to be JSON (the default is XML):

<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="Application/JSON"  >
</c:processingDirectives>
<c:userDirectives>
</c:userDirectives>
<c:externalMetadata>
</c:externalMetadata>
</c:params>

So much for process – now where are the stories? That’s left, for now, as an exercise for the reader. An obvious starting point is just to see who received honours in your locale. Remember, Google Fusion Tables lets you generate all sorts of filtered views, so it’s not too hard to map where the MBEs vs OBEs are based, for example, or have a stab at where awards relating to services to Higher Education went. Some awards also have a high correspondence with a particular location, as for example in the case of Enfield…

If you do generate any interesting views from the New Year Honours 2012 Fusion Table, please post a link in the comments. And if you find a problem with/fix for the data or the scraper, please post that info in a comment too:-)

More Dabblings With Local Sentencing Data

In Accessing and Visualising Sentencing Data for Local Courts I posted a couple of quick ways in to playing with Ministry of Justice sentencing data for the period July 2010-June 2011 at the local court level. At the end of the post, I wondered about how to wrangle the data in R so that I could look at percentage-wise comparisons between different factors (Age, gender) and offence type and mentioned that I’d posted a related question to to the Cross Validated/Stats Exchange site (Casting multidimensional data in R into a data frame).

Courtesy of Chase, I have an answer🙂 So let’s see how it plays out…

To start, let’s just load the Isle of Wight court sentencing data into RStudio:

require(ggplot2) require(reshape2) iw = read.csv("http://dl.dropbox.com/u/1156404/wightCrimRecords.csv")

Now we’re going to shape the data so that we can plot the percentage of each offence type by gender (limited to Male and Female options):

iw.m = melt(iw, id.vars = "sex", measure.vars = "Offence_type") iw.sex = ddply(iw.m, "sex", function(x) as.data.frame(prop.table(table(x$value)))) ggplot(subset(iw.sex,sex=='Female'|sex=='Male')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~sex)+ opts(axis.text.x=theme_text(angle=-90)) + xlab('Offence Type')

Here’s the result:

We can also process the data over a couple of variables. So for example, we can look to see how female recorded sentences break down by offence type and age range, displaying the results as a percentage of how often each offence type on its own was recorded by age:

iw.m2 = melt(iw, id.vars = c("sex","Offence_type" ), measure.vars = "AGE") iw.off=ddply(iw.m2, c("sex","Offence_type"), function(x) as.data.frame(prop.table(table(x$value))))

ggplot(subset(iw.off,sex=='Female')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~Offence_type) + opts(axis.text.x=theme_text(angle=-90)) + xlab('Age Range (Female)')

Note that this graphic may actually be a little misleading because percentage based reports donlt play well with small numbers…: whilst there are multiple Driving Offences recorded, there are only two Burglaries, so the statistical distribution of convicted female burglars is based over a population of size two… A count would be a better way of showing this

PS I was hoping to be able to just transmute the variables and generate a raft of other charts, but I seem to be getting an error, maybe because some rows are missing? So: anyone know where I’m supposed to post R library bug reports?

Accessing and Visualising Sentencing Data for Local Courts

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

– the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
– the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

head recordlevel.csv

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i reads: “ignoring case”; grep is a command that identifies rows contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv” )

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

We can also introduce filters to allow use to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?

R/RStudio

If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the reportlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (i apply the regular expression to the contents of the court column:

recordlevel <- read.csv("~/data/recordlevel.csv") iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

age=table(iw$AGE) barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw$AGE, iw$Offence_type) barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

require(ggplot2) #You may need to install ggplot2 as a library if it isn't already installed ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + opts(axis.text.x=theme_text(angle=-90))+xlab('Offence Type')

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below 😉

PS HEre’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male']) iwpFemale=data.frame(iwp['Female'])
We can then plot these percentages using constructions of the form:
ggplot(iwp2)+geom_bar(aes(x=Male.x,y=Male.Freq))
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)