Tag Archives: onlinejournalismblog

Local Council Announcements via Newspapers, and Maybe Hyperlocal Blogs, Too…?

In a post on local council declarations of designated public place, I remarked on the following clause in The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007:

“5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“

and idly wondered: how is a “newspaper circulating in its area” defined?

In Statutory Instrument 2012 No. 2089, The Local Authorities (Executive Arrangements) (Meetings and Access to Information) (England) Regulations 2012, which comes into force on September 10th, 2012, there is the following note on interpretations to be used within the regulations:

“newspaper” includes—
(a) a news agency which systematically carries on the business of selling and supplying reports or information to the newspapers; and
(b) any organisation which is systematically engaged in collecting news—
(i) for sound or television broadcasts;
(ii) for inclusion in programmes to be included in any programme service within the meaning of the Broadcasting Act 1990(6) other than a sound or television broadcasting service within the meaning of Part 3 or Part 1 of that Act respectively; or
(iii) for use in electronic or any other format to provide news to the public by means of the internet; [my emphasis]

So, for the purposes of those regulations, a newspaper includes any organisation which is systematically engaged in collecting news for use in electronic or any other format to provide news to the public by means of the internet. (The 2007 regs interpretation section don’t clarify the meaning of “newspapers”.)

This presumably means, for example, that under regulation 14(2), my local hyperlocal blog, Ventnorblog, could request as a newspaper a copy of any of the documents available for public inspection following payment to the Isle of Wight Council of “postage, copying or other necessary charge for transmission”. Assuming Ventnorblog passes the test of (a) being an organisation, that (b) is systematically engaged in collecting news, and for backwards compatibility with other regulations can be show to be (c) circulating in the local council area.

If this interpretation of newspapers applies more widely, (for example, if this interpretation is now applied across outstanding Local Government regulations), it also suggests that whereas councils would traditionally have had to place an advert in their local (print) newspaper, now they can do it via a hyperlocal blog?

So this might also include announcements about Licensing of public entertainments [1(4)], or adult shops [2(2)], tattoo shops [13(6)], temporary markets [37(1)], traffic orders [17(2a)], etc etc

PS Hmm, I wonder, is there a single list somewhere detailing all the legislation that requires local councils to “publish in a newspaper circulating in the area”…?

PPS via @robandale, this Local Government Information Unit initiative on Reforming Statutory Notices. It links to a survey for local councils relating to “how many notices are produced, how much they cost, how effective you believe they are etc.” to try to get a baseline on current practice.

A quick peek at the Isle of Wight Council Armchair Auditor (as run by Ventnorblog and updated from Adrian Short’s original version) gives us an idea of how much the Isle of Wight Council spends on “Advertising and Publicity” in areas such as “Traffic Management” (so presumably statutory notices relating to roadworks, road closures etc?) with the local rag, the Isle of Wight County Press.

Presumably a search on OpenlyLocal’s council spending dashboard would turn up similar spending categories for other councils? (It could be quite interesting to try exploring that… I’m not sure if data on the OpenlyLocal spending dashboard has been updated lately, though? I think OpenSpending take their data from OpenlyLocal, so that isn’t much additional help. And the DCLG opendatacommunities.org site only has budget related finance data at the moment?)

Hmm..thinks.. you could get an idea of how much external spending burden different bits of legislation impose on councils from their spending data, couldn’t you?

Olympics Swimming Lap Charts from the New York Times

Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the widely used graphical devices used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:

Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:

(Great examples of how to read a race history charts can be found on the IntelligentF1 blog. For the general case, see The IntelligentF1 model.)

Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.

I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:

One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.

A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):

(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)

All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.

A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times, (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:

Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.

[Here's an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they'd done similar views in previous years using a Flash component, but the current iteration uses d3.js]

The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.

Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:

(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)

Of course, without the data, it can be difficult trying to generate these charts…

…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)

PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.

London Olympics 2012 Medal Tables At A Glance?

Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get an sense of how the rankings might change if other medal colours were taken into account in some way.

Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing
Olympic Games Data
).

Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.

Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data using treemaps, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.

The data I have scraped is represented using rows of the form:

Country, Event, Gold, Silver, Bronze

where Event is at the level of “Swimming”, “Cycling” etc rather than at finer levels of detail (it’s really hard finding data at even this level of data in an easily grabbable way?)

I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:

  • MedalType, Country, Event
  • MedalType, Event, Country
  • Event, MedalType, Country
  • Event, Country, MedalType
  • Country, MedalType, Event
  • Country, Event, MedalType

Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.

First up, ordered by Medal, Country, Event:

This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:

The colouring is a bit off – the Google components is not as directly scriptable as a d3js treemap, for example – but with a bit of experimentation it may be able to find a colour scheme that better indicates the number of medals allocated in each case.

The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (ie we’re happy to take a point of view based on the the award of any medal type:

The Country-Event-Medal view gives us a view of the relative strength in depth of each country in each event:

and the Country Medal Event view allows us to then tunnel in on the gold winning events:

I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.

It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as exploring just as quick doodles with the Google chart component simpler treemaps that don’t reveal lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)

Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events- 100m and Long Jump, for example, rather than Tack and Field, as well as the names of individual medalists.

PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances

PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship

Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

However, suppose that you need to get the data into a different shape – maybe one line per country with an additional column specifying the medal type. Something like this, for example:

How can we generate that sort of view from the original data set? Here’s one way, that works when the columns you want to split into row values are contiguous (that is, next to each other). From the first column in the set of columns you want to be transformed, select Transpose > Transpose cells across columns into rows:

We now set the original selected column headers to be the cell value within a new column – MedalType – and the original cell values the value within a Country column:

(Note that we could also just transform the data into a single column. For example, suppose we had columns relating to courses currently taken by a particular student (Course 1, Course 2, Course 3), with a course code as cell value and one, two or three columns populated per student. If we wanted one row per student per course, we could just map the three columns onto a single column – CourseCode – and assign multiple rows to each student, then filtering out rows with a blank value in the CourseCOde column as required.)

Ticking the Fill down in other columns checkbox ensures that the appropriate Sport and Event values are copied in to the newly created rows:

Having worked out how to do that oft-required bit of data reshaping, I thought I could probably have another go at something that has been troubling me for ages – how to generate multiple rows from a single row where one of the columns contains JSON data (maybe pulled from a web service/API) that contains multiple items. This is a “mate in three” sort of problem, so here’s how I started to try to work it back. Given that I now know how to map columns onto rows, can I work out how to map different results in a JSON response onto different columns?

For example, here’s a result from the Facebook API for a search on a particular OU course code and the word open in a Facebook group name:

{“data”:[{"version":1,"name":"U101 (Open University) start date February 2012","id":"325165900838311"},{"version":1,"name":"Open university, u101- design thinking, October 2011","id":"250227311674865"},{"version":1,"name":"Feb 2011 Starters U101 Design Thinking - Open University","id":"121552081246861"},{"version":1,"name":"Open University - U101 Design Thinking, Feburary 2011","id":"167769429928476"}],”paging”:{“next”:…etc…}}

It returns a couple of results in the data element, in particular group name and group ID. Here’s one way I found of creating one row per group… Start off by creating a new column based on the JSON data column that parses the results in the data element into a list:

We can then iterate over the list items in this new column using the forEach grel command. The join command then joins the elements within each list item, specifically the group ID and name values in each result:

forEach(value.parseJson(),v,[v.id,v.name].join('||'))

You’ll notice that for multiple results, this produces a list of joined items, which we can also join together by extending the GREL expression:

forEach(value.parseJson(),v,[v.id,v.name].join('||')).join('::')

We now have a column that contains ‘||’ and ‘::’ separated items – :: separates individual group results from each other, || separates the id and name for each particular group.

Given we know how to create rows from multiple columns, we could try to split this column into separate columns using Edit column > Split into separate columns. This would create one column per result, which we could then transform into rows, as we did above. Whilst I don’t recommend this route in this particular case, here’s how we could go about doing it…

A far better approach is to use the Edit cells > split multi-valued cells option to automatically create new rows based on splitting the elements in a single column:

Note, however that this creates blanks in the other columns, so we need to Edit cells > Fill down to fill in the blanks in any other columns we want to refer to. After doing that, we end up with something like this:

We could now split the groupPairs column using the || separator to create two columns – Group ID and group name – giving us one row per group, and separate columns identifying the course, group name and group ID.

If the above route seems a little complicated, fear not…Once you apply it, it starts to make sense!

Searching for a Map of Designated Public Places…

A discussion, earlier, about whether it was now illegal to drink in public…

…I thought not, think not, at least, not generally… My understanding was, that local authorities can set up controlled, alcohol free zones and create some sort of civil offence for being caught drinking alcohol there. (As it is, councils can set up regions where public consumption of alcohol may be prohibited and this prohibition may be enforced by the police.) So surely there must be an #opendata powered ‘no drinking here’ map around somewhere? The sort of thing that might result from a newspaper hack day, something that could provide a handy layer on a pub map? I couldn’t find one, though…

I did a websearch, turned up The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007, which does indeed appear to be the bit of legislation that regulates drinking alcohol in public, along with a link to a corresponding guidance note: Home Office circular 013 / 2007:

16. The provisions of the CJPA [Criminal Justice and Police Act 2001, Chapter 2 Provisions for combatting alcohol-related disorder] should not lead to a comprehensive ban on drinking in the open air.

17. It is the case that where there have been no problems of nuisance or annoyance to the public or disorder having been associated with drinking in that place, then a designation order … would not be appropriate. However, experience to date on introducing DPPOs has found that introducing an Order can lead to nuisance or annoyance to the public or disorder associated with public drinking being displaced into immediately adjacent areas that have not been designated for this purpose. … It might therefore be appropriate for a local authority to designate a public area beyond that which is experiencing the immediate problems caused by anti-social drinking if police evidence suggests that the existing problem is likely to be displaced once the DPPO was in place. In which case the designated area could include the area to which the existing problems might be displaced.

Creepy, creep, creep…

This, I thought, was interesting too, in the guidance note:

37. To ensure that the public have full access to information about designation orders made under section 13 of the Act and for monitoring arrangements, Regulation 9 requires all local authorities to send a copy of any designation order to the Secretary of State as soon as reasonably practicable after it has been made.

38. The Home Office will continue to maintain a list of all areas designated under the 2001 Act on the Home Office website: www.crimereduction.gov.uk/alcoholorders01.htm [I'm not convinced that URL works any more...?]

39. In addition, local authorities may wish to consider publicising designation orders made on their own websites, in addition to the publicity requirements of the accompanying Regulations, to help to ensure full public accessibility to this information.

So I’m thinking: this sort of thing could be a great candidate for a guidance note from the Home Office to local councils recommending ways of releasing information about the extent of designation orders as open geodata. (Related? Update from ONS on data interoperability (“Overcoming the incompatibility of statistical and geographic information systems”).)

I couldn’t immediately find a search on data.gov.uk that would turn up related datasets (though presumably the Home Office is aggregating this data, even if it’s just in a filing cabinet or mail folder somewhere*), but a quick websearch for Designated Public Places site:gov.uk intitle:council turned up a wide selection of local council websites along with their myriad ways of interpreting how to release the data. I’m not sure if any of them release the data as geodata, though? Maybe this would be an appropriate test of the scope of the Protection of Freedoms Act Part 6 regulations on the right to request data as data (I need to check them again…)?

* The Home Office did release a table of designated public places in response to an FOI request about designated public place orders, although not as data… But it got me wondering: if I scheduled a monthly FOI request to the Home Office requesting the data on a monthly basis, would they soon stop fulfilling the requests as timewasting? How about if we got a rota going?! Is there any notion of a longitudinal/persistent FOI request, that just keeps on giving (could I request the list of designated public places the Home Office has been informed about over the last year, along with a monthly update of requests in the previous month (or previous month but one, or whatever is reasonable…) over the next 18 months, or two years, or for the life of the regulation, or until such a time as the data is published as open data on a regular basis?

As for the report to government that a local authority must make on passing a designation order – 9. A copy of any order shall be sent to the Secretary of State as soon as reasonably practicable after it has been made. – it seems that how the area denoted as a public space is described is moot: “5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“. Hmmm, two things jump out there…

Firstly, a local authority shall cause to be published in a newspaper circulating in its area [my emphasis; how is a newspaper circulating in its area defined? Do all areas of England have a non-national newspaper circulating in that area? Does this implicitly denote some "official channel" responsibility on local newspapers for the communication of local government notices?]. Hmmm…..

Secondly, the area identified specifically or by description. On commencement, the order must also be made public by “identifying the place which has been identified in the order”, again “in a newspaper circulating in its area”. But I wonder – is there an opportunity there to require something along the lines of and published using an appropriate open data standard in a open public data repository, and maybe further require that this open public data copy is the one that is used as part of the submission informing the Home Office about the regulation? And if we go overboard, how about we further require that each enacted and proposed order is published as such along with a machine readable geodata description and that a single aggregate files containing all that Local Authority’s currently and planned Designated Public Spaces are also published (so one URL for all current spaces, one for all planned ones). Just by the by, does anyone know of any local councils publishing boundary date/shapefiles that mark out their Designated Public Spaces? Please let me know via the comments, if so…

A couple of other, very loosely (alcohol) related, things I found along the way:

  • Local Alcohol Profiles for England: the aim appears to have been the collation of, and a way of exploring, a “national alcohol dataset”, that maps alcohol related health indicators on a PCT (Primary Care Trust) and LA (local authority) basis. What this immediately got me wondering was: did they produce any tooling, recipes or infrastructure that would it make a few clicks easy to pull together a national tobacco dataset and associated website, for example? And then I found the Local Tobacco Control Profiles for England toolkit on the London Health Observatory website, along with a load of other public health observatories and it made me remember – again – just how many data sensemaking websites there already are out there…
  • UK Alcohol Strategy – maybe some leads into other datasets/data stories?

PS I wonder if any of the London Boroughs or councils hosting regional events have recently declared any new Designated Public Spaces #becauseOfTheOlympics.

Quick and Dirty Recipe: Merging (Concatenating) Multiple CSV files (ODA Spending)

There’s been a flurry of tweets over the last few days about LOCOG’s exemption from FOI (example LOCOG response to an FOI request), but the Olympic Delivery Authority (ODA, one of the owner stakeholders) is rather more open, and publishes its spends over £25k: ODA Finance: Transparency Reports.

CSV files containing spend on a monthly basis are available from the site, using a consistent CSV file format each time (I think…). For what it’s worth, I thought it might be worth sharing a pragmatic, though not ideal, Mac/Linux/unixtools commandline recipe for generating a single file containing all this data.

  1. Right-click and download each of the CSV files on the page to the same directory (eg odaSpending) on your local machine. (There are easier ways of doing this – I tried wget on the command line, but got an Access Denied response (workaround anyone?); there are probably more than a few browser extensions/plugins that will also download all the files linked to from a page. If so, you just want to grab the csv files; if you get them all, from the command line, just copy the csv files to a new directory: eg mkdir csvfiles;cp *.csv csvfiles)
  2. On the commandline, change directory to the files directory – eg cd odaSpending/csvfiles; then join all the files together: files=*; cat $files > odaspending.csv
  3. You should now have a big file, odaspending.csv, containing all the data, although it will also contain multiple header rows (each csv file had its own header row). Open the file in a text editor (I use TextWrangler), copy from the start of the first line to the start of the second (ie copy the header row, including the end of line/carriage return), then do a Find on the header and global Replace with nothing replacing the search string. Then, depending where you started the replace, maybe paste the header (if required) back into the first row

To turn the data file into something you can explore more interactively, upload it to something like Google Fusion Tables, as I did here (data to May 2012): ODA Spending in Google Fusion Tables

Note that this recipe is a pragmatic one. Unix gurus would surely be able to work out far more efficient scripts that concatenate the files after stripping out the header in all but the first file, for example, or that maybe even check the columns are the same etc etc. But if you want something quick and dirty, this is one way of doing it… (Please feel free to add alternative recipes for achieving the same thing in the comments…)

PS here’s an example of one sort of report you can then create in Fusion Tables – ODA spend with G4S; here’s another: Seconded staff

Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:

There’s still a fair bit to do making the methodology robust (for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the set differs to a significant extent (for example, there is a large difference between the number of tweeting Conservative and LibDem MPs). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe hints at a technique akin to the way a photogrpaher will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective about the way different groups differentially follow folk on Twitter. A couple of other quirks I’ve noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear) – there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud. (So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman – and @davegorman appeared as a small label in the friends part of the comparison.cloud (notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do… What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed because it would be too large)).

That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess way forward is to start picking apart the comparison.cloud code, another is to explore a few more comparison sets? Suggestions welcome as to what they might be…:-)

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…

F1 Championship Points as a d3.js Powered Sankey Diagram

d3.js crossed my path a couple of times yesterday: firstly, in the form of an enquiry about whether I’d be interested in writing a book on d3.js (I’m not sure I’m qualified: as I responded, I’m more of a script kiddie who sees things I can reuse, rather than have any understanding at all about how d3.js does what it does…); secondly, via a link to d3.js creator Mike Bostock’s new demo of Sankey diagrams built using d3.js:

Hmm… Sankey diagrams are good for visualising flow, so to get to grips myself with seeing if I could plug-and-play with the component, I needed an appropriate data set. F1 related data is usually my first thought as far as testbed data goes (no confidences to break, the STEM/innovation outreach/tech transfer context, etc etc) so what things flow in F1? What quantities are conserved whilst being passed between different classes of entity? How about points… points are awarded on a per race basis to drivers who are members of teams. It’s also a championship sport, run over several races. The individual Driver Championship is a competition between drivers to accumulate the most points over the course of the season, and the Constructor Chanmpionship is a battle between teams. Which suggests to me that a Sankey plot of points from races to drivers and then constructors might work?

So what do we need to do? First up, look at the source code for the demo using View Source. Here’s the relevant bit:

Data is being pulled in from a relatively addressed file, energy.json. Let’s see what it looks like:

Okay – a node list and an edge list. From previous experience, I know that there is a d3.js JSON exporter built into the Python networkx library, so maybe we can generate the data file from a network representation of the data in networkx?

Here we are: node_link_data(G) “[r]eturn data in node-link format that is suitable for JSON serialization and use in Javascript documents.”

Next step – getting the data. I’ve already done a demo of visualising F1 championship points sourced from the Ergast motor racing API as a treemap (but not blogged it? Hmmm…. must fix that) that draws on a JSON data feed constructed from data extracted from the Ergast API so I can clone that code and use it as the basis for constructing a directed graph that represents points allocations: race nodes are linked to driver nodes with edges weighted by points scored in that race, and driver nodes are connected to teams by edges weighted according to the total number of points the driver has earned so far. (Hmm, that gives me an idea for a better way of coding the weight for that edge…)

I don’t have time to blog the how to of the code right now – train and boat to catch – but will do so later. If you want to look at the code, it’s here: Ergast Championship nodelist. And here’s the result – F1 Chanpionship 2012 Points as a Sankey Diagram:

See what I mean about being a cut and paste script kiddie?!;-)

Inter-Council Payments and the Google Fusion Tables Network Graph

One of the great things about aggregating local spending data from different councils in the same place – such as on OpenlyLocal – is that you can start to explore structural relations in the way different public bodies of a similar type spend money with each other.

On the local spend with corporates scraper on Scraperwiki, which I set up to scrape how different councils spent money with particular suppliers, I realised I could also use the scraper to search for how councils spent money with other councils, by searching for suppliers containing phrases such as “district council” or “town council”. (We could also generate views to to see how councils wre spending money with different police authorities, for example.)

(The OpenlyLocal API doesn’t seem to work with the search, so I scraped the search results HTML pages instead. Results are paged, with 30 results per page, and what seems like a maximum of 1500 (50 pages) of results possible.)

The publicmesh table on the scraper captures spend going to a range of councils (not parish councils) from other councils. I also uploaded the data to Google Fusion tables (public mesh spending data), and then started to explore it using the new network graph view (via the Experiment menu). So for example, we can get a quick view over how the various county councils make payments to each other:

Hovering over a node highlights the other nodes its connected to (though it would be good if the text labels from the connected nodes were highlighted and labels for unconnected nodes were greyed out?)

(I think a Graphviz visualisation would actually be better, eg using Canviz, because it can clearly show edges from A to B as well as B to A…)

As with many exploratory visualisations, this view helps us identify some more specific questions we might want to ask of the data, rather than presenting a “finished product”.

As well as the experimental network graph view, I also noticed there’s a new Experimental View for Google Fusion Tables. As well as the normal tabular view, we also get a record view, and (where geo data is identified?) a map view:

What I’d quite like to see is a merging of map and network graph views…

One thing I noticed whilst playing with Google Fusion Tables is that getting different aggregate views is rather clunky and relies on column order in the table. So for example, here’s an aggregated view of how different county councils supply other councils:

In order to aggregate by supplied council, we need to reorder the columns (the aggregate view aggregates columns as thet appear from left to right in the table view). From the Edit column, Modify Table:

(In my browser, I then had to reload the page for the updated schema to be reflected in the view). Then we can get the count aggregation:

It would be so much easier if the aggregation view allowed you to order the columns there…

PS no time to blog this properly right now, but there are a couple of new javascript libraries that are worth mentioning in the datawrangling context.

In part coming out of the Guardian stable, Misoproject is “an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content”. The initial dataset library provides a set of routines for: loading data into the browser from a variety of sources (CSV, Google spreadsheets, JSON), including regular polling; creating and managing data tables and views of those tables within the browser, including column operations such as grouping, statistical operations (min, max, mean, moving average etc); playing nicely with a variety of client side graphics libraries (eg d3.js, Highcharts, Rickshaw and other JQuery graphics plugins).

Recline.js is a library from Max Ogden and the Open Knowledge Foundation that if its name is anything to go by is positioning itself as an alternative (or complement?) to Google Refine. To my mind though, it’s more akin to a Google Fusion Tables style user interface (“classic” version) wherever you need it, via a Javascript library. The data explorer allows you to import and preview CSV, Excel, Google Spreadsheet and ElasticSearch data from a URL, as well as via file upload (so for example, you can try it with the public spend mesh data CSV from Scraperwiki). Data can be sorted, filtered and viewed by facet, and there’s a set of integrated graphical tools for previewing and displaying data too. Refine.js views can also be shared and embedded, which makes this an ideal tool for data publishers to embed in their sites as a way of facilitating engagement with data on-site, as I expect we’ll see on the Data Hub before too long.

More reviews of these two libraries later…

PPS These are also worth a look in respect of generating visualisations based on data stored in Google spreadsheets: DataWrapper and Freedive (like my old Guardian Datastore explorer, but done properly… Wizard led UI that helps you create your own searchable and embeddable database view direct from a Google Spreadsheet).

Working With Excel Spreadsheet Files Without Using Excel…

One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV ;-) The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?

There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…

(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)

Firstly, Google Refine. Google Refine is a cross-platform, browser based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconcilation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):

Once loaded, you get a preview view..

You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.

If I then create a project around this dataset, I can trivially export it again using a format of my own preference:

So that’s one way of using Google Refine as a simple file converter service that allows you to preview and to a certain extent shape the data in XLS spreadsheet, as well as converting it to other file types.

The second approach I want to mention is to use a really handy Python software library (xlrd – Excel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.

import scraperwiki
import xlrd

#cribbing https://scraperwiki.com/docs/python/python_excel_guide/
def cellval(cell):
    if cell.ctype == xlrd.XL_CELL_EMPTY:    return None
    return cell.value

def dropper(table):
    if table!='':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

def reGrabber():
    #dropper('GPpracticeLookup')
    url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls'
    xlbin = scraperwiki.scrape(url)
    book = xlrd.open_workbook(file_contents=xlbin)

    sheet = book.sheet_by_index(0)        

    keys = sheet.row_values(8)           
    keys[1] = keys[1].replace('.', '')
    print keys

    for rownumber in range(9, sheet.nrows):           
        # create dictionary of the row values
        values = [ cellval(c) for c in sheet.row(rownumber) ]
        data = dict(zip(keys, values))
        #print data
        scraperwiki.sqlite.save(table_name='GPpracticeLookup',unique_keys=['Practice Code'], data=data)

#Uncomment the next line if you want to regrab the data from the original spreadsheet
reGrabber()

You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.

(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)

So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools other there you can appropriate to help you get the data out of the file and into a form you can work with:-)

PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile onto Max OS/X?

PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…