Following my post on data journalism teaching, fellow trainer Peter Verweij got in touch to share a post which first appeared on his blog earlier this month. I’m reproducing it here with permission. A Dutch version is also available here.
In a guest post for OJB, Natalia Karbasova explains how, with no coding experience, she used German carpool data as the basis of a data visualisation project.
Some time ago I was working on a new blog on the sharing economy, lets-share.de. It was high time to add some data-driven stories visualising important issues of the sharing economy, which change our lives.
Mitfahrgelegenheit.de is the popular German version of Carpooling.com. I decided to create a visualization which would show carpooling patterns between cities in Germany and, possibly, reveal hidden connections.
Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the most widely used graphical devices in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:
Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:
Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.
I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:
One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.
A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):
(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)
All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
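As a rough sketch of how these views fall out of lap time data, the following Python derives a lap chart, a simple race summary and a race history series. The driver labels and times are made up, every driver is assumed to complete every lap, and the race history calculation (the winner's mean lap time multiplied by laps completed, minus the driver's elapsed time) is one common formulation that I am assuming here rather than the definitive one.

```python
from itertools import accumulate

# Made-up lap times in seconds for three hypothetical drivers.
lap_times = {
    "A": [92.0, 90.0, 91.0],
    "B": [91.0, 92.5, 90.0],
    "C": [93.0, 91.0, 95.0],
}

def elapsed(lap_times):
    """Cumulative race time for each driver at the end of each lap."""
    return {d: list(accumulate(t)) for d, t in lap_times.items()}

def lap_chart(lap_times):
    """Position held by each driver at the end of each lap (1 = leader)."""
    cum = elapsed(lap_times)
    n_laps = min(len(t) for t in cum.values())
    chart = {d: [] for d in cum}
    for lap in range(n_laps):
        order = sorted(cum, key=lambda k: cum[k][lap])
        for pos, d in enumerate(order, start=1):
            chart[d].append(pos)
    return chart

def race_summary(chart):
    """End-of-first-lap, final, best and worst position for each driver."""
    return {d: {"first_lap": p[0], "final": p[-1], "best": min(p), "worst": max(p)}
            for d, p in chart.items()}

def race_history(lap_times):
    """Per lap: (winner's mean lap time x laps completed) - driver's elapsed time."""
    cum = elapsed(lap_times)
    winner = min(cum, key=lambda k: cum[k][-1])
    mean_lap = cum[winner][-1] / len(cum[winner])
    return {d: [mean_lap * (i + 1) - t for i, t in enumerate(times)]
            for d, times in cum.items()}

chart = lap_chart(lap_times)
summary = race_summary(chart)
history = race_history(lap_times)
```

Grid (start) positions are not recoverable from lap times alone, so the summary here only covers positions from the end of the first lap onwards.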
A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:
Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.
[Here’s an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they’d done similar views in previous years using a Flash component, but the current iteration uses d3.js]
The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.
Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:
(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)
Of course, without the data, it can be difficult trying to generate these charts…
…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)
PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.
Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.
Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)
Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.
Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.
The data I have scraped is represented using rows of the form:
Country, Event, Gold, Silver, Bronze
where Event is at the level of “Swimming”, “Cycling” etc rather than at finer levels of detail (it’s really hard finding data at even this level of detail in an easily grabbable way).
I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:
- MedalType, Country, Event
- MedalType, Event, Country
- Event, MedalType, Country
- Event, Country, MedalType
- Country, MedalType, Event
- Country, Event, MedalType
Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.
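To make the six orderings concrete, here is a minimal Python sketch that takes rows in the Country, Event, Gold, Silver, Bronze form and produces, for any ordering, a list of [node id, parent id, size] rows of the general kind a treemap component expects. The rows are made up (not real results), and the function names, the "Medals" root label and the " / " path separator are my own choices for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Made-up rows in the Country, Event, Gold, Silver, Bronze form (not real results).
rows = [
    ("Country A", "Swimming", 2, 1, 0),
    ("Country A", "Cycling", 0, 1, 1),
    ("Country B", "Swimming", 1, 0, 2),
]

def melt(rows):
    """Yield one record per (country, event, medal type) with a non-zero count."""
    for country, event, gold, silver, bronze in rows:
        for medal, n in (("Gold", gold), ("Silver", silver), ("Bronze", bronze)):
            if n:
                yield {"Country": country, "Event": event, "MedalType": medal, "n": n}

def treemap_rows(rows, order, root="Medals"):
    """Return [node id, parent id, medal count] rows for one hierarchy ordering."""
    totals = defaultdict(int)
    for rec in melt(rows):
        path = (root,)
        for level in order:
            path = path + (rec[level],)
            totals[path] += rec["n"]  # accumulate the count at every depth
    out = [[root, None, sum(r["n"] for r in melt(rows))]]
    for path in sorted(totals):
        # Node ids are full paths so that repeated labels stay unique.
        out.append([" / ".join(path), " / ".join(path[:-1]), totals[path]])
    return out

# The six ways of arranging the three levels.
orderings = list(permutations(["MedalType", "Country", "Event"]))
medal_first = treemap_rows(rows, ("MedalType", "Country", "Event"))
```

Using the full path as the node id is a deliberate choice: a label such as "Swimming" appears under several parents, and a treemap that keys nodes by id needs each one to be distinct.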
First up, ordered by Medal, Country, Event:
This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:
The colouring is a bit off – the Google component is not as directly scriptable as a d3js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.
The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (i.e. we’re happy to take a point of view based on the award of any medal type):
and the Country Medal Event view allows us to then tunnel in on the gold winning events:
I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.
It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as using the Google chart component for quick doodles of simpler treemaps that don’t reveal lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)
Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events, such as 100m and Long Jump rather than Track and Field, as well as the names of individual medalists).
PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances
PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship
Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.
Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:
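For readers who want a feel for the sort of calculation behind these two views, here is a small Python sketch in the spirit of commonality and comparison clouds. It is not the wordcloud library's code: the scoring rules (minimum follow rate across parties for commonality, largest excess over the cross-party mean for comparison) are assumptions of mine, and the accounts and follow lists are made up.

```python
from collections import Counter

# Made-up data: for each party, the set of accounts each MP follows.
friends_by_party = {
    "Con": [{"news_desk", "blog_x"}, {"news_desk", "blog_x"}],
    "Lab": [{"news_desk", "blog_y"}, {"blog_y"}],
    "LD": [{"news_desk", "blog_z"}],
}

def follow_rates(friends_by_party):
    """Proportion of each party's MPs who follow each account."""
    rates = {}
    for party, friend_sets in friends_by_party.items():
        counts = Counter(acct for friends in friend_sets for acct in friends)
        rates[party] = {acct: n / len(friend_sets) for acct, n in counts.items()}
    return rates

def commonality(rates):
    """Score each account by its lowest follow rate across all parties."""
    accounts = set().union(*rates.values())
    scores = {a: min(r.get(a, 0.0) for r in rates.values()) for a in accounts}
    return {a: s for a, s in scores.items() if s > 0}

def comparison(rates):
    """Assign each account to the party whose rate most exceeds the mean rate."""
    accounts = set().union(*rates.values())
    out = {}
    for a in accounts:
        mean = sum(r.get(a, 0.0) for r in rates.values()) / len(rates)
        party = max(rates, key=lambda p: rates[p].get(a, 0.0) - mean)
        out[a] = (party, rates[party].get(a, 0.0) - mean)
    return out

rates = follow_rates(friends_by_party)
common = commonality(rates)
distinct = comparison(rates)
```

Working with proportions of each party's MPs, rather than raw counts, is one simple way of softening the effect of the groups being different sizes.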
There’s still a fair bit to do to make the methodology robust: for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the sets differs to a significant extent (there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective, about the way different groups differentially follow folk on Twitter.
A couple of other quirks I’ve noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear) – there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud.
(So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman, and @davegorman appeared as a small label in the friends part of the comparison.cloud, notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do… What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed because it would be too large).)
That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets? Suggestions welcome as to what they might be…:-)
PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…
One of the easiest ways to get someone started on data visualisation is to introduce them to word clouds (it also demonstrates neatly how not all data is numerical).
Using tools like Wordle and Tagxedo, you can paste in a major speech and see it visualised within a minute or so.
But is a word cloud the best way of visualising speeches? The New York Times appear to think otherwise. Their visualisation (above) comparing President Obama’s State of the Union address and speeches by Republican presidential candidates chooses to use something far less fashionable: the bar chart.
Why did they choose a bar chart? The key is the purpose of the chart: comparison. If your objective is to capture the spirit of a speech, or its key themes, then a word cloud can still work well, if you clean the data (see this interactive example that appeared on the New York Times in 2009).
But if you want to compare it to speeches of others – and particularly if you want to compare on specific issues such as employment or tax – then bar charts are a better choice. Compare, for example, ReadWriteWeb’s comparison of inaugural speeches, and how effective that is compared to the bar charts.
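As a minimal illustration of the comparison approach, the Python sketch below counts how often chosen issue words occur per 1,000 words in each speech and prints a plain-text bar for each speaker. The snippets are made-up stand-ins for real transcripts, the function names are my own, and the terms are assumed to be given in lower case.

```python
import re
from collections import Counter

def term_rates(text, terms, per=1000):
    """Occurrences of each chosen (lower case) term per `per` words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {t: per * counts[t] / total for t in terms}

def text_bars(rates_by_speaker, term, width=40):
    """Plain-text horizontal bars comparing one term across speakers."""
    top = max(r[term] for r in rates_by_speaker.values()) or 1
    lines = []
    for speaker, rates in rates_by_speaker.items():
        bar = "#" * round(width * rates[term] / top)
        lines.append(f"{speaker:<12}{bar} {rates[term]:.1f}")
    return "\n".join(lines)

# Made-up snippets standing in for full speech transcripts.
speeches = {
    "Speaker A": "jobs and more jobs and lower tax",
    "Speaker B": "tax tax tax and some jobs",
}
rates = {name: term_rates(text, ["jobs", "tax"]) for name, text in speeches.items()}
print(text_bars(rates, "tax"))
```

Normalising to a rate per 1,000 words matters because speeches differ in length; raw counts would flatter the longer speech.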
In short, don’t always reach for the obvious chart type – and be clear what you’re trying to communicate.
In a move that seemed to upset collectors of UK ministerial meeting data, @whoslobbying, on grounds of wasted effort, the Guardian datastore published a spreadsheet last night containing data relating to ministerial meetings between May 2010 and March 2011.
(The first release of the spreadsheet actually omitted the column containing who the meeting was with, but that seems to be fixed now… There are, however, still plenty of character encoding issues (apostrophes, accented characters, some sort of em-dash, etc) that might cripple some plug and play tools.)
Looking over the data, we can use it as the basis for a network diagram with actors (Ministers and lobbyists) as nodes and edges representing meetings between Ministers and lobbyists. There is one slight complication in that where there is a meeting between a Minister and several lobbyists, we ideally need to separate out the separate lobbyists into their own nodes.
This probably provides an ideal opportunity to have a play with the Stanford Data Wrangler and try forcing these separate lobbyists onto separate rows, but I didn’t allow myself much time for the tinkering (and the requisite learning!), so I resorted to a Python script to read in the data file and split out the different lobbyists. (I also did an iterative step, cleaning the downloaded CSV file in a text editor by replacing nasty characters that caused the script to choke.) You can find the script here (note that it makes use of the networkx network analysis library, which you’ll need to install if you want to run the script).
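The splitting step can be sketched with the standard library alone. This is not the linked script: the column names, the sample rows and the choice of commas and semicolons as separators are all assumptions for illustration, and the real spreadsheet may well differ.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Made-up sample in an assumed layout; real column names and separators may differ.
SAMPLE = """Minister,Organisation
Minister A,"Org X, Org Y"
Minister A,Org X
Minister B,Org X; Org Z
"""

def split_orgs(cell):
    """Split a multi-organisation cell on commas or semicolons (assumed separators)."""
    return [part.strip() for part in re.split(r"[;,]", cell) if part.strip()]

def meeting_edges(csv_file):
    """Count Minister -> organisation meetings, one edge per organisation."""
    edges = Counter()
    for row in csv.DictReader(csv_file):
        for org in split_orgs(row["Organisation"]):
            edges[(row["Minister"].strip(), org)] += 1
    return edges

def ministers_met(edges):
    """Number of distinct Ministers each organisation met."""
    met = defaultdict(set)
    for minister, org in edges:
        met[org].add(minister)
    return {org: len(ministers) for org, ministers in met.items()}

edges = meeting_edges(io.StringIO(SAMPLE))
degree = ministers_met(edges)
```

The edge counts can serve as edge weights, and the per-organisation tally is the sort of figure a "met at least N Ministers" filter relies on. Splitting on commas will wrongly break up any organisation whose name itself contains a comma, so the separator rule would need checking against the actual data.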
The script generates a directed graph with links from Ministers to lobbyists and dumps it to a GraphML file (available here) that can be loaded directly into Gephi. Here’s a view – using Gephi – of the heart of the network. If we filter the graph to show nodes that met with at least five different Ministers…
we can get a view into the heart of the UK lobbying network:
I sized the lobbyist nodes according to eigenvector centrality, which gives an indication of how well connected they are in the network.
One of the nice things about Gephi is that it allows for interactive exploration of a graph. For example, I can hover over a lobbyist node – Barclays in this case – to see which Ministers were met:
Alternatively, we can see who of the well connected met with the Minister for Welfare Reform:
Looking over the data, we also see how some Ministers are inconsistently referenced within the original dataset:
Note that the layout algorithm is such that the different representations of the same name are likely to meet similar lobbyists, which will end up placing the nodes in similar locations under the force directed layout I used. Which is to say – we may be able to use visual tools to help us identify fractured representations of the same individual. (Note that multiple meetings between the same parties can be visualised using the thickness of the edges, which are weighted according to the number of times the edge is described in the GraphML file…)
Unifying the different representations of the same individual is something that Google Refine could help us tidy up with its various clustering tools, although it would be nice if the Datastore folk addressed this at source (or at least, as part of an ongoing data quality enhancement process…;-)
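To give a flavour of how key-based clustering can pull together fractured names, here is a small Python sketch of a fingerprint-style key: lower-case the name, strip accents and punctuation, then sort the distinct words. It is a simplified stand-in that I am assuming behaves roughly like this kind of clustering, not Google Refine's own code, and the names are invented.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(name):
    """Normalised key: ASCII-folded, lower case, no punctuation, sorted unique words."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = folded.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(names):
    """Group distinct spellings that share a fingerprint; singletons are dropped."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Invented names standing in for inconsistently recorded Ministers.
names = ["Lord Example", "lord  example.", "Example, Lord", "Baroness Sample"]
clusters = cluster(names)
```

A key like this only catches differences in case, punctuation, spacing and word order; variants such as an added honorific or an abbreviated first name would need a fuzzier method.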
I guess we could also try reconciling company names against universal company identifiers, for example by using Google Refine’s reconciliation service and the Open Corporates database? Hmmm, which makes me wonder: do MySociety, or Public Whip, offer an MP or Ministerial position reconciliation service that works with Google Refine?
A couple of things I haven’t done: represented the department (which could be done via a node attribute, maybe, at least for the Ministers); represented actual meetings, and what I guess we might term co-lobbying behaviour, where several organisations are in the same meeting.