Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.
I don’t know how good a statistician you need to be to tell the four datasets in Anscombe’s quartet apart from their summary statistics, but the differences are obvious to the eye once they are plotted.
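A quick sketch of the point, assuming R's built-in anscombe data frame (columns x1 to x4 and y1 to y4): the means and correlations for the four x/y pairs come out near enough identical, yet the four scatterplots look nothing alike.

```r
# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4
pairs_idx <- 1:4

# Summary statistics for each of the four x/y pairs
x_means <- sapply(pairs_idx, function(i) mean(anscombe[[paste0("x", i)]]))
y_means <- sapply(pairs_idx, function(i) mean(anscombe[[paste0("y", i)]]))
xy_cors <- sapply(pairs_idx, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# One column per pair; the columns are all but indistinguishable
print(round(rbind(x_means, y_means, xy_cors), 2))

# Now let the eyes do the work: one scatterplot per pair
old_par <- par(mfrow = c(2, 2))
for (i in pairs_idx) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```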
Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.
Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…
So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released RStudio seems to offer an easier way in, so I thought I’d go with that…
Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…
Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).
Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)
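For anyone wondering what that plot is actually summarising, here is a minimal sketch using a made-up stand-in for the data file. The driver labels A and B and all the lap times are invented, and the "Lap Time" header with a space is an assumption about the CSV. read.csv() turns that header into Lap.Time, and calling boxplot() with plot=FALSE hands back the numbers behind each box instead of drawing it.

```r
# Invented stand-in for the lap time CSV: NOT real timing data
csv_text <- "Driver,Lap Time
A,101.2
A,101.5
A,101.9
A,102.4
A,118.7
B,102.0
B,102.3
B,102.6
B,103.1
B,103.3
"

# read.csv() makes column names syntactically valid: "Lap Time" becomes "Lap.Time"
lapTimeFuel <- read.csv(text = csv_text)
print(names(lapTimeFuel))

# plot = FALSE returns the statistics the boxplot would have drawn
bp <- boxplot(Lap.Time ~ Driver, data = lapTimeFuel, plot = FALSE)
rownames(bp$stats) <- c("lower whisker", "lower hinge", "median",
                        "upper hinge", "upper whisker")
colnames(bp$stats) <- bp$names
print(bp$stats)

# Points beyond the whiskers, i.e. the ones drawn as individual outliers
print(bp$out)
```

So the "statistics" on show are just five numbers per driver plus a list of outliers; the chart does the sums and lays them side by side.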
Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)
lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)
I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-run a previously entered command by hitting the up arrow a few times, then hitting return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)
Prior programming experience suggests this should also work…
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))
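A small sanity check along those lines, again on invented numbers (the cutoff variable is just a name picked for the sketch): subset() and plain logical indexing should keep the same laps, and counting the laps at or above the cutoff per driver shows how much the filter has thrown away.

```r
# Invented lap times, for illustration only
lapTimeFuel <- data.frame(
  Driver   = rep(c("A", "B"), each = 5),
  Lap.Time = c(101.2, 101.5, 101.9, 102.4, 118.7,
               102.0, 102.3, 102.6, 103.1, 103.3)
)
cutoff <- 110

# Two ways of keeping only the laps under the cutoff
lapTime2 <- subset(lapTimeFuel, Lap.Time < cutoff)
lapTime3 <- lapTimeFuel[lapTimeFuel$Lap.Time < cutoff, ]
print(identical(lapTime2$Lap.Time, lapTime3$Lap.Time))

# How many laps did each driver lose to the filter?
removed <- table(lapTimeFuel$Driver[lapTimeFuel$Lap.Time >= cutoff])
print(removed)
```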
Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):
boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)
The release notes for the latest version of RStudio suggest that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot.
Hmm… can we set the filter value dynamically I wonder?
library(manipulate)
manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)), maxval=slider(100,140))
Seems like it…?:-) We can also combine interactive controls:
manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval),outline=outline),maxval=slider(100,140),outline = checkbox(FALSE, "Show outliers"))
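The manipulate controls only work inside RStudio, so here is a rough non-interactive equivalent for trying the same idea elsewhere: a hypothetical helper, draw_laps(), that takes the two values the slider and checkbox would otherwise supply. The data is invented again.

```r
# Invented lap times, for illustration only
lapTimeFuel <- data.frame(
  Driver   = rep(c("A", "B"), each = 5),
  Lap.Time = c(101.2, 101.5, 101.9, 102.4, 118.7,
               102.0, 102.3, 102.6, 103.1, 103.3)
)

# Hypothetical helper: redraw the boxplot for a given cutoff and outlier setting
draw_laps <- function(laps, maxval, outline = FALSE) {
  kept <- subset(laps, Lap.Time < maxval)
  boxplot(Lap.Time ~ Driver, data = kept, outline = outline)
  invisible(kept)  # hand back the filtered laps so they can be inspected
}

# "Move the slider" by hand: everything, then only laps under 110 seconds
draw_laps(lapTimeFuel, maxval = 140, outline = TRUE)
draw_laps(lapTimeFuel, maxval = 110)

# Inside RStudio the same helper could be wired to the controls, e.g.
# manipulate(draw_laps(lapTimeFuel, maxval, outline),
#            maxval = slider(100, 140),
#            outline = checkbox(FALSE, "Show outliers"))
```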
Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of charts and plots done, as well as statistical visualisations, in the RStudio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…
The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.
[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldn't stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]