Tag Archives: statistics

Panini sticker albums – a great way to learn programming and statistics

1970 sticker album – image by John Cooper

When should you stop buying football stickers? I don’t mean how old should you be – but rather, at what point does the law of diminishing returns mean that it no longer makes sense to buy yet another packet of five stickers?

This was the question that struck me after seeing James Offer’s ‘How much could it cost to fill a World Cup Sticker Album?’ Continue reading
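
The full post works through the answer; as a flavour of the programming angle, here is a minimal ‘coupon collector’ simulation sketch in Python. The album and packet sizes are assumptions for illustration (the 2014 World Cup album ran to roughly 640 stickers, sold in packets of five):

    import random

    ALBUM_SIZE = 640    # assumption: roughly the 2014 World Cup album
    PACKET_SIZE = 5

    def packets_to_fill():
        """Buy packets of random stickers until every slot is filled."""
        # Simplification: a real packet contains five distinct stickers
        owned = set()
        packets = 0
        while len(owned) < ALBUM_SIZE:
            owned.update(random.randrange(ALBUM_SIZE) for _ in range(PACKET_SIZE))
            packets += 1
        return packets

    trials = [packets_to_fill() for _ in range(200)]
    print("average packets to finish:", sum(trials) / len(trials))

Run it a few times and the diminishing returns become vivid: the last few gaps in the album account for a disproportionate share of the packets bought.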

“I don’t do maths”: how j-schools teach statistics to journalists

Image by Simon Cunningham

Teresa Jolley reports from a conference on teaching statistics to journalism students

I am not a great ‘numbers’ person, but even I was surprised by the attitudes that journalism lecturers at the Statistics in Journalism conference reported in their students.

‘I don’t do numbers’ and ‘I hate maths’ were depressingly common expressions, perhaps unsurprisingly. People wanting to study journalism enjoy the use of language and rarely expect that numbers will be vital to the stories they are telling.

So those responsible for journalism education have a tricky task. A bit like providing a sweet coating for a nasty-tasting tablet, lecturers (it was said) need to be adept at finding ingenious ways to teach a practical and relevant use of numbers without ever mentioning the M (maths) or S (statistics) words. Continue reading

The US election was a wake-up call for data-illiterate journalists

So Nate Silver won in 50 states; big data was the winner; and Nate Silver and data won the election. And somewhere along the line some guy called Obama won something, too.

Elections set the pace for much of journalism’s development: predictable enough to allow for advance planning, and big enough to justify the budgets to match, they are the stage on which news organisations do their growing up in public.

For most of the past decade, those elections have been about social media: the YouTube election; the Facebook election; the Twitter election. This time, it wasn’t about the campaigning (yet) so much as it was about the reporting. And how stupid some reporters ended up looking. Continue reading

The £10,000 question: who benefits most from a tax threshold change?

UPDATE [Feb 14 2012]: Full Fact picked up the challenge and dug into the data:

“The crucial difference is in methodology – while the TPA used individuals as its basis, the IFS used households as provided by the Government data.

“This led to substantially different conclusions. The IFS note that using household income as a measure demonstrates increased gains for households with two or more earners. As they state:

“‘families with two taxpayers would gain more than families with one taxpayer, who tend to be worse off. Thus, overall, better-off families (although not the very richest) would tend to gain most in cash terms from this reform…’”

Here’s a great test for eagle-eyed journalists, tweeted by the Guardian’s James Ball. It’s a tale of two charts that claim to show the impact of a change in the income tax threshold to £10,000. Here’s the first:

Change in post-tax income as a percentage of gross income

And here’s the second:

Net impact of income tax threshold change on incomes - IFS

So: same change, very different stories. In one story (Institute for Fiscal Studies) it is the wealthiest that appear to benefit the most; but in the other (Taxpayers’ Alliance via Guido Fawkes) it’s the poorest who are benefiting.

Did you spot the difference? The different y-axis is a slight clue – the first chart covers a wider range of change – but it’s the legend that gives the biggest hint: one is measuring change as a percentage of gross income (before, well, taxes); the other as a change in net income (after tax).

James’s colleague Mary Hamilton put it like this: “4.5% of very little is of course much less than 1% of loads.” Or, more specifically: 4.6% of £10,853 (the second decile mentioned in Fawkes’ post) is £499.24; 1.1% of £47,000 (the 9th decile according to the same ONS figures) is £517. (Without raw data, it’s hard to judge what figures are being used – if you include earnings over that £47k marker then it changes things, for example, and there’s no link to the net earnings).
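
To make the denominator effect concrete, here is that arithmetic as a quick Python sketch (the incomes and percentages are the ones quoted above, not fresh data):

    # Same policy, two stories: similar cash gains look big as a share of a
    # low income and small as a share of a high one.
    second_decile_income = 10_853   # £, as quoted from Fawkes' post
    ninth_decile_income = 47_000    # £, 9th decile per the same ONS figures

    print(0.046 * second_decile_income)   # 4.6% of £10,853 -> ~£499
    print(0.011 * ninth_decile_income)    # 1.1% of £47,000 -> £517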

In a nutshell, like James, I’m not entirely sure why they differ so strikingly. So, further statistical analysis welcome.

UPDATE: Seems a bit of a Twitter fight erupted between Guido Fawkes and James Ball over the source of the IFS data. James links to this pre-election document containing the chart and this one on ‘Budget 2011’. Guido says the chart’s “projections were based on policy forecasts that didn’t pan out”. I’ve not had the chance to properly scrutinise the claims of either James or Guido. I’ve also yet to see a direct link to the Taxpayers’ Alliance data, so that is equally in need of unpicking.

In this post, however, my point isn’t to do with the specific issue (or who is ‘right’) but rather how it can be presented in different ways, and the importance of having access to the raw data to ‘unspin’ it.

A quick exercise for aspiring data journalists

A funnel plot of bowel cancer mortality rates in different areas of the UK

The latest Ben Goldacre Bad Science column provides a particularly useful exercise for anyone interested in avoiding an easy mistake in data journalism: mistaking random variation for a story (in this case about some health services being worse than others for treating a particular condition):

“The Public Health Observatories provide several neat tools for analysing data, and one will draw a funnel plot for you, from exactly this kind of mortality data. The bowel cancer numbers are in the table below. You can paste them into the Observatories’ tool, click “calculate”, and experience the thrill of touching real data.

“In fact, if you’re a journalist, and you find yourself wanting to claim one region is worse than another, for any similar set of death rate figures, then do feel free to use this tool on those figures yourself. It might take five minutes.”
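
If you want to see what such a tool is doing under the bonnet, here is a minimal funnel plot sketch in Python (the data is randomly generated for illustration, not the bowel cancer figures):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical areas: every death count is drawn from the SAME underlying
    # rate, so any differences between areas here are pure chance.
    rng = np.random.default_rng(42)
    populations = rng.integers(50_000, 1_000_000, size=30)
    overall_rate = 0.0005
    deaths = rng.binomial(populations, overall_rate)
    rates = deaths / populations

    # Control limits for a proportion: p0 +/- z * sqrt(p0 * (1 - p0) / n).
    # Small areas get a wide funnel; big areas a narrow one.
    n = np.linspace(populations.min(), populations.max(), 500)
    se = np.sqrt(overall_rate * (1 - overall_rate) / n)
    plt.scatter(populations, rates)
    plt.axhline(overall_rate, linestyle="--")
    for z in (1.96, 3.09):   # roughly 95% and 99.8% limits
        plt.plot(n, overall_rate + z * se, "k:")
        plt.plot(n, overall_rate - z * se, "k:")
    plt.xlabel("Population")
    plt.ylabel("Mortality rate")
    plt.show()

Nearly every point sits inside the funnel, even though the ‘worst’ area has a visibly higher rate than the ‘best’ – which is precisely the trap Goldacre describes.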

By the way, if you want an easy way to get that data into a spreadsheet (or any other table on a webpage), try out the =importHTML formula, as explained on my spreadsheet blog (and there’s an example for this data here).
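
If you haven’t used it before: the formula takes the page URL, the element type, and which one to grab – for example (placeholder URL, not the actual Observatories page):

    =IMPORTHTML("http://example.com/mortality-rates", "table", 1)

That pulls the first HTML table on the page straight into your spreadsheet.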

Statistics as journalism redux: Benford’s Law used to question company accounts

A year and a day ago (which is slightly eerie) I wrote about how one Mexican blogger had used Benford’s Law to spot some unreliable data on drug-related murders being used by the UN and Mexican police.

On Sunday Jialan Wang used the same technique to look at US accounting data on over 20,000 firms – and found that over the last few decades the data has become increasingly unreliable.

Deviation from Benford's Law over time

“According to Benford’s law,” she wrote, “accounting statements are getting less and less representative of what’s really going on inside of companies. The major reform that was passed after Enron and other major accounting scandals barely made a dent.”

She then drilled down into three industries: finance, information technology, and manufacturing, and here’s where it gets even more interesting.

“The finance industry showed a huge surge in the deviation from Benford’s from 1981-82, coincident with two major deregulatory acts that sparked the beginnings of that other big mortgage debacle, the Savings and Loan Crisis.  The deviation from Benford’s in the finance industry reached a peak in 1988 and then decreased starting in 1993 at the tail end of the S&L fraud wave, not matching its 1988 level until … 2008.”

Benford's law, by industry

She continues:

“The time series for information technology is similarly tied to that industry’s big debacle, the dotcom bubble. Neither manufacturing nor IT showed the huge increase and decline of the deviation from Benford’s that finance experienced in the 1980s and early 1990s, further validating the measure since neither industry experienced major fraud scandals during that period. The deviation for IT streaked up between 1998-2002 exactly during the dotcom bubble, and manufacturing experienced a more muted increase during the same period.”

The correlation and comparison add a compelling layer to the work, as Benford’s Law is a method of detecting fraud rather than proving it. As Wang writes herself:

“Deviations from Benford’s law are [here] compellingly correlated with known financial crises, bubbles, and fraud waves. And overall, the picture looks grim. Accounting data seem to be less and less related to the natural data-generating process that governs everything from rivers to molecules to cities. Since these data form the basis of most of our research in finance, Benford’s law casts serious doubt on the reliability of our results. And it’s just one more reason for investors to beware.”
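
If you want to try a crude version of the same check, here is a minimal sketch (not Wang’s exact methodology). Benford’s Law says the leading digit d of naturally occurring figures should appear with probability log10(1 + 1/d); the deviation is how far a dataset’s first-digit frequencies stray from that:

    import math
    from collections import Counter

    def first_digit(x):
        """Return the leading non-zero digit of a number."""
        s = str(abs(x)).lstrip("0.")
        return int(s[0])

    def benford_deviation(values):
        """Sum of absolute gaps between observed and expected frequencies."""
        digits = [first_digit(v) for v in values if v != 0]
        n = len(digits)
        counts = Counter(digits)
        return sum(abs(counts.get(d, 0) / n - math.log10(1 + 1 / d))
                   for d in range(1, 10))

    # Made-up figures for illustration - a real test needs thousands of values:
    print(benford_deviation([1234.5, 268.0, 19.99, 3100.0, 145.2, 872.0]))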

I love this sort of stuff, because it highlights how important it is for us to question data just as much as we question any other source, while showing just how that can be done.

It also highlights just how central that data often is to key decisions that we and our governments make. Indeed, you might suggest that financial journalists should be doing this sort of stuff routinely if they want to avoid being caught out by the next financial crisis. Oh, as well as environment reporters and crime correspondents.

Why the “Cost to the economy” of strike action could be misleading

It’s become a modern catchphrase. When planes are grounded, when cars crash, when computers are hacked, and when the earth shakes. There is, it seems, always a “cost to the economy”.

Today, with a mass strike over pensions in the UK, the cliche is brought forth again:

“The Treasury could save £30m from the pay forfeited by the striking teachers today but business leaders warned that this was hugely outbalanced by the wider cost to the economy of hundreds of thousands of parents having to take the day off.

“The British Chambers of Commerce said disruption will lead to many parents having to take the day off work to look after their children, losing them pay and hitting productivity.”

Statements like these (by David Frost, the director general, it turns out) pass unquestioned (also here, here and elsewhere), but in this case (and I wonder how many others), I think a little statistical literacy is needed.

Beyond the churnalism of ‘he said-she said’ reporting, when costs and figures are mentioned journalists should be asking to see the evidence.

Here’s the thing. In reality, most parents will have taken annual leave today to look after their children. That’s annual leave that they would have taken anyway, so is it really costing the economy any more to take that leave on this day in particular? And specifically, enough to “hugely outbalance” £30m?

Stretching credulity further is the reference to parents losing pay. All UK workers have a statutory right to 5.6 weeks of annual leave paid at their normal rate of pay. If they’ve used all that up halfway into the year (or 3 months into the financial year) – before the start of the school holidays no less – and have to take unpaid leave, then they’re stupid enough to be a cost to the economy without any extra help.

And this isn’t just a fuss about statistics: it’s a central element of one of the narratives around the strikes: that the Government are “deliberately trying to provoke the unions into industrial action so they could blame them for the failure of the Government’s economic strategy.”

If they do, it’ll be a good story. Will journalists let the facts get in the way of it?

UPDATE: An inverse – and equally dubious – claim could be made about the ‘boost’ to the economy from strike action: additional travel and food spending by those attending rallies, and childcare spending by parents who cannot take time off work. It’s like the royal wedding all over again… (thanks to Dan Thornton in the comments for starting this chain of thought)

One ambassador’s embarrassment is a tragedy, 15,000 civilian deaths is a statistic

Few things illustrate the challenges facing journalism in the age of ‘Big Data’ better than Cable Gate – and specifically, how you engage people with stories that involve large sets of data.

The Cable Gate leaks have been of a different order to the Afghanistan and Iraq war logs. Not in number (there were 90,000 documents in the Afghanistan war logs and over 390,000 in the Iraq logs; the Cable Gate documents number around 250,000) – but in subject matter.

Why is it that the 15,000 extra civilian deaths estimated to have been revealed by the Iraq war logs did not move the US authorities to shut down Wikileaks’ hosting and PayPal accounts? Why did it not dominate the news agenda in quite the same way?

Tragedy or statistic?

I once heard a journalist trying to put the number ‘£13 billion’ into context by saying: “imagine 13 million people paying £1,000 more per year” – as if imagining 13 million people was somehow easier than imagining £13bn. Comparing numbers to the size of Wales or the prime minister’s salary is hardly any better.

Generally misattributed to Stalin, the quote “The death of one man is a tragedy, the death of millions is a statistic” illustrates the problem particularly well: when you move beyond scales we can deal with on a human level, you struggle to engage people in the issue you are covering.

Research suggests this is a problem that not only affects journalism, but justice as well. In October Ben Goldacre wrote about a study that suggested “People who harm larger numbers of people get significantly lower punitive damages than people who harm a smaller number. Courts punish people less harshly when they harm more people.”

“Out of a maximum sentence of 10 years, people who read the three-victim story recommended an average prison term one year longer than the 30-victim readers. Another study, in which a food processing company knowingly poisoned customers to avoid bankruptcy, gave similar results.”

In the US, “scoreboard reporting” on gun crime – “represented by numbing headlines like ‘82 shot, 14 fatally’” – has been criticised for similar reasons:

“As long as we have reporting that gives the impression to everyone that poor, black folks in these communities don’t value life, it just adds to their sense of isolation,” says Stephen Franklin, the community media project director at the McCormick Foundation-funded Community Media Workshop, where he led the “We Are Not Alone” campaign to promote stories about solution-based anti-violence efforts.

“What do we want people to know? Are we just trying to tell them to avoid the neighborhoods with many homicides?” asks Natalie Moore, the South Side Bureau reporter for Chicago Public Radio. “I’m personally struggling with it. I don’t know what the purpose is.”

Salience

This is where journalists play a particularly important role. Kevin Marsh, writing about Wikileaks on Sunday, argues that

“Whistleblowing that lacks salience does nothing to serve the public interest – if we mean capturing the public’s attention to nurture its discourse in a way that has the potential to change something material.”

He is right. But Charlie Beckett, in the comments to that post, points out that Wikileaks is not operating in isolation:

“Wikileaks is now part of a networked journalism where they are in effect, a kind of news-wire for traditional newsrooms like the New York Times, Guardian and El Pais. I think that delivers a high degree of what you call salience.”

This is because last year Wikileaks realised that they would have much more impact working in partnership with news organisations than releasing leaked documents to the world en masse. It was a massive move for Wikileaks, because it meant re-assessing a core principle of openness to all, and taking on a more editorial role. But it was an intelligent move – and undoubtedly effective. The Guardian, Der Spiegel, New York Times and now El Pais and Le Monde have all added salience to the leaks. But could they have done more?

Visualisation through personalisation and humanisation

In my series of posts on data journalism I identified visualisation as one of four interrelated stages in its production. I think that this concept needs to be broadened to include visualisation through case studies: or humanisation, to put it more succinctly.

There are dangers here, of course. Firstly, that humanising a story makes it appear to be an exception (one person’s tragedy) rather than the rule (thousands suffering) – or simply emotive rather than also informative; and secondly, that your selection of case studies does not reflect the more complex reality.

Ben Goldacre – again – explores this issue particularly well:

“Avastin extends survival from 19.9 months to 21.3 months, which is about 6 weeks. Some people might benefit more, some less. For some, Avastin might even shorten their life, and they would have been better off without it (and without its additional side effects, on top of their other chemotherapy). But overall, on average, when added to all the other treatments, Avastin extends survival from 19.9 months to 21.3 months.

“The Daily Mail, the Express, Sky News, the Press Association and the Guardian all described these figures, and then illustrated their stories about Avastin with an anecdote: the case of Barbara Moss. She was diagnosed with bowel cancer in 2006, had all the normal treatment, but also paid out of her own pocket to have Avastin on top of that. She is alive today, four years later.

“Barbara Moss is very lucky indeed, but her anecdote is in no sense whatsoever representative of what happens when you take Avastin, nor is it informative. She is useful journalistically, in the sense that people help to tell stories, but her anecdotal experience is actively misleading, because it doesn’t tell the story of what happens to people on Avastin: instead, it tells a completely different story, and arguably a more memorable one – now embedded in the minds of millions of people – that Roche’s £21,000 product Avastin makes you survive for half a decade.”

Broadcast journalism – with its regulatory requirement for impartiality, often interpreted in practical terms as ‘balance’ – is particularly vulnerable to this. Here’s one example – a video embedded in the original post – of how the homeopathy debate is given over to one person’s experience for the sake of balance.

Journalism on an industrial scale

The Wikileaks stories are journalism on an industrial scale. The closest equivalent I can think of was the MPs’ expenses story which dominated the news agenda for 6 weeks. Cable Gate is already on Day 9 and the wealth of stories has even justified a live blog.

With this scale comes a further problem: cynicism and passivity; Cable Gate fatigue. In this context online journalism has a unique role to play which was barely possible previously: empowerment.

3 years ago I wrote about 5 Ws and a H that should come after every news story. The ‘How’ and ‘Why’ of that are possibilities that many news organisations have still barely explored. ‘Why should I care?’ is about a further dimension of visualisation: personalisation – relating information directly to me. The Guardian moves closer to this with its searchable database, but I wonder at what point processing power, tools, and user data will allow us to do this sort of thing more effectively.

‘How can I make a difference?’ is about pointing users to tools – or creating them ourselves – where they can move the story on by communicating with others, campaigning, voting, and so on. This is a role many journalists may be uncomfortable with because it raises advocacy issues, but then choosing to report on these stories, and how to report them, raises the same issues; linking to a range of online tools need not be any different. These are issues we should be exploring, ethically.

All the above in one sentence

Somehow I’ve ended up writing over a thousand words on this issue, so it’s worth summing it all up in a sentence.

Industrial scale journalism using ‘big data’ in a networked age raises new problems and new opportunities: we need to humanise and personalise big datasets in a way that does not detract from the complexity or scale of the issues being addressed; and we need to think about what happens after someone reads a story online and whether online publishers have a role in that.

CCTV spending by councils/how many police officers would that pay? – statistics in context

News organisations across the country will today be running stories based on a report by Big Brother Watch into the amount spent on CCTV surveillance by local authorities (PDF). The treatment of this report is a lesson in how journalists approach figures, and why context matters more than raw numbers.

BBC Radio WM, for example, led this morning on the fact that Birmingham topped the table of spending on CCTV. But Birmingham is the biggest local authority in the UK by some distance, so this fact alone is not particularly newsworthy – unless, of course, you omit this fact, or don’t allow anyone from the council to point it out (ahem).

Much more interesting was the fact that the second biggest spender was Sandwell – also in the Radio WM region. Sandwell spent half as much as Birmingham – but its population is less than a third the size of its neighbour. Put another way, Sandwell spent 80% more per head of population than Birmingham on CCTV (£18 compared to Birmingham’s £10 per head).

Being on a deadline wasn’t an issue here: that information took me only a few minutes to find and work out.
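
For the record, the sum is nothing more exotic than dividing spend by population – here it is as a tiny Python sketch, using the per-head figures quoted above:

    # Normalising by population: Sandwell's £18 per head vs Birmingham's £10.
    birmingham_per_head = 10   # £ per head, as quoted above
    sandwell_per_head = 18
    print(f"{sandwell_per_head / birmingham_per_head - 1:.0%} more per head")   # -> 80%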

The Press Association’s release on the story focused on the Birmingham angle too – taking the Big Brother Watch statements and fleshing them out with old quotes from those involved in the last big Birmingham surveillance story – the Project Champion scheme – before ending with a top ten list of CCTV spenders.

The Daily Mail, which followed a similar line, at least managed to mention that some smaller authorities (Woking and Breckland) had spent rather a lot of money considering their small populations.

There’s a spreadsheet of populations by local authority here.

How many police officers would that pay for?

A few outlets also repeated the assertions on how many nurses or police officers the money spent on surveillance would have paid for.

The Daily Mail quoted the report as saying that “The price of providing street CCTV since 2007 would have paid for more than 13,500 police constables on starting salaries of just over £23,000”. The Birmingham Mail, among others, noted that it would have paid the salaries of more than 15,000 nurses.

And here we hit a second problem.

The £314m spent on CCTV since 2007 would indeed pay for 13,500 police officers on £23,000 – but only for one year. On an ongoing basis, it would have paid the wages of 4,500 police officers (it should also be pointed out that the £314m figure only covered 336 local authorities – the CCTV spend of those who failed to respond would increase this number).

Secondly, wages are not the only cost of employment, just as installation is not the only cost of CCTV. The FOI request submitted by Big Brother Watch is a good example of this: not only do they ask for installation costs, but operation and maintenance costs, and staffing costs – including pension liabilities and benefits.

There’s a great ‘Employee True Cost Calculator‘ on the IT Centa website which illustrates this neatly: you have to factor in national insurance, pension contributions, overheads and other costs to get a truer picture.
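
Putting those two corrections together in a quick sketch (the employer national insurance, pension and overhead rates here are illustrative assumptions, not the calculator’s actual figures):

    total_spend = 314_000_000   # £ spent on CCTV since 2007, per the report
    salary = 23_000             # £ starting constable salary, as quoted
    years = 3                   # rough span of 'since 2007' at the time

    officer_years = total_spend / salary
    print(round(officer_years))           # ~13,650 one-year salaries
    print(round(officer_years / years))   # ~4,550 officers employed throughout

    # Salary is not the whole cost: loading it with assumed employer NI,
    # pension and overheads shrinks the headline number further.
    true_cost = salary * (1 + 0.10 + 0.15)
    print(round(total_spend / (true_cost * years)))   # ~3,640 officers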

Don’t blame Big Brother Watch

Big Brother Watch’s report is a much more illuminating, and statistically aware, read than the media coverage. Indeed, there’s a lot more information about Sandwell Council’s history in this area which would have made for a better lead story on Radio WM, juiced up the Birmingham Mail report, or just made for a decent story in the Express and Star (which instead simply ran the PA release – UPDATE: they led the print edition with a more in-depth story, which was then published online later; see comments).

There’s also more about spending per head, comparisons between councils of different sizes, and between spending on other things*, and spending on maintenance, staffing (where Sandwell comes top) and new cameras – but it seems most reporters didn’t look beyond the first page, and the first name on the leaderboard.

It’s frustrating to see news organisations pass over important stories such as that in Sandwell for the sake of filling column inches and broadcast time with the easiest possible story to write. The result is a homogeneous and superficial product: a perfect example of commodified news.

I bet the people at Big Brother Watch are banging their heads on their desks to see their digging reported with so little depth. And I think they could learn something from Wikileaks on why that might be: they gave it to all the media at the same time.

Wikileaks learned a year ago that this free-to-all approach reduced the value of the story, and consequently the depth with which it was reported. But by partnering with one news organisation in each country Wikileaks not only had stories treated more seriously, but other news organisations chasing new angles jealously.

*While we’re at it, the report also points out that the UK spends more on CCTV per head than 38 countries do on defence, and 5 times more in total than Uganda spends on health. “UK spends more on CCTV than Bangladesh does on defence” has a nice ring to me. That said, those defence spending figures turn out to be from 2004 and earlier, and so are not exactly ideal (Wolfram Alpha is a good place to get quick stats like this – and suggests a much higher per capita spend).

Statistics and data journalism: seasonal adjustment for journalists

seasonal adjustment image from Junk Charts

When you start to base journalism around data it’s easy to overlook basic weaknesses in that data – from the type of average that is being used, to distribution, sample size and statistical significance. Last week I wrote about inflation and average wages. A similar factor to consider when looking at any figures is seasonal adjustment.

Kaiser Fung recently wrote a wonderful post on the subject:

“What you see [in the image above] is that almost every line is an inverted U. This means that no matter what year, and what region, housing starts peak during the summer and ebb during the winter.

“So if you compare the June starts with the October starts, it is a given that the October number will be lower than June. So reporting a drop from June to October is meaningless. What is meaningful is whether this year’s drop is unusually large or unusually small; to assess that, we have to know the average historical drop between October and June.

“Statisticians are looking for explanations for why housing starts vary from month to month. Some of the change is due to the persistent seasonal pattern. Some of the change is due to economic factors or other factors. The reason for seasonal adjustments is to get rid of the persistent seasonal pattern, or put differently, to focus attention on other factors deemed more interesting.

“The bottom row of charts above contains the seasonally adjusted data (I have used the monthly rather than annual rates to make it directly comparable to the unadjusted numbers.)  Notice that the inverted U shape has pretty much disappeared everywhere.”

The first point is not to think you’ve got a story because house sales are falling this winter – they might fall every winter. In fact, for all you know they may be falling less dramatically than in previous years.

The second point is to be aware of whether the figures you are looking at have been seasonally adjusted or not.

The final – and hardest – point is to know how to seasonally adjust data if you need to.

For that last point you’ll need to go elsewhere on the web. This page on analysing time series takes you through the steps in Excel nicely. And Catherine Hood’s tipsheet on doing seasonal adjustment on a short time series in Excel (PDF) covers a number of different types of seasonal variation. For more on how and where seasonal adjustment is used in UK government figures check out the results of this search (adapt for your own country’s government domain).
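
And if you work in Python rather than Excel, here is a minimal sketch of the classical ratio-to-moving-average approach on made-up monthly data (statistical agencies use more sophisticated methods, such as X-12-ARIMA):

    import numpy as np
    import pandas as pd

    # Made-up monthly series with a built-in summer peak, three years long
    idx = pd.date_range("2008-01-01", periods=36, freq="MS")
    noise = np.random.default_rng(1).normal(0, 5, 36)
    raw = pd.Series(100 + 30 * np.sin((idx.month - 4) / 12 * 2 * np.pi) + noise,
                    index=idx)

    # 1. A centred 12-month moving average estimates the trend
    trend = raw.rolling(12, center=True).mean()

    # 2. Actual divided by trend isolates the seasonal (plus irregular) part
    ratio = raw / trend

    # 3. Averaging those ratios by calendar month gives seasonal factors
    factors = ratio.groupby(ratio.index.month).mean()

    # 4. Dividing each month by its factor strips out the seasonal pattern
    adjusted = raw / factors.loc[raw.index.month].values
    print(adjusted.round(1))

Step 3 is the key idea: the factor for, say, June captures how far a typical June sits above trend, so dividing it out leaves only the movements that are actually worth reporting.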