Statistical analysis as journalism – Benford's law

drug-related murder map

I’m always on the lookout for practical applications of statistical analysis for doing journalism, so this piece of work by Diego Valle-Jones, on drug-related murders, made me very happy.

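I’ve heard of the first-digit law (also known as Benford’s law) before – it’s a way of spotting dodgy data. In many naturally occurring sets of numbers the leading digit is 1 about 30% of the time and 9 less than 5% of the time, so figures that stray far from that pattern deserve a closer look.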

What Diego Valle-Jones has done is use the method to highlight discrepancies in information on drug-related murders in Mexico. Or, as Pete Warden explains:

“With the help of just Benford’s law and data sets to compare he’s able to demonstrate how the police are systematically hiding over a thousand murders a year in a single state, and that’s just in one small part of the article.”

Diego takes up the story:

“The police records and the vital statistics records are collected using different methodologies: vital statistics from the INEGI [the statistical agency of the Mexican government] are collected from death certificates and the police records from the SNSP are the number of police reports (“averiguaciones previas”) for the crime of murder—not the number of victims. For example, if there happened to occur a particular heinous crime in which 15 teens were massacred, but only one police report were filed, all the murders would be recorded in the database as one. But even taking this into account, the difference is too high.

“You could also argue that the data are provisional—at least for 2008—but missing over a thousand murders in Chihuahua makes the data useless at the state level. I could understand it if it was an undercount by 10%–15%, or if they had added a disclaimer saying the data for Chihuahua was from July, but none of that happened and it just looks like a clumsy way to lie. It’s a pity several media outlets and the UN homicide statistics used this data to report the homicide rate in Mexico is lower than it really is.”

But what brings the data alive is Diego’s knowledge of the issue. In one passage he checks against large massacres since 1994 to see if they were recorded in the database. One of them – the Acteal Massacre (“45 dead, December 22, 1997”) – is not there. This, he says, was “committed by paramilitary units with government backing against 45 Tzotzil Indians … According to the INEGI there were only 2 deaths during December 1997 in the municipality of Chenalho, where the massacre occurred. What a silly way to avoid recording homicides! Now it is just a question of which data is less corrupt.”

The post as a whole is well worth reading in full, both as a fascinating piece of journalism, and a fascinating use of a range of statistical methods. As Pete says, it is a wonder this guy doesn’t get more publicity for his work.
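For anyone who wants to try a first-digit check themselves, here is a minimal sketch in Python. It is not Diego's code: the function names are my own, and the chi-square score is just one common way of measuring how far a set of counts drifts from the Benford pattern. A high score is a prompt to ask questions, not proof of anything.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Leading non-zero digit of a non-zero number, ignoring its sign."""
    # Scientific notation always starts with the leading digit, e.g. '4.2e-03'.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values):
    """Chi-square distance between observed leading digits and Benford's law."""
    non_zero = [v for v in values if v != 0]
    total = len(non_zero)
    counts = Counter(first_digit(v) for v in non_zero)
    score = 0.0
    for digit, share in benford_expected().items():
        expected_count = total * share
        score += (counts.get(digit, 0) - expected_count) ** 2 / expected_count
    return score


# Illustrative inputs only (not real crime data):
# powers of two are a textbook example of numbers that follow the law,
# while the integers 100-999 have evenly spread leading digits and do not.
powers_of_two = [2 ** n for n in range(200)]
evenly_spread = list(range(100, 1000))

print("powers of two:", round(benford_chi_square(powers_of_two), 2))
print("evenly spread:", round(benford_chi_square(evenly_spread), 2))
```

With nine digits there are eight degrees of freedom, so a score above roughly 15.5 would be unusual at the conventional 5% level. The test also assumes a reasonably large sample spanning several orders of magnitude, which monthly murder counts for a single small town may not provide.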

Andrew Marr fails to learn from his own history

“It is frightful that someone who is no one… can set any error into circulation with no thought of responsibility & with the aid of this dreadful disproportioned means of communication”

That’s not a quote from Andrew Marr, but Soren Kierkegaard writing about newspapers in the 19th century. Here’s another:

“I do not mean to be the slightest bit critical of TV newspeople, who do a superb job, considering that they operate under severe time constraints and have the intellectual depth of hamsters. But TV news can only present the “bare bones” of a story; it takes a newspaper, with its capability to present vast amounts of information, to render the story truly boring”

Strange that the author of one of the best histories of British journalism can fail to remember how each new platform for journalism has been greeted, and how fuzzy the concept of journalism is.

“Journalism includes drunks and dyslexics and some of the least trustworthy, wickedest people in the land … The reader doesn’t know who pretends to make the necessary phone calls, but never bothers; or that this one hates Tories and always writes them down.”

That’s a quote from Andrew Marr’s book. Here’s another:

“In a complicated, developed society, much of the most important finding out can only be done by people with narrower, sharper skills – microbiologists, meteorologists, opinion pollsters and market analysts, whose discoveries journalism simply passes on in a more popular (and generally distorted) form.”

Sounds like bloggers to me.

Marr doesn’t even need to look very far back. This fake-debate was laid to rest years ago (is anyone really claiming that citizen journalism will entirely replace professional journalism? Or still trying to compare blogging – a technical process – with journalism – a cultural construct?). As I tweeted yesterday: the year 2005 called, Andrew. They want their prejudices back.

Meanwhile, Channel 4 journalist Krishnan Guru-Murthy has written eloquently in defence of bloggers and the need to engage through social media.

Revisiting Rodolfo Walsh, father of Argentinian non fiction

For Argentinians like me, it was Rodolfo Walsh – and not Truman Capote, who published In Cold Blood almost a decade later – who invented non fiction journalism with his famous 1957 book Operación Masacre, a masterpiece of investigative journalism.

Twenty years later, on the first anniversary of Jorge Rafael Videla’s dictatorship, he was intercepted by soldiers, murdered, and his remains vanished: he became a “desaparecido”, just after delivering his Open Letter from a Writer to the Military Junta (Carta Abierta de un Escritor a la Junta Militar) to Argentine newspapers and correspondents at foreign media organizations.

OperacionMasacreBook

To commemorate his work, Alvaro Liuzzi is starting a “journalistic experiment” called Proyecto Walsh searching for an answer to an interesting question: “What would have happened if, for the research of Operacion Masacre, Rodolfo Walsh had had access to the digital tools we have today?”.

The Twitter user @rodolfowalsh is the first step of Proyecto Walsh, which will try to create a digital ecosystem in order to gather all of the research that Rodolfo accomplished 54 years ago, and remix it using the journalistic tools of today.

Local newspaper data journalism – school admissions in Birmingham

data journalism at the Birmingham Mail - school admissions data

The Birmingham Mail has been trying its hand at data journalism with school admissions data. It’s a good place to start – the topic attracts a lot of interest (and so justifies the investment of time) while people tend to be interested in more than just who finishes top and bottom of the tables (justifying the choice of medium).

The results are impressive. Applications data is plotted on a Google map on the main page, while an “interactive chart” page allows you to compare schools across various criteria, and also narrow the sample by selecting from two drop down menus (town and school).

The charts have been made in Tableau, which includes a download link at the bottom. However, you need Tableau itself (free, but PC only) to open it.

A further page features links to tables for each area. Sadly, the pages containing tables do not contain any link to the raw data. This presents an extra hurdle to users – although you can scrape the table into a Google spreadsheet using the =import formula. If you want to see how, here’s a spreadsheet I created from the data by doing just that. Click on the first cell to see the formula that generates it.
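In Google Sheets the formula in question is presumably IMPORTHTML, which takes a page address, the word "table" or "list", and the position of that table on the page. If you would rather do the same job outside a spreadsheet, the sketch below shows the idea in Python using only the standard library. The class name, function name and sample HTML are all made up for illustration; a real page would need to be downloaded first and may have messier markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, if any
        self._cell = None   # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Turn the table rows found in an HTML string into CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer).writerows(scraper.rows)
    return buffer.getvalue()


# Hypothetical sample standing in for one of the newspaper's table pages.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Places</th><th>First choices</th></tr>
  <tr><td>Hypothetical High</td><td>120</td><td>300</td></tr>
</table>
"""

print(table_to_csv(SAMPLE_HTML))
```

The resulting CSV text can be saved to a file and opened in any spreadsheet. Note that this sketch lumps every table on a page together, so a page with several tables would need extra handling.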

I asked David Higgerson, Trinity Mirror’s Head of Multimedia and the man whose name appears on the Tableau data, to explain the process behind the project. It seems the information was a combination of freely available data and that acquired via FOI.

“The Mail took the data available – number of places available, number of first choice applicants and number of total applicants – and worked out a ratio of first choice applicants per place. This is relevant to parents because councils try to allocate places to children based on preference once they’ve decided which schools a child is eligible for. Eligibility varies depending on type of school.

“The figures showed how popular faith schools were, and also how fierce competition was for places at grammar schools. That’s the story which generated most interest.

“As you’ve said on your blog, the hardest part was making the data uniform, and then making it relevant to readers.

“In print, it ran across three days. Day one was grammar schools, day two was all schools and day three revealed the catchment areas for oversubscribed schools which use distance from school to fill their last few places.

“Online, Google Fusion was used to create maps, Tableau for the interactive chart which lets people choose based on town or school, and Tableizer for the quick tables which appear in the section too. We also had a play with Scribble Maps, which we think has real potential for print/online newsrooms.”

It seems education reporter Kat Keogh deserves the credit for spotting the stories in the data, “with the usual support you’d expect in the newsroom – newsdesk etc.”

David and Anna Jeys experimented with the online presentation and others laid out the data for print.
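The calculation David describes, first choice applicants per place, is simple enough to sketch in a few lines of Python. The school names and numbers below are invented purely to show the arithmetic; they are not the Mail's figures, and the function names are my own.

```python
def first_choice_ratio(first_choice_applicants, places):
    """First-choice applicants per available place."""
    if places <= 0:
        raise ValueError("places must be a positive number")
    return first_choice_applicants / places


def rank_by_competition(schools):
    """Sort schools from most to least oversubscribed on first choices."""
    return sorted(
        schools,
        key=lambda s: first_choice_ratio(s["first_choice"], s["places"]),
        reverse=True,
    )


# Made-up example rows, not real admissions data.
SCHOOLS = [
    {"name": "Example Community School", "places": 200, "first_choice": 180},
    {"name": "Example Grammar", "places": 120, "first_choice": 540},
    {"name": "Example Faith School", "places": 60, "first_choice": 150},
]

for school in rank_by_competition(SCHOOLS):
    ratio = first_choice_ratio(school["first_choice"], school["places"])
    print(f"{school['name']}: {ratio:.2f} first-choice applicants per place")
```

A ratio above 1 means more parents named the school as first choice than it has places. A ratio below 1 does not guarantee a place, since eligibility rules differ by type of school.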

BBC new linking guidelines issued – science journals mentioned

The BBC have just emailed new linking guidelines to their staff. They stipulate that linking is “essential” to online journalism and in one slide (it’s a PowerPoint document) titled ‘If you remember nothing else’ highlight how linking will change:

What we used to do…

  • Lists of archive news stories
  • Homepages only on external websites
  • No inline linking in news stories

What we do now – think adding value…

  • Avoid news stories and link to useful stuff – analysis, explainers, Q&As, pic galleries etc
  • On external websites look beyond homepage to pages of specific relevance
  • Inline linking in news stories is OK when it’s to a primary source

Other points of note in the document include the repeated emphasis on useful deep linking, and the importance of the newstracker module (which links to coverage on other news sites). Curiously, when referring to inline links it does say that “different rules can apply” to BBC blogs – “speak to blogs team if in doubt”.

Something I did look for – and find – was a reference to linking to scientific journals. And here it is: “In news stories inline links must go to primary sources only – eg scientific journal article or policy report (1 or 2 per story; avoid intro)”

This is significant given the previous campaigning on this issue.

On the whole it’s a good set of guidance – I’ll refrain from publishing it in hope that the BBC will…

UPDATE: It seems The Guardian followed up the story and embedded the document, so here it is:

BBC guidelines for linking – Sept 2010

‘Making it findable’ – the creed of the hyperlocal blogger

I’ve written a post over at Podnosh.com (full disclosure: where I do some training and consultancy) on ‘Making it findable’ – the creed of the hyperlocal blogger, reporting on a discussion between hyperlocal bloggers and local government officials at Hyperlocal Govcamp West Midlands. The meat of what I’m saying is in the middle:

“I noticed a recurring theme from the bloggers’ perspective on their role – something unique to online journalism, and which I’ve commented on before: the duty to make things findable.

“Bloggers repeatedly referred to information about the local democratic process that was hidden away on council websites – and which they worked hard to make available and interesting to their community. Council meeting times; minutes; planning meetings.

“At one point someone said that the bloggers were there to “hold power to account”. Not always in the active sense of posing difficult questions – but also in making the invisible visible; the obscure findable.

“By doing so they are not only shedding a light on the workings of local government, but transferring power. “This is your responsibility”, it says – not “This is my story”.”

There’s a nice comment below saying it “is the closest anyone, including me – has ever got to stating what my blog is about.” Full post here.

Online journalism student RSS reader starter pack: 50 RSS feeds

Teaching has begun in the new academic year and once again I’m handing out a list of recommended RSS feeds. Last year this came in the form of an OPML file, but this year I’m using Google Reader bundles (instructions on how to create one of your own are here). There are 50 feeds in all – 5 feeds in each of 10 categories. Like any list, this is reliant on my own circles of knowledge and arbitrary in various respects. But it’s a start. I’d welcome other suggestions.

Here is the list with links to the bundles. Each list is in alphabetical order – there is no ranking:

5 of the best: Community

A link to the bundle allowing you to add it to your Google Reader is here.

  1. Blaise Grimes-Viort
  2. Community Building & Community Management
  3. FeverBee
  4. ManagingCommunities.com
  5. Online Community Strategist

5 of the best: Data

This was a particularly difficult list to draw up – I went for a mix of visualisation (FlowingData), statistics (The Numbers Guy), local and national data (CountCulture and Datablog) and practical help on mashups (OUseful). I cheated a little by moving computer assisted reporting blog Slewfootsnoop into the 5 UK feeds and 10,000 Words into Multimedia. Bundle link here.

Interview: Ton Zijlstra on open data in the EU (audio)

A couple of weeks ago I spoke at the PICNIC festival in Amsterdam. While I was there I grabbed an interview with Ton Zijlstra, who has been following open data developments across EU governments very closely. You can find the interview embedded below:

[audio:http://audioboo.fm/boos/186944-ton-zijlstra-on-open-data-in-the-eu.mp3]

Something I wrote for the Guardian Datablog (and caveats)

I’ve written a piece on ‘How to be a data journalist’ for The Guardian’s Datablog. It seems to have proven very popular, but I thought I should blog briefly about it in case you haven’t seen one of the tweets.

The post is necessarily superficial – it was difficult enough to cover the subject area for a 12,000-word book chapter, so summarising further into a 1,000-word article was almost impossible.

In the process I had to leave a huge amount out, compensating slightly by linking to webpages which expanded further.

Visualising and mashing, as the more advanced parts of data journalism, suffered most, because it seemed to me that locating and understanding data necessarily took precedence.

Heather Billings, for example, blogged about my “very British footnote [which was the] only nod to visual presentation”. If you do want to know more about visualisation tips, I wrote 1,000 words on that alone here. There’s also this great post by Kaiser Fung – and the diagram below, of which Fung says: “All outstanding charts have all three elements in harmony. Typically, a problematic chart gets only two of the three pieces right.”:

Trifecta checkup

On Monday I blogged the advice on where aspiring data journalists should start in full. There’s also the selection of passages from the book chapter linked above. And my Delicious bookmarks on data journalism, visualisation and mashups. Each has an RSS feed.

I hope that helps. If you do some data journalism as a result, it would be great if you could let me know about it – and what else you picked up.

Hyperlocal voices: Adirondack Almanack / John Warren

hyperlocal voices - Adirondack Almanack, John Warren

Following a nomination via the Online Journalism Blog Facebook group, this Hyperlocal Voices looks at a US blog: the Adirondack Almanack, which covers the rural Adirondack region of upstate New York.

Launched in 2005 out of frustration with the lack of coverage from the mainstream media, the site now boasts 20 contributors, “mostly veteran local writers, journalists, and editors and includes media professionals from local radio, magazines, and newspapers,” says founder John Warren. Here’s the full interview with John:

What made you decide to set up the blog?

The Adirondacks is home to the largest park and the largest state-level protected area in the contiguous United States (it’s also the largest National Historic Landmark). The park is over 6 million acres in size (that makes it bigger than Vermont, or Yellowstone, Yosemite, Grand Canyon, Glacier, and Great Smoky Mountains National Parks combined).

However, about half the land is publicly owned and the rest privately owned, including several villages. That mix of public and private land makes the Park a unique area and fodder for some heated discussions over sustainable development, wilderness, environmental and outdoor recreation issues. I felt strongly that local news media was not fully representing the variety of perspectives on these important issues – many of which are important in other parts of the country as well.