Tag Archives: adrian short

Is there a ‘canon’ of data journalism? Comment call!

Looking across the comments in the first discussion of the EJC’s data journalism MOOC it struck me that some pieces of work in the field come up again and again. I thought I’d pull those together quickly here and ask: is this the beginnings of a ‘canon’ in data journalism? And what should such a canon include? Stick with me past the first obvious examples…

Early data vis

These examples of early data visualisation are so well-known now that one book proposal I recently saw specified that it would not talk about them. I’m talking of course about… Continue reading

A case study in online journalism: investigating the Olympic torch relay

Infographic: Where did the Olympic torch relay places go? What we know so far

image by @CarolineBeavon

For the last two months I’ve been involved in an investigation which has used almost every technique in the online journalism toolbox. From its beginnings in data journalism, through collaboration, community management and SEO to ‘passive-aggressive’ newsgathering, verification and ebook publishing, it’s been a fascinating case study in such a range of ways I’m going to struggle to get them all down.

But I’m going to try.

Data journalism: scraping the Olympic torch relay

The investigation began with the scraping of the official torchbearer website. It’s important to emphasise that this piece of data journalism didn’t take place in isolation – in fact, it was while working with Help Me Investigate the Olympics’ Jennifer Jones (coordinator of #media2012, the first citizen media network for the Olympic Games) and others that I stumbled across the torchbearer data. So networks and community are important here (more later).

Indeed, it turned out that the site couldn’t be scraped through a ‘normal’ scraper, and it was the community of the Scraperwiki site – specifically Zarino Zappia – who helped solve the problem and get a scraper working. Without both of those sets of relationships – with the citizen media network and with the developer community on Scraperwiki – this might never have got off the ground.
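For readers curious what that kind of scraper looks like, here is a minimal sketch in the same spirit. The URL pattern and the CSS selectors are placeholders of my own, not the real site’s structure – the real structure was exactly what made the official site awkward to scrape, and the working version lived on Scraperwiki.

```python
# A minimal sketch of a torchbearer scraper. The URL pattern and CSS
# selectors below are hypothetical placeholders, not the real site's
# markup; the working scraper ran on Scraperwiki.
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.london2012.com/torchbearers/search?page={page}"  # placeholder

def scrape_page(page):
    """Return a list of torchbearer records from one results page."""
    html = requests.get(BASE_URL.format(page=page), timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select(".torchbearer"):  # hypothetical CSS class
        records.append({
            "name": card.select_one(".name").get_text(strip=True),
            "hometown": card.select_one(".hometown").get_text(strip=True),
            "age": card.select_one(".age").get_text(strip=True),
            "story": card.select_one(".story").get_text(strip=True),
        })
    return records

if __name__ == "__main__":
    torchbearers = []
    page = 1
    while True:
        batch = scrape_page(page)
        if not batch:  # stop when the results pages run out
            break
        torchbearers.extend(batch)
        page += 1

    with open("torchbearers.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "hometown", "age", "story"])
        writer.writeheader()
        writer.writerows(torchbearers)
```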

But it was also important to see the potential newsworthiness in that particular part of the site. Human stories were at the heart of the torch relay – not numbers. Local pride and curiosity was here – a key ingredient of any local newspaper. There were the promises made by its organisers – had they been kept?

The hunch proved correct – this dataset would just keep on giving stories.

The scraper grabbed details on around 6,000 torchbearers – out of the 8,000 places the relay was supposed to offer. I was curious why more weren’t listed: yes, around 800 invitations were reserved for high-profile torchbearers, including celebrities, who might reasonably be expected to be omitted at least until they carried the torch – but that still left over 1,000 unaccounted for.

I’ve written a bit more about the scraping and data analysis process for The Guardian and the Telegraph data blog. In a nutshell, here are some of the processes used (a rough sketch of the first steps follows the list):

  • Overview (pivot table): where do most come from? What’s the age distribution?
  • Focus on details in the overview: what’s the most surprising hometown in the top 5 or 10? Who’s oldest and youngest? What about the biggest source outside the UK?
  • Start asking questions of the data based on what we know it should look like – and hunches
  • Don’t get distracted – pick a focus and build around it.
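Here is that rough sketch of the first steps, using pandas rather than a spreadsheet pivot table, and assuming the scraped data has been saved to a torchbearers.csv with hometown, age and story columns – an assumption carried over from the sketch above, not a description of the actual working files.

```python
# A rough pandas equivalent of the overview steps, assuming the scrape
# was saved as torchbearers.csv with 'hometown', 'age' and 'story' columns.
import pandas as pd

df = pd.read_csv("torchbearers.csv")

# Overview: where do most torchbearers come from, and what is the age spread?
print(df["hometown"].value_counts().head(10))
print(pd.to_numeric(df["age"], errors="coerce").describe())

# Focus on details: the biggest sources outside the UK (a crude check).
uk_terms = "United Kingdom|England|Scotland|Wales|Northern Ireland"
overseas = df[~df["hometown"].str.contains(uk_terms, case=False, na=False)]
print(overseas["hometown"].value_counts().head(10))

# Ask questions based on hunches: how many nomination stories mention sponsors?
for sponsor in ["adidas", "BP", "ArcelorMittal"]:
    subset = df[df["story"].str.contains(sponsor, case=False, na=False)]
    print(sponsor, len(subset))
```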

This last point is notable. As I looked for mentions of Olympic sponsors in nomination stories, I started to build up subsets of the data: a dozen people who mentioned BP, two who mentioned ArcelorMittal (the CEO and his son), and so on. Each was interesting in its own way – but where should you invest your efforts?

One story had already caught my eye: it was written in the first person and talked about having been “engaged in the business of sport”. It was hardly inspirational. As it mentioned adidas, I focused on the adidas subset, and found that the same story was used by a further six people – a third of all of those who mentioned the company.

Clearly, all seven people hadn’t written the same story individually, so something was odd here. And that made this more than a ‘rotten apple’ story – it was something potentially systemic.
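A quick way to surface that kind of duplication, assuming the same hypothetical torchbearers.csv as in the earlier sketches, is simply to count how often each nomination story appears:

```python
# Count how many torchbearers share an identical nomination story.
# Assumes the same hypothetical torchbearers.csv as the earlier sketches.
import pandas as pd

df = pd.read_csv("torchbearers.csv")

story_counts = df["story"].dropna().value_counts()
repeated = story_counts[story_counts > 1]
print(repeated.head(10))

# Pull out everyone who shares the most frequently reused story.
if not repeated.empty:
    most_common = repeated.index[0]
    print(df[df["story"] == most_common][["name", "hometown"]])
```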

Signals

While the data was interesting in itself, it was important to treat it as a set of signals to potentially more interesting exploration. Seven torchbearers having the same story was one of those signals. Mentions of corporate sponsors were another.

But there were many others too.

That initial scouring of the data had identified a number of people carrying the torch who held executive positions at sponsors and their commercial partners. The Guardian, The Independent and The Daily Mail were among the first to report on the story.

I wondered if the details of any of those corporate torchbearers might have been taken off the site afterwards. And indeed they had: seven disappeared entirely (many still had a profile if you typed in the URL directly – but could not be found through search or browsing), and a further two had had their stories removed.

Now, every time I scraped details from the site I looked for those who had disappeared since the last scrape, and those that had been added late.
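That routine check is essentially a diff between snapshots. Here is a minimal sketch, assuming each scrape was saved as a dated CSV – the filenames are invented, and a profile URL (if the scraper captures one) would be a safer key than a name.

```python
# Compare two scrape snapshots to spot torchbearers who have been removed
# or added in between. Filenames are invented, and using 'name' as the key
# is an assumption; a scraped profile URL would be a safer unique identifier.
import pandas as pd

old = pd.read_csv("torchbearers-2012-06-01.csv")
new = pd.read_csv("torchbearers-2012-06-08.csv")

old_names = set(old["name"])
new_names = set(new["name"])

disappeared = old[old["name"].isin(old_names - new_names)]
added = new[new["name"].isin(new_names - old_names)]

print("Disappeared since the last scrape:")
print(disappeared[["name", "hometown"]])
print("Added since the last scrape:")
print(added[["name", "hometown"]])
```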

One, for example – who shared a name with a very senior figure at one of the sponsors – appeared just once before disappearing four days later. I wouldn’t have spotted them if they – or someone else – hadn’t been so keen on removing their name.

Another time, I noticed that a new torchbearer had been added to the list with the same story as the 7 adidas torchbearers. He turned out to be the Group Chief Executive of the country’s largest catalogue retailer, providing “continuing evidence that adidas ignored LOCOG guidance not to nominate executives.”

Meanwhile, the proportion of torchbearers running without any nomination story rose from just 2.7% in the first scrape (of 6,056 torchbearers) to 7.2% (of 6,891 torchbearers) in the last week – and to 8.1% of all torchbearers who had appeared between the two dates, including those who had appeared and then disappeared.

Many were celebrities or sportspeople where perhaps someone had taken the decision that they ‘needed no introduction’. But many also turned out to be corporate torchbearers.

By early July the number of these ‘mystery torchbearers’ had reached 500 and, having identified only a fifth of them, we published them through The Guardian datablog.

There were other signals, too, where knowing the way the torch relay operated helped.

For example, logistics meant that overseas torchbearers often carried the torch in the same location. This led to a cluster of Chinese torchbearers in Stansted, Hungarians in Dorset, Germans in Brighton, Americans in Oxford and Russians in North Wales.

As many corporate torchbearers were also based overseas, this helped narrow the search, with Germany’s corporate torchbearers in particular leading to an article in Der Tagesspiegel.

I also had the idea – thanks to Adrian Short – of totalling up how many torchbearers appeared each day, to identify days when details on unusually high numbers of torchbearers were missing. But it became apparent that variation due to other factors, such as weekends and the Jubilee, made this worthless.

However, the percentage of each day’s torchbearers who were missing stories (visualised below by Caroline Beavon) did help, as it identified days when large numbers of overseas torchbearers were carrying the torch. I cross-referenced this with the ‘mystery torchbearer’ spreadsheet to see how many had already been checked, and which days still needed attention.
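The calculation behind that chart is straightforward, assuming the scrape records which date each torchbearer carried the torch (a ‘date’ column here, which is my assumption) and that a missing story shows up as an empty cell:

```python
# Share of each day's torchbearers with no nomination story.
# Assumes a 'date' column in the scrape and empty cells for missing stories.
import pandas as pd

df = pd.read_csv("torchbearers.csv")

daily = df.groupby("date").agg(
    torchbearers=("name", "size"),
    missing_stories=("story", lambda s: s.isna().sum()),
)
daily["pct_missing"] = 100 * daily["missing_stories"] / daily["torchbearers"]

# The days with the highest percentages are the ones worth cross-referencing
# against the 'mystery torchbearer' spreadsheet.
print(daily.sort_values("pct_missing", ascending=False).head(10))
```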

But the data was just the beginning. In the second part of this case study, I talk about the verification process, SEO and collaboration.

20 free ebooks on journalism (for your Xmas Kindle)

For some reason there are two versions of this post on the site – please check the more up to date version here.

20 free ebooks on journalism (for your Xmas Kindle) {updated to 65}

Journalism 2.0 cover

As many readers of this blog will have received a Kindle for Christmas I thought I should share my list of the free ebooks that I recommend stocking up on.

Online journalism and multimedia ebooks

Starting with more general books, Mark Briggs’s book Journalism 2.0 (PDF*) is a few years old but still provides a good overview of online journalism to have by your side. Mindy McAdams’s 42-page Reporter’s Guide to Multimedia Proficiency (PDF) adds some more on that front, and Adam Westbrook’s Ideas on Digital Storytelling and Publishing (PDF) provides a larger focus on narrative, editing and other elements.

After the first version of this post, MA Online Journalism student Franzi Baehrle suggested this free book on DSLR Cinematography, as well as Adam Westbrook on multimedia production (PDF). And Guy Degen recommends the free ebook on news and documentary filmmaking from ImageJunkies.com.

The Participatory Documentary Cookbook [PDF] is another free resource on using social media in documentaries.

A free ebook on blogging can be downloaded from Guardian Students when you register with the site, and Swedish Radio have produced this guide to Social Media for Journalists (in English).

The Traffic Factories is an ebook that explores how a number of prominent US news organisations use metrics, and Chartbeat’s role in that. You can download it in mobi, PDF or epub format here.

Continue reading

How private is a tweet?

The PCC has made its first rulings on a complaint over newspapers republishing a person’s tweets. The background to this is the publication in The Daily Mail and the Independent on Sunday of tweets by civil servant Sarah Baskerville. Adrian Short sums up the stories pretty nicely: “We could be forgiven for thinking you’re trying to make the news rather than report it.”

The complaint came under the headings of privacy and accuracy. In a nutshell, the PCC have not upheld the complaints and, in the process, decided that a public Twitter account is not private. That seems fair enough. However, it is noted that “her Twitter account and her blog [which the Independent quoted from, along with her Flickr account] both included clear disclaimers that the views expressed were personal opinions and were not representative of her employer.”

The wider issue is of course about privacy as a whole, and about the relationship between our professional and private lives. The stories – as Adrian Short outlines so well – are strangely self-contained. ‘It is terrible that this civil servant has opinions and drinks occasionally, because someone like me might say that it is terrible…’

Next they’ll be saying that journalists have opinions and drink too…

Why journalists should be lobbying over police.uk’s crime data

UK police crime maps

Conrad Quilty-Harper writes about the new crime data from the UK police force – and in the process adds another straw to the groaning camel’s back of the government’s so-called transparency agenda:

“It’s useless to residents wanting to find out what was going on at the house around the corner at 3am last night, and it’s useless to individuals who want to build mobile phone applications on top of the data (perhaps to get a chunk of that £6 billion industry open data is supposed to create).

“The site’s limitations are as follows:

  • No IDs for crimes: what if I want to check whether real life crimes have made it onto the map? Sorry.
  • Six crime categories: including “other crimes”, everything from drug dealing to bank robberies in one handy, impossible to understand category.
  • No live data: you mean I have to wait until the end of the next month to see this month’s criminality?!
  • No dates or times: funny how without dates and times I can’t tell which police manager was in charge.
  • Case status: the police know how many crimes go solved or unsolved, why not tell us this?”

This is why people are so concerned about the Public Data Corporation. This is why we need to be monitoring exactly what spending data councils release, and in what format. And this is why we need to continue to press for the expansion of FOI laws. This is what we should be doing. Are we?

UPDATE: Will Perrin has FOI’d all correspondence relating to ICO advice on the crime maps. Jonathan Raper has a list of further flaws including:

  • Some data such as sexual offences and murder is removed – even though it would be easy to discover and locate from other police reports.
  • Data covers reported crimes rather than convictions, so some of it may turn out not to be crime.
  • The levels of policing are not provided, so that two areas with the “same” crime levels may in fact have “radically different” experiences of crime and policing.

Charles Arthur notes that: “Police forces have indicated that whenever a new set of data is uploaded – probably each month – the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it.”

Louise Kidney says:

“What we’ve actually got with http://www.police.uk is neither one nor the other. Ruth looks like a crime overlord cos of all the crimes happening in her garden and we haven’t got exact point data, but we haven’t got first part of postcode data either e.g. BB5 crimes or NW1 crimes. Instead, we’ve got this weird halfway house thing where it’s not accurate, but its inaccuracy almost renders it useless because we don’t have any idea if every force uses the same parameters when picking these points, we don’t know how they pick their points, we don’t know what we don’t know in terms of whether one house in particular is causing a considerable issue with anti-social behaviour for example, allowing me to go to my local Council and demand they do something about it.”

Adrian Short argues that “What we’re looking at here isn’t a value-neutral scientific exercise in helping people to live their daily lives a little more easily, it’s an explicitly political attempt to shape the terms of a debate around the most fundamental changes in British policing in our lifetimes.”

He adds:

“It’s derived data that’s already been classified, rounded and lumped together in various ways, with a bit of location anonymising thrown in for good measure. I haven’t had a detailed look at it yet but I would caution against trying to use it for anything serious. A whole set of decisions have already transformed the raw source data (individual crime reports) into this derived dataset and you can’t undo them. You’ll just have to work within those decisions and stay extremely conscious that everything you produce with it will be prefixed, “as far as we can tell”.

“£300K for this? There ought to be a law against it.”

UPDATE 2: One frustrated developer has launched CrimeSearch.co.uk to provide “helpful information about crime and policing in your area, without costing 300k of tax payers’ money”

Don't stop us digging into public spending data

A disturbing discovery by Chris Taggart last week: a number of councils in the UK are handing over their ‘open’ data to a company which only allows it to be downloaded for “personal” use.

As Chris himself points out, this runs completely against the spirit of the push to release public data in a number of ways:

  • Data cannot be used for “commercial gain”. This includes publishers wanting to present the information in ways that make most sense to the reader, and startups wanting to find innovative ways to involve people in their local area. Oh, and that whole ‘Big Society‘ stuff.
  • The way the sites are built means you couldn’t scrape this information with a computer anyway
  • It’s only a part of the data. “Download the data from SpotlightOnSpend and it’s rather different from the published data [on the Windsor & Maidenhead site]. Different in that it is missing core data that is in W&M published data (e.g. categories), and that includes data that isn’t in the published data (e.g. data from 2008).”

It’s a worrying path. As Chris sums it up: “Councils hand over all their valuable financial data to a company which aggregates for its own purposes, and, er, doesn’t open up the data, shooting down all those goals of mashing up the data, using the community to analyse and undermining much of the good work that’s been done.”

The Transparency Board quickly issued a statement about this issue saying that “urgent” measures were being taken to rectify the problem.

And Spikes Cavell, who make the software, responded in Information Age, pointing out that “it is first and foremost a spend analysis software and consultancy supplier, and that it publishes data through SpotlightOnSpend as a free, optional and supplementary service for its local government customers. The hope is that this might help the company to win business, he explains, but it is not a money-spinner in itself.”

They are now promising to make the data available for download in its “raw form”, although it’s not clear what that will be. Adrian Short’s comment to the piece is worth reading.

Nevertheless, this is an issue that anyone interested in holding power to account should keep a close eye on. And to that aim, Chris has started an investigation on Help Me Investigate to find out how and why councils are giving access to their spending data. Please join it and help here.

(Comment or email me on paul at helpmeinvestigate.com if you want an invitation.)

Data journalism pt2: Interrogating data

This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.

UPDATE: It has now been published in The Online Journalism Handbook.

“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)

Once you have the data you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals. Continue reading
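To make the sort/filter/compare point concrete, here is a tiny pandas illustration using a hypothetical council spending file – an invented example, not one from the chapter itself.

```python
# Sorting, filtering and comparing a hypothetical council spending CSV
# with 'supplier' and 'amount' columns (both invented for illustration).
import pandas as pd

spending = pd.read_csv("council-spending.csv")

# Sort: which individual payments are largest?
print(spending.sort_values("amount", ascending=False).head(10))

# Filter: payments over £10,000 to one (hypothetical) supplier.
print(spending[(spending["amount"] > 10000) & (spending["supplier"] == "Acme Ltd")])

# Compare: total spend per supplier, to see where the money concentrates.
print(spending.groupby("supplier")["amount"].sum().sort_values(ascending=False).head(10))
```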

Crowdsourcing thoughts on council newspapers: #councilpapers

The previous two posts on the role of local authorities in regional news sparked a bit of crowdsourcing on Twitter: “Do you think your council newspaper is worth having?” I asked. The responses, tagged #councilpapers, can be seen at this Twitter search. Below you will find a Wordle cloud of tagged tweets and a Twickie compilation of the first dozen or so responses.

In addition, Adrian Short suggested people bookmark council papers on Delicious with the tag ‘councilpapers’ – you can see these here. If yours isn’t listed, please add it. Continue reading