The inverted pyramid of data journalism

I’ve been working for some time on picking apart the many processes which make up what we call data journalism. Indeed, if you’ve read the chapter on data journalism (blogged draft) in my Online Journalism Handbook, or seen me speak on the subject, you’ll have seen my previous diagram that tries to explain those processes.

I’ve now revised that considerably, and what I’ve come up with bears some explanation. I’ve cheekily called it the inverted pyramid of data journalism, partly because it begins with a large amount of information which becomes increasingly focused as you drill down into it until you reach the point of communicating the results.

What’s more, I’ve also sketched out a second diagram that breaks down how data journalism stories are communicated – an area which I think has so far not been very widely explored. But that’s for a future post.

I’m hoping this will be helpful to those trying to get to grips with data, whether as journalists, developers or designers. This is, as always, work in progress so let me know if you think I’ve missed anything or if things might be better explained.

UPDATE: Also in Spanish.

[Image: Inverted pyramid of data journalism, Paul Bradshaw]

Here are the stages explained:

Compile

Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning. Whichever it is, the compilation of data is what defines it as an act of data journalism.

Compiling data can take various forms. At its simplest, the data might be:

  1. supplied directly to you by an organisation (how long until we see ‘data releases’ alongside press releases?);
  2. found by using advanced search techniques to plough into the depths of government websites;
  3. compiled by scraping databases hidden behind online forms or pages of results, using tools like OutWit Hub and Scraperwiki;
  4. extracted by converting documents into something that can be analysed, using tools like DocumentCloud;
  5. pulled from APIs;
  6. or collected yourself through observation, surveys, online forms or crowdsourcing.

This compilation stage is the most important, not only because everything else rests on it, but because it is probably the stage you will return to most often: at each of the subsequent stages (cleaning, contextualising, combining and communicating) you may find that you need to compile further information.
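
To make the scraping option a little more concrete, here is a minimal sketch in Python of pulling the rows out of a page of results. It uses only the standard library's html.parser; the sample page, the class name TableRows and the function scrape_table are invented for illustration, and a real page would need fetching first and would almost certainly be messier.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page of results; in practice this would be fetched HTML.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Pupils</th></tr>
  <tr><td>Hill Road Primary</td><td>210</td></tr>
  <tr><td>St Mary's</td><td>185</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The point of the sketch is only that a page of results is structured data in disguise: once the rows are in a list you can save them as a spreadsheet and move on to cleaning.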

Clean

Having data is just the beginning. Being confident in the stories hidden within it means being able to trust the quality of the data – and that means cleaning it.

Cleaning typically takes two forms: removing human error, and converting the data into a format that is consistent with other data you are using.

For example, datasets will often include some or all of the following: duplicate entries; empty entries; the use of default values to save time or where no information was held; incorrect formatting (e.g. words instead of numbers); corrupted entries or entries with HTML code; multiple names for the same thing (e.g. BBC and B.B.C. and British Broadcasting Corporation); and missing data (e.g. constituency). You can probably suggest others.

There are simple ways to clean up data in Excel or Google Docs, such as find and replace, sorting to find unusually high, low, or empty entries, and using filters so that only duplicate entries (i.e. those where a piece of data occurs more than once) are shown.
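
The same three checks can be sketched in a few lines of Python if you prefer a script to a spreadsheet. This is a rough illustration using the standard csv module; the organisation names and amounts are invented, and the column names name and amount are assumptions.

```python
import csv
import io
from collections import Counter

# Invented data: note the repeated row, the empty value and the suspicious 9999.
RAW = """name,amount
Alpha Trust,120
Beta Council,
Alpha Trust,120
Gamma Ltd,9999
Delta Group,95
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Duplicates: any complete row that occurs more than once.
counts = Counter(tuple(row.items()) for row in rows)
duplicates = [dict(key) for key, n in counts.items() if n > 1]

# Empty entries: rows where any field is blank after trimming spaces.
empties = [row for row in rows if any(not (v or "").strip() for v in row.values())]

# Sorting: order the numeric values so the extremes sit at either end.
amounts = sorted(int(row["amount"]) for row in rows if row["amount"].strip())
lowest, highest = amounts[0], amounts[-1]

print("Duplicates:", duplicates)
print("Rows with empty fields:", empties)
print("Lowest and highest amounts:", lowest, highest)
```

Nothing here decides whether 9999 is a real figure or a default value someone typed to save time; it only puts the entries worth questioning in front of you.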

Google Refine adds a lot more power: its ‘common transforms’ function will, for example, convert all entries to lowercase, uppercase or titlecase. It can remove HTML, remove spaces before and after entries (which you can’t see but which computers will see as different to the same data without a space), remove double spaces, join and split cells, and format them consistently. It will also ‘cluster’ entries and allow you to merge those which should be the same. Note: this will work for BBC and B.B.C. but not BBC and British Broadcasting Corporation, so some manual intervention is often needed.
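
To show roughly what this kind of clustering does under the bonnet, here is a minimal sketch in Python. It is an approximation of the general "key collision" idea (reduce each entry to a simplified key, then group entries that share a key) and is not Google Refine's actual code; the function names tidy, fingerprint and cluster are my own.

```python
import re
import string
from collections import defaultdict
from html import unescape


def tidy(value):
    """Strip HTML tags, decode entities, trim and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", value)
    return " ".join(unescape(no_tags).split())


def fingerprint(value):
    """Build a key that ignores case, punctuation and word order."""
    lowered = tidy(value).lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(no_punct.split())))


def cluster(values):
    """Group entries whose fingerprints collide; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


entries = ["BBC", " B.B.C. ", "<b>bbc</b>", "British Broadcasting Corporation"]
clusters = cluster(entries)
print(clusters)
```

As in the note above, the three BBC variants end up in one cluster, while the full name has a different key and is left for you to merge by hand.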

Context

Like any source, data cannot always be trusted. It comes with its own histories, biases, and objectives. So like any source, you need to ask questions of it: who gathered it, when, and for what purpose? How was it gathered (the methodology)? What exactly do they mean by that?

You will also need to understand jargon, such as codes that represent categories, classifications or locations, and specialist terminology.

All the above will most likely lead you to compile further data. For example, knowing the number of crimes in a city is interesting, but only becomes meaningful when you contextualise that alongside the population, or the numbers of police, or the levels of crime 5 years ago, or perceptions of crime, or levels of unemployment, and so on. Statistical literacy is a must here – or at least show your work to someone who has read Ben Goldacre’s book.
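
The crime example can be worked through with a little arithmetic. The sketch below is in Python and every figure in it is invented purely for illustration; the function names rate_per_1000 and percent_change are my own.

```python
def rate_per_1000(count, population):
    """Number of incidents per 1,000 residents."""
    return count * 1000 / population


def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return (new - old) * 100 / old


# Invented figures for two imaginary cities.
city_a = {"crimes": 12000, "population": 300000, "crimes_5_years_ago": 15000}
city_b = {"crimes": 9000, "population": 150000, "crimes_5_years_ago": 8000}

rate_a = rate_per_1000(city_a["crimes"], city_a["population"])
rate_b = rate_per_1000(city_b["crimes"], city_b["population"])
change_a = percent_change(city_a["crimes_5_years_ago"], city_a["crimes"])
change_b = percent_change(city_b["crimes_5_years_ago"], city_b["crimes"])

print(f"City A: {rate_a:.1f} crimes per 1,000 people, {change_a:+.1f}% over 5 years")
print(f"City B: {rate_b:.1f} crimes per 1,000 people, {change_b:+.1f}% over 5 years")
```

On the raw count City A looks worse (12,000 crimes against 9,000), but per head of population City B has the higher rate, and its crime has risen while City A's has fallen. Same data, very different story once the context is added.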

Having a clear question at the start of the whole process, by the way, helps ensure you don’t lose your focus at this point, or miss an interesting angle.

Combine

Good stories can be found in a single dataset, but often you will need to combine two. After all, given the choice between a single-source story and a multiple-source one, which would you prefer?

The classic combination is the maps mashup: taking one dataset and combining it with map data to provide an instant visualisation of how something is distributed in space: where are the cuts hitting hardest? Which schools are performing best? What are the most talked-about topics around the world on Twitter right now?

This is so common (largely because the Google Maps API was one of the first journalistically useful APIs) it has almost become a cliche. But still, cliches are often – if not always – effective.
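
Stripped of the mapping library, a maps mashup is just a join between your figures and a set of coordinates. Here is a minimal sketch in Python that produces GeoJSON, a format most web mapping tools can read. The figures are invented, the coordinates are approximate, and the function name to_geojson is my own.

```python
import json

# Invented figures (say, cuts in millions of pounds) keyed by place name.
cuts = {"Birmingham": 212, "Leeds": 90, "Unknown Town": 5}

# Approximate coordinates as (latitude, longitude), also keyed by place name.
coordinates = {"Birmingham": (52.48, -1.90), "Leeds": (53.80, -1.55)}


def to_geojson(values, coords):
    """Combine values with coordinates into a GeoJSON FeatureCollection."""
    features = []
    for place, value in values.items():
        if place not in coords:
            continue  # no known location, so it cannot be mapped
        lat, lon = coords[place]
        features.append({
            "type": "Feature",
            # GeoJSON puts longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": place, "value": value},
        })
    return {"type": "FeatureCollection", "features": features}


collection = to_geojson(cuts, coordinates)
print(json.dumps(collection, indent=2))
```

Note that the place with no coordinates silently drops off the map, which is exactly the sort of gap that sends you back to the compile stage.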

A more mundane combination is to combine two or more datasets with a common data point. That might be a politician’s name, for example, or a school, or a location.

This often means ensuring that the particular data point is formatted in the same way across each dataset.

In one, for example, the first and last names might have separate columns, but not in the other (you can concatenate or split cells to solve this).
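
Concatenating and splitting can be sketched in Python as follows. The function names are my own, and splitting on the last space is a crude assumption that will get some names wrong (for example, surnames made up of several words), so treat the results as something to check rather than trust.

```python
def join_name(first, last):
    """Concatenate separate first and last name columns into one value."""
    return f"{first.strip()} {last.strip()}"


def split_name(full):
    """Split a full name on its last space: ('Mary Ann', 'Smith')."""
    first, _, last = full.strip().rpartition(" ")
    return first, last


joined = join_name(" Mary Ann", "Smith ")
parts = split_name("Mary Ann Smith")
print(joined)
print(parts)
```

Whichever direction you go in, do it to one dataset only, so that both end up with the name in the same shape before you try to match them.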

Or you might have local authority names in one, but local authority codes in another (find another dataset that has both together and use a tool like Google Fusion Tables to merge them).
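
The same lookup-table merge can be sketched in plain Python. Everything here is invented for illustration: the authority names, the codes (which are not real local authority codes), the figures and the function name merge.

```python
# Dataset 1 identifies authorities by name.
by_name = [
    {"authority": "Northtown", "spend": 120},
    {"authority": "Southby", "spend": 80},
    {"authority": "Westham", "spend": 55},
]

# Dataset 2 identifies authorities by code.
by_code = [
    {"code": "X01", "population": 1000},
    {"code": "X02", "population": 500},
]

# The third dataset that has both name and code together.
lookup = [
    {"authority": "Northtown", "code": "X01"},
    {"authority": "Southby", "code": "X02"},
]


def merge(by_name, by_code, lookup):
    """Join the two datasets via the lookup; report names that fail to match."""
    name_to_code = {row["authority"]: row["code"] for row in lookup}
    code_to_row = {row["code"]: row for row in by_code}
    merged, unmatched = [], []
    for row in by_name:
        code = name_to_code.get(row["authority"])
        match = code_to_row.get(code)
        if match is None:
            unmatched.append(row["authority"])
        else:
            merged.append({**row, **match})
    return merged, unmatched


merged, unmatched = merge(by_name, by_code, lookup)
print(merged)
print("No match for:", unmatched)
```

Keeping a list of the rows that failed to match matters as much as the merge itself: they are usually spelling variations that need another pass through the cleaning stage.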

One might use latitude and longitude; another postcodes, or easting and northing (a postcodes API and Google Refine can help). But once you’ve got them formatted right, you may find some interesting stories or leads for further questions to ask.

Communicate

In data journalism the all-too-obvious thing to do at this point is to visualise the results – on a map, in a chart, an infographic, or an animation. But there’s a lot more here to consider – from the classic narrative, to news apps, case studies and personalisation. In fact there’s so much in this stage alone that I’ve written a separate post (diagram below). Meanwhile, comments very much welcome.

[Image: The inverted pyramid of data journalism and data journalism communication pyramid]

40 thoughts on “The inverted pyramid of data journalism”

  16. John Rennie

    The aside, ‘how long before we see data releases alongside press releases’ is interesting. As a journalist who has also written plenty of press releases, I’d observe that the point of a PR is to write the story with the angle you want … so that journos who are lazy/rushed/or simply in need of a good story are presented with just that, and change next to nothing between release and publication. The LAST thing you want the journo to do is to dig, consider and make up his/her own story/angle!

  17. DJ

    The graphics are a perfect illustration. I think data journalism is the way of the future. Although it’s true what you say about data having its own bias and context, I think it’s going to be the new way to achieve objective journalism.

