UPDATE: A new version of the inverted pyramid, with resources on each stage, is now available. Also available in German, Spanish, Portuguese, Finnish, Russian and Ukrainian.
I’ve been working for some time on picking apart the many processes which make up what we call data journalism. Indeed, if you’ve read the chapter on data journalism (blogged draft) in my Online Journalism Handbook, or seen me speak on the subject, you’ll have seen my previous diagram that tries to explain those processes.
I’ve now revised that considerably, and what I’ve come up with bears some explanation. I’ve cheekily called it the inverted pyramid of data journalism, partly because it begins with a large amount of information which becomes increasingly focused as you drill down into it until you reach the point of communicating the results.
What’s more, I’ve also sketched out a second diagram that breaks down how data journalism stories are communicated – an area which I think has so far not been very widely explored. But that’s for a future post.
I’m hoping this will be helpful to those trying to get to grips with data, whether as journalists, developers or designers. This is, as always, work in progress so let me know if you think I’ve missed anything or if things might be better explained.
The inverted pyramid of data journalism
Here are the stages explained:
Compile
Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning. Whichever it is, the compilation of data is what defines it as an act of data journalism.
Compiling data can take various forms. At its simplest the data might be:
- supplied directly to you by an organisation (how long until we see ‘data releases’ alongside press releases?);
- found by using advanced search techniques to plough into the depths of government websites;
- compiled by scraping databases hidden behind online forms or pages of results, using tools like OutWit Hub and Scraperwiki;
- converted from documents into something that can be analysed, using tools like DocumentCloud;
- pulled from APIs; or
- collected yourself through observation, surveys, online forms or crowdsourcing.
This compilation stage is the most important, not only because everything else rests on it, but because it is probably the stage that is returned to the most. At each of the subsequent stages (cleaning, contextualising, combining and communicating) you may find that you need to compile further information.
Clean
Having data is just the beginning. Being confident in the stories hidden within it means being able to trust the quality of the data – and that means cleaning it.
Cleaning typically takes two forms: removing human error; and converting the data into a format that is consistent with other data you are using.
For example, datasets will often include some or all of the following: duplicate entries; empty entries; the use of default values to save time or where no information was held; incorrect formatting (e.g. words instead of numbers); corrupted entries or entries with HTML code; multiple names for the same thing (e.g. BBC and B.B.C. and British Broadcasting Corporation); missing data (e.g. constituency); mixed data in the same column; or data in the wrong shape (e.g. swapping columns and rows). You can probably suggest others.
There are simple ways to clean up data in Excel or Google Sheets such as find and replace, sorting to find unusually high, low, or empty entries, and using filters so that only duplicate entries (i.e. those where a piece of data occurs more than once) are shown. My book Finding Stories in Spreadsheets has a number of chapters covering ways to use those techniques.
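The same basic checks can also be scripted. What follows is only a minimal sketch in Python, assuming a tiny invented dataset: the rows, the column names `org` and `spend`, and the helper `tidy` are all hypothetical, and a real spreadsheet will need its own rules.

```python
from collections import Counter

# Hypothetical rows, as they might come out of a spreadsheet export.
rows = [
    {"org": "BBC", "spend": "1200"},
    {"org": " BBC ", "spend": "1200"},   # same entry with stray spaces
    {"org": "ITV", "spend": ""},          # empty entry
    {"org": "Channel 4", "spend": "twelve"},  # words instead of numbers
]

def tidy(value):
    """Strip leading/trailing spaces and collapse runs of whitespace."""
    return " ".join(value.split())

cleaned = [{key: tidy(value) for key, value in row.items()} for row in rows]

# Count each complete row so that only duplicated rows are listed.
counts = Counter(tuple(sorted(row.items())) for row in cleaned)
duplicates = [dict(items) for items, n in counts.items() if n > 1]

# Flag empty cells, and cells that should hold whole numbers but do not.
empty = [row for row in cleaned if not row["spend"]]
not_numeric = [row for row in cleaned
               if row["spend"] and not row["spend"].isdigit()]
```

Note that the two BBC rows only show up as duplicates once the stray spaces have been removed, which is why the tidying step comes first.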
Google Refine (since renamed OpenRefine) adds a lot more power: its ‘common transforms’ function will, for example, convert all entries to lowercase, uppercase or titlecase. It can remove HTML, remove spaces before and after entries (which you can’t see but which computers will see as different to the same data without a space), remove double spaces, join and split cells, and format them consistently. It will also ‘cluster’ entries and allow you to merge those which should be the same. Note: this will work for BBC and B.B.C. but not BBC and British Broadcasting Corporation, so some manual intervention is often needed.
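To see why clustering catches BBC and B.B.C. but not the full name, here is a simplified sketch of the general ‘key collision’ idea in Python. It is an illustration under my own assumptions, not Refine’s actual implementation: the function `fingerprint` and the sample names are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(name):
    """Reduce a name to a key that ignores case, punctuation and word order."""
    lowered = name.strip().lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)

names = ["BBC", "B.B.C.", "bbc ", "British Broadcasting Corporation"]

# Entries that share a fingerprint are candidates for merging.
clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

for key, members in clusters.items():
    print(key, "->", members)
```

The first three spellings collapse to the same key, while the full name produces a different key entirely, so it still has to be matched by hand.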
Context
Like any source, data provides one perspective on a story and cannot always be trusted. It comes with its own histories, biases, and objectives.
So, as with any source, you need to ask questions of it: who gathered it, when, and for what purpose? How was it gathered (the methodology)? What exactly do they mean by the terms they use?
You may also need to understand jargon, such as codes that represent categories, classifications or locations, and specialist terminology.
Establishing the context for a dataset may lead you to compile further data. For example, knowing the number of crimes reported in a city is interesting, but only becomes meaningful when you put that into context.
That context could be the size of the population, or the numbers of police, or the levels of crime five years ago. It could be the context of experiences or perceptions of crime, conviction rates, or levels of unemployment. It could be the context of definitions of crime categories.
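Putting a raw count into the context of population is simple arithmetic. The sketch below uses invented figures for two hypothetical areas (none of these numbers come from a real dataset) to show how a ranking can flip once you calculate a rate per 1,000 people.

```python
# Hypothetical figures, invented purely for illustration.
areas = {
    "Area A": {"crimes": 5000, "population": 100000},
    "Area B": {"crimes": 3000, "population": 40000},
}

def rate_per_1000(count, population):
    """Number of events per 1,000 people."""
    return count * 1000 / population

rates = {name: rate_per_1000(d["crimes"], d["population"])
         for name, d in areas.items()}

# Rank areas from the highest rate to the lowest.
ranked = sorted(rates, key=rates.get, reverse=True)
```

Area A has more reported crimes in total, but Area B comes out with the higher rate once population is taken into account.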
Some basic statistical literacy goes a long way, so spend a little time reading about it. How To Lie With Statistics is a short, easy-to-read classic in the field, and it is complemented by the more recent How to Make the World Add Up. Ben Goldacre’s book Bad Science prepares you for covering data about health and science.
Add the BBC podcast More or Less to your subscriptions to hear them discuss stories about data, and check out The Tiger That Isn’t, by the presenters.
Combine
Good stories can be found in a single dataset, but often you will need to combine two. After all, given the choice between a single-source story and a multiple-source one, which would you prefer?
Combining data is often part of putting it into context: you will regularly want to combine data on the number of events in different areas with data on the populations in those areas, for example, in order to rank the per capita frequency of those events. Data on inflation will have to be combined with spending data to adjust figures from different years.
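The inflation adjustment can be sketched in a few lines. This assumes a price index in which a higher number means higher prices; the index values, the spending figures and the function name `in_prices_of` are all hypothetical.

```python
# Hypothetical price index (2010 = 100) and spending figures in cash terms.
price_index = {2010: 100.0, 2015: 110.0, 2020: 125.0}
spending = {2010: 200.0, 2015: 231.0, 2020: 250.0}

def in_prices_of(target_year, year, amount):
    """Express an amount from `year` in the prices of `target_year`."""
    return amount * price_index[target_year] / price_index[year]

# Restate every year's spending in 2020 prices so the years are comparable.
real_spending = {year: in_prices_of(2020, year, amount)
                 for year, amount in spending.items()}
```

In these invented figures cash spending rises from 200 to 250, but in 2020 prices the first and last years are identical and 2015 is actually the peak.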
Sometimes you need to combine data to get the answer to a question:
- New data will need to be combined with older data to tell a story about change.
- Data on the performance of schools or hospitals will need to be combined with data on the locations of those institutions, in order to tell a story ranking the areas with the best or worst average performances, or simply show the extent of variation.
- To create an exploratory map showing the distribution of events, journalists regularly combine a dataset with cartographical data. Examples include: Environmental pollution in England mapped: every major and significant pollution incident since 2001; A People Map of the UK; Interactive map: sewage spills in protected areas.
Combining data normally requires that both datasets share at least one field that can be used to join the two: a column of area codes or names, for example, or institution names, or years (for inflation) in the examples above.
Sometimes you will need to return to the cleaning stage to remove inconsistencies between the same measure in two datasets: institutions or areas might be named slightly differently (one dataset using ‘and’ but the other ‘&’, for example). For this reason it’s always better to match on codes, if the datasets use them, than on names: codes are less likely to be inconsistent across different datasets.
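Here is a minimal sketch of why codes are the safer thing to match on. The two datasets, the codes (`E01` and so on) and the figures are invented for illustration. The example joins on the code column, lists anything that could not be matched, and then shows what a join on names would have missed.

```python
# Hypothetical datasets sharing an area code column.
incidents = [
    {"code": "E01", "name": "Bath and North East Somerset", "incidents": 12},
    {"code": "E02", "name": "Bristol", "incidents": 30},
    {"code": "E03", "name": "Cornwall", "incidents": 7},
]
populations = [
    {"code": "E01", "name": "Bath & North East Somerset", "population": 190000},
    {"code": "E02", "name": "Bristol", "population": 470000},
]

# Look-up table from code to population.
pop_by_code = {row["code"]: row["population"] for row in populations}

joined, unmatched = [], []
for row in incidents:
    if row["code"] in pop_by_code:
        joined.append({**row, "population": pop_by_code[row["code"]]})
    else:
        unmatched.append(row["code"])  # keep a list to investigate later

# Matching on names instead misses E01, because of 'and' versus '&'.
pop_names = {row["name"] for row in populations}
matched_by_name = [row["code"] for row in incidents if row["name"] in pop_names]
```

Keeping the list of unmatched codes matters as much as the join itself: each one is either a gap you need to go back and compile data for, or an inconsistency to clean.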
Communicate
In data journalism the all-too-obvious thing to do at this point is to visualise the results on a map, chart, or infographic.
That’s one option, but there are others to consider too, from simply narrating the findings, to humanising them with case studies, interactive personalisation, and tools that empower the audience. In fact there’s so much in this stage alone that I’ve written a separate post (diagram below). Meanwhile, I would very much welcome comments on the pyramid.



You might check the last link in the text. I found a 404 message instead of the post:
https://onlinejournalismblog.wordpress.com/2011/07/13/the-inverted-pyramid-of-data-journalism-part-2-6-ways-of-communicating-data-journalism/
Mark, that link gave me a “Page not found” message also.
Good description. The first few times I did this, I was stunned by how much effort every stage takes, and how often I had to go back, pull more data or recalculate because the datasets could not be combined due to different scales, dates or definitions. Differences between what sits behind a category in one dataset and in another are the most treacherous.
I realize journalists love the inverted pyramid (why invert?), but would a loop or circle not be more adequate to show the iterations one is likely to go through?
Paul, I found a number of non-working links. Hopped over to your blog (I’m reading via the Knight MOOC portal) and saw your note about moving the blog to another host. Since most of the errors are failure to establish a database connection, the linked files probably have not been migrated yet. Here are the missing links in the order in which they appear:
‘Blogged draft’ in the first paragraph. (db error)
Link about Google Refine’s ‘common transforms’ function in the Clean section (db error)
‘Where are the cuts hitting hardest?’ in the Combine section (404 Not Found)
‘A postcodes API and Google Refine can help’ in the Combine section (db error)
‘written a separate post’ in the Communicate section (db error)
I disagree. Your chart is just about collecting, cleaning, aggregating, summarizing, and visualizing data. I believe data-driven journalism is about making a point using data as an argument. However, usually data needs to be augmented by further assumptions and theoretical knowledge. This is where modeling comes into the picture.
So your inverted pyramid shows some skills that are helpful when working with data, but that is only the tip of the iceberg. Understanding and describing the process that generates the data is the true art.
You’ve written about revising the inverted pyramid structure for data journalism but what about other types of journalism? Does it still work for online journalism articles for example? Will there be another news structure developed for news stories? Particularly as there is more space for journalists to write an article online.
I am interested because I am a final year Communications and Media student at Bournemouth University writing a news feature about ‘the relevance of the inverted pyramid today’ and the way news organisations structure their news stories.
I would like to know more about your thoughts on the topic.
My email address is jennifer.palmer@live.co.uk.
Thank you,
I hope to hear from you soon.
Jennifer Palmer
I’m doing my undergrad thesis and this is very helpful. Thank you, Mr. Bradshaw 😊😊