Tag Archives: Hadley Wickham

I’ve updated the Inverted Pyramid of Data Journalism — and brought together resources for every stage

Inverted pyramid of data journalism: conceive, compile, clean, context, combine (with 'question' throughout). Communicate: vis, narrate, humanise, personalise, socialise, utilise

It’s over a decade since I published the Inverted Pyramid of Data Journalism. The model has been translated into multiple languages, taught all over the world, and included in a number of books and research papers. But in that time the model has also developed and changed through discussion and teaching, so here’s a round-up of everything I’ve written or recommended on the different stages, along with a revised model in English (shown above; versions have been published before in German, Russian and Ukrainian!).

The most basic change to the Inverted Pyramid of Data Journalism is the recognition of a stage that precedes all others — idea generation — labelled ‘Conceive’ in the diagram above.

This is often a major stumbling block to people starting out with data journalism, and I’ve written a lot about it in recent years (see below for a full list).

The second major change is to make questioning more explicit as a process that (should) take place throughout all stages: not just in data analysis but in the way we question our sources, our ideas, and the reliability of the data itself.

Alongside the updated pyramid I’ve been using for the past few years I also wanted to round up links to a number of resources that relate to each stage. Here they are…


What is dirty data and how do I clean it? A great big guide for data journalists

Image: George Hodan

If you’re working with data as a journalist it won’t be long before you come across the phrases “dirty data” or “cleaning data”. The phrases cover a wide range of problems, and a variety of techniques for tackling them, so in this post I’m going to break down exactly what it is that makes data “dirty”, and the different cleaning strategies that a journalist might adopt in response.

Four categories of dirty data problem

Look around for definitions of dirty data and the same three words will crop up: inaccurate, incomplete, or inconsistent.

Dirty data problems:
Inaccurate: data stored as the wrong type; misentered data; duplicate data; abbreviations and symbols.
Incomplete: uncategorised data; missing data.
Inconsistent: inconsistent naming of entities; mixed data.
Incompatible: wrong shape; ‘dirty’ characters (e.g. unescaped HTML).

Inaccurate data includes duplicate or misentered information, or data which is stored as the wrong data type.

Incomplete data might only cover particular periods of time, specific areas, or categories — or be lacking categorisation entirely.

Inconsistent data might name the same entities in different ways or mix different types of data together.
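
To make those three categories a little more concrete, here is a minimal sketch in Python using only the standard library. The sample rows, the column names and the `clean_rows` helper are all invented for illustration (they don't come from any real dataset), and a real cleaning job would need rules written for the data in front of you.

```python
# Hypothetical sample rows showing the problems described above.
RAW_ROWS = [
    {"council": "Birmingham City Council", "spend": "1,200"},
    {"council": " birmingham city council ", "spend": "1,200"},  # duplicate, named inconsistently
    {"council": "Leeds City Council", "spend": "£950"},          # symbol mixed into a number
    {"council": "Leeds City Council", "spend": ""},              # missing value
]


def clean_rows(rows):
    """Return de-duplicated rows with tidy names and numeric spend."""
    seen = set()
    cleaned = []
    for row in rows:
        # Inconsistent: normalise whitespace and capitalisation in names.
        name = " ".join(row["council"].split()).title()
        # Inaccurate: numbers stored as text, with symbols and commas.
        digits = row["spend"].replace("£", "").replace(",", "").strip()
        # Incomplete: keep missing values as None rather than guessing.
        spend = float(digits) if digits else None
        key = (name, spend)
        if key in seen:  # Inaccurate: skip rows that duplicate an earlier one.
            continue
        seen.add(key)
        cleaned.append({"council": name, "spend": spend})
    return cleaned


CLEANED = clean_rows(RAW_ROWS)
for row in CLEANED:
    print(row)
```

Note that the missing value is kept as `None` rather than being filled in or dropped: whether a gap should be filled, excluded or reported is an editorial decision, not something the code should decide quietly.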

To those three common terms I would also add a fourth: data that is simply incompatible with the questions we want to ask of it, or the visualisation we want to create from it. One of the most common cleaning tasks in data journalism, for example, is ‘reshaping’ data from long to wide, or vice versa, so that we can aggregate or filter along particular dimensions. (More on this later.)
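
As a rough sketch of what reshaping means in practice, the Python below pivots some made-up ‘long’ data (one row per area per year) into ‘wide’ data (one row per area, one column per year) and back again. The figures and the `long_to_wide` and `wide_to_long` helpers are hypothetical, written with the standard library only; spreadsheet pivot tables and data analysis libraries offer their own tools for the same job.

```python
from collections import defaultdict

# Hypothetical 'long' data: one row per area, per year.
LONG_ROWS = [
    {"area": "North", "year": "2021", "crimes": 120},
    {"area": "North", "year": "2022", "crimes": 135},
    {"area": "South", "year": "2021", "crimes": 80},
    {"area": "South", "year": "2022", "crimes": 95},
]


def long_to_wide(rows, index, column, value):
    """Pivot long rows so each distinct `column` value becomes its own field."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[column]] = row[value]
    return [{index: key, **fields} for key, fields in wide.items()]


def wide_to_long(rows, index, column, value):
    """Unpivot wide rows back to one row per index/column pair."""
    return [
        {index: row[index], column: name, value: cell}
        for row in rows
        for name, cell in row.items()
        if name != index
    ]


WIDE_ROWS = long_to_wide(LONG_ROWS, "area", "year", "crimes")
ROUND_TRIP = wide_to_long(WIDE_ROWS, "area", "year", "crimes")

for row in WIDE_ROWS:
    print(row)
```

The wide version makes it easy to compare years side by side for each area; the long version is usually easier to filter or total by year.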


FAQ: Books to read in preparation for doing a data journalism course

This is what you’ll look like after reading all of these books… (“Study of a Man Reading” by Alphonse Legros)

This latest in the frequently asked questions series is an answer to an aspiring data journalism student who asks “Would you be able to direct me to any resources or text books that might help [prepare]?” Here are some recommendations I give to students on my MA in Data Journalism.

Books on data journalism as a profession

Data journalism isn’t just the application of a practical skill, but a profession with a culture, a history, and non-technical practices.

For that reason probably the first thing to recommend is not a book, but just reading (and listening to, and watching) as much data journalism, and journalism generally, as possible. These mailing lists (and these) are a good start, and following data journalists on Twitter, and the hashtag #ddj, will expose you to the debates taking place in the industry.