Tag Archives: cleaning

I’ve updated the Inverted Pyramid of Data Journalism — and brought together resources for every stage

What is dirty data and how do I clean it? A great big guide for data journalists

If you’re working with data as a journalist it won’t be long before you come across the phrases “dirty data” or “cleaning data”. The phrases cover a wide range of problems, and a variety of techniques for tackling them, so in this post I’m going to break down exactly what it is that makes data “dirty”, and the different cleaning strategies that a journalist might adopt in response.

Four categories of dirty data problem

Look around for definitions of dirty data and the same three words will crop up: inaccurate, incomplete, or inconsistent.

Dirty data problems:
Inaccurate: data stored as the wrong type; misentered data; duplicate data; abbreviations and symbols
Incomplete: uncategorised data; missing data
Inconsistent: inconsistent naming of entities; mixed data
Incompatible: wrong shape; ‘dirty’ characters (e.g. unescaped HTML)

Inaccurate data includes duplicate or misentered information, or data which is stored as the wrong data type.

Incomplete data might only cover particular periods of time, specific areas, or categories — or be lacking categorisation entirely.

Inconsistent data might name the same entities in different ways or mix different types of data together.

To those three common terms I would add a fourth: data that is simply incompatible with the questions we want to ask of it or the visualisation we want to create. One of the most common cleaning tasks in data journalism, for example, is ‘reshaping’ data from long to wide, or vice versa, so that we can aggregate or filter along particular dimensions (more on this later).
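To make the reshaping idea concrete, here is a minimal Python sketch using only the standard library. The rows, the column names (area, year, fires) and the helper names long_to_wide and wide_to_long are all invented for illustration; a pivot table in a spreadsheet or a dedicated data tool would do the same job.

```python
from collections import defaultdict

# Hypothetical long-format data: one row per (area, year) observation.
long_rows = [
    {"area": "North", "year": 2020, "fires": 120},
    {"area": "North", "year": 2021, "fires": 95},
    {"area": "South", "year": 2020, "fires": 80},
    {"area": "South", "year": 2021, "fires": 110},
]


def long_to_wide(rows, index, column, value):
    """Turn one row per observation into one row per index value."""
    wide = defaultdict(dict)
    for row in rows:
        # Each distinct value in `column` becomes its own column heading.
        wide[row[index]][row[column]] = row[value]
    return [{index: key, **cells} for key, cells in wide.items()]


def wide_to_long(rows, index, column, value):
    """Reverse the reshape: one row per (index, column heading) pair."""
    return [
        {index: row[index], column: key, value: cell}
        for row in rows
        for key, cell in row.items()
        if key != index
    ]


wide_rows = long_to_wide(long_rows, "area", "year", "fires")
for row in wide_rows:
    print(row)
```

In the wide version each area has a single row with a column per year, which is easier to read and chart; the long version is usually easier to filter and aggregate.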

How to: plan a journalism project that needs data entry

Panorama source: FOI sent to 144 councils

This Panorama investigation involved entering data from 144 FOI responses

Data-driven reporting regularly involves some form of data entry — some of the stories I’ve been involved with, for example, have included entering information from Freedom of Information (FOI) requests, compiling data from documents such as companies’ accounts, or working with partners to collect information from a range of sources.

But you’ll rarely hear the challenges of managing these projects discussed in resources on data journalism.

Last week I delivered a session on exactly those challenges to a factchecking team in Albania, so I thought it might be useful to share the tips from that session here.

They include steps that reduce the likelihood of problems arising and help ensure a data entry project takes as little time as possible.

How to: uncover Excel data only revealed by a drop-down menu

Sometimes an organisation will publish a spreadsheet where only a part of the full data is shown when you select from a drop-down menu. In order to get all the data, you’d have to manually select each option, and then copy the results into a new spreadsheet.

It’s not great.

In this post, I’ll explain some tricks for finding out exactly where the full data is hidden, and how to extract it without getting Repetitive Strain Injury. Here goes…

The example

To get the data from this spreadsheet you have to select 51 different options from a dropdown menu

The spreadsheet I’m using here is pretty straightforward: it’s a list of the populations for each fire and rescue authority in the UK (XLS). These figures are essential for putting any story about fires into context (giving us a per capita figure rather than just raw totals), and yet the authority behind the spreadsheet has made it very difficult to extract those numbers.
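The per capita calculation itself is simple arithmetic once the populations are extracted. A minimal Python sketch follows; the figures, the authority names and the helper name rate_per_100k are made up for illustration and do not come from the spreadsheet.

```python
def rate_per_100k(incidents, population):
    """Return incidents per 100,000 residents, rounded to one decimal place."""
    if population <= 0:
        raise ValueError("population must be a positive number")
    return round(incidents * 100_000 / population, 1)


# Invented figures for two hypothetical authorities.
authorities = {
    "Authority A": {"fires": 500, "population": 1_000_000},
    "Authority B": {"fires": 300, "population": 250_000},
}

rates = {
    name: rate_per_100k(row["fires"], row["population"])
    for name, row in authorities.items()
}
print(rates)
```

In this invented example Authority B has fewer fires than Authority A but a higher rate per 100,000 people, which is exactly the kind of context the raw totals hide.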

How to: clean a converted PDF using Open Refine

How to: combine multiple rows in a dataset where text is split across them (Open Refine)

When you’ve converted data from a PDF to a spreadsheet it’s not uncommon for text to end up being split across multiple rows. In this post I’ll explain how you can use Open Refine to quickly clean the data up so that the text is put back together and you have a single row for each entry.
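For comparison, the same idea can be sketched in a few lines of Python. This is not the Open Refine method the post describes: it is a rough illustration that assumes continuation rows can be recognised by a blank value in an identifying column. The column names (id, description), the sample rows and the helper name merge_split_rows are hypothetical.

```python
def merge_split_rows(rows, key="id", text="description"):
    """Fold rows with a blank key into the previous row's text field."""
    merged = []
    for row in rows:
        if row[key].strip() or not merged:
            # A filled-in key marks the start of a new entry.
            merged.append(dict(row))
        else:
            # A blank key means this row continues the previous entry.
            combined = f"{merged[-1][text]} {row[text]}"
            merged[-1][text] = " ".join(combined.split())
    return merged


# Invented rows imitating a PDF conversion that split one entry in three.
rows = [
    {"id": "1", "description": "Grant awarded to the"},
    {"id": "", "description": "community centre"},
    {"id": "", "description": "in 2019"},
    {"id": "2", "description": "Road repairs"},
]

clean_rows = merge_split_rows(rows)
for row in clean_rows:
    print(row)
```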

How to: clean up spreadsheet headings that run across multiple rows using Open Refine

Something that infuriates me often with government datasets is the promiscuous heading. This is when a spreadsheet doesn’t just have its headings across one row, but instead splits them across two, three or more rows.

To make matters worse, there are often also extra rows before the headings explaining the spreadsheet more generally. Here’s just one offender from the ONS:

A spreadsheet with promiscuous headings

To clean this up in Excel takes several steps – but Open Refine (formerly Google Refine) does this much more quickly. In this post I’m going to walk through a five-minute process in Open Refine that can save you unnecessary effort in Excel.
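To show what flattening headings means in practice, here is a minimal Python sketch. It is not the Open Refine process from the post. It assumes that blank cells in the upper heading rows are merged cells that inherit the heading to their left, and that any explanatory rows above the headings have already been removed. The sample headings and the helper name flatten_headings are invented.

```python
def flatten_headings(header_rows):
    """Combine several heading rows into a single row of column names."""
    filled_rows = []
    for position, row in enumerate(header_rows):
        is_last = position == len(header_rows) - 1
        filled, previous = [], ""
        for cell in row:
            cell = cell.strip()
            if cell:
                previous = cell
            elif not is_last:
                # Assumption: a blank cell in an upper row is part of a
                # merged heading, so it inherits the value to its left.
                cell = previous
            filled.append(cell)
        filled_rows.append(filled)
    # Join the parts of each column from top to bottom, skipping blanks.
    return [" ".join(part for part in column if part) for column in zip(*filled_rows)]


# Invented two-row heading of the kind often seen in official spreadsheets.
header_rows = [
    ["Area", "2020", "", "2021", ""],
    ["", "Male", "Female", "Male", "Female"],
]

headings = flatten_headings(header_rows)
print(headings)
```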

Web security for journalists – takeaway tips and review

Web security for journalists - book cover

Early in Alan Pearce’s book on web security, Deep Web for Journalists, a series of statistics tells a striking story about the spread of surveillance in just one country.

199 is the first: the number of data mining programs in the US in 2004 when 16 Federal agencies were “on the look-out for suspicious activity”.

Just six years later there were 1,200 government agencies and 1,900 private companies working on domestic intelligence programs.

As a result of this spread there are, notes Pearce, 4.8m people with security clearance “that allows them to access all kinds of personal information”. 1.4m have Top Secret clearance.

But the most sobering figure comes at the end: 1,600 – the number of names added to the FBI’s terrorism watchlist each day.

Predictive policing

This is the world of predictive policing that a modern journalist must operate in: where browsing protesters’ websites, making particular searches, or mentioning certain keywords in your emails or tweets can put you on a watchlist, or even a no-fly list. An environment where it is increasingly difficult to protect your sources – or indeed for sources to trust you.

Alan Pearce’s book attempts to map this world – and outline the myriad techniques to avoid compromising your sources.

A sample dirty dataset for trying out Google Refine

I’ve created this spreadsheet of ‘dirty data’ to demonstrate some typical problems that data cleaning tools and techniques can be used for:

Subheadings that are only used once (and you need them in each row where they apply)
Odd characters that stand for something else (e.g. a space or ampersand)
Different entries that mean the same thing, either because they are lacking pieces of information, or have been mistyped, or inconsistently formatted

It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore simpler techniques in spreadsheets like Find and replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and title case respectively).
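If you prefer code to spreadsheet functions, rough Python equivalents exist for each of these. The sketch below is only an approximation of the spreadsheet behaviour, and the helper name trim and the sample entries are invented for illustration.

```python
def trim(text):
    """Strip leading and trailing whitespace and collapse internal runs.

    str.split() with no arguments splits on any run of whitespace, so
    joining the pieces with single spaces tidies the whole string.
    """
    return " ".join(text.split())


# Invented variants of one entity, as they might appear in a dirty column.
entries = [
    "  Birmingham   City  Council ",
    "BIRMINGHAM CITY COUNCIL",
    "birmingham city council",
]

upper = [trim(entry).upper() for entry in entries]  # like UPPER
lower = [trim(entry).lower() for entry in entries]  # like LOWER
# Like PROPER. Note that str.title() also capitalises the letter after an
# apostrophe, so check the results by eye before relying on them.
proper = [trim(entry).title() for entry in entries]

print(set(proper))
```

Once the three variants are trimmed and put into the same case they collapse into a single value, which is what makes counting or grouping by that column reliable.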

Thanks to Eva Constantaras for suggesting the idea.

UPDATE: Peter Verweij has put together an introduction to some other cleaning techniques here.

SFTW: 9 data journalism tools

There have been quite a few tools springing up over the past few months that I’ve not had time to blog about, so here’s a roundup post on all of them – a bumper Something For The Weekend (let me know how you find these).

1. Junar – for scraping websites and sharing data

Junar presents a much easier way to scrape data from online tables with its ‘Collect Data’ tool – and the team behind it tell me they have plans to build functionality allowing users to scrape linked pages, as well as the ability to scrape PDFs.
