Author Archives: Paul Bradshaw

About Paul Bradshaw

Paul teaches data journalism at Birmingham City University and is the author of a number of books and book chapters about online journalism and the internet, including the Online Journalism Handbook, Mobile-First Journalism, Finding Stories in Spreadsheets, Data Journalism Heist and Scraping for Journalists. From 2010-2015 he was a Visiting Professor in Online Journalism at City University London and from 2009-2014 he ran Help Me Investigate, an award-winning platform for collaborative investigative journalism. Since 2015 he has worked with the BBC England and BBC Shared Data Units based in Birmingham, UK. He also advises and delivers training to a number of media organisations.

What is dirty data and how do I clean it? A great big guide for data journalists

7 Replies

If you’re working with data as a journalist it won’t be long before you come across the phrases “dirty data” or “cleaning data“. The phrases cover a wide range of problems, and a variety of techniques for tackling them, so in this post I’m going to break down exactly what it is that makes data “dirty”, and the different cleaning strategies that a journalist might adopt in tackling them.

Four categories of dirty data problem

Look around for definitions of dirty data and the same three words will crop up: inaccurate, incomplete, or inconsistent.

Dirty data problems:
Inaccurate: Data stored as wrong type; Misentered data; Duplicate data; abbreviation and symbols.
Incomplete: Uncategorised; missing data.
Inconsistent: Inconsistency in naming of entities; mixed data
Incompatible data: Wrong shape;
‘Dirty’ characters (e.g. unescaped HTML)

Inaccurate data includes duplicate or misentered information, or data which is stored as the wrong data type.

Incomplete data might only cover particular periods of time, specific areas, or categories — or be lacking categorisation entirely.

Inconsistent data might name the same entities in different ways or mix different types of data together.

To those three common terms I would also add a fourth: data that is simply incompatible with the questions or visualisation that we want to perform with it. One of the most common cleaning tasks in data journalism, for example, is ‘reshaping‘ data from long to wide, or vice versa, so that we can aggregate or filter along particular dimensions. (More on this later).

Continue reading →

Angles for data stories — in Finnish (yleistä näkökulmaa datatarinoihin)

Leave a reply

I recently had the opportunity — thanks to Esa Sirkkunen of Tampere University — to translate the diagram from ‘8 angles that journalists use most often to tell data stories‘ into Finnish. I’m sharing it here for anyone else who might find it useful.

8 yleistä näkökulmaa datatarinoihin
Mittakaava
Muutos
Sijoitus
Variaatio
Tutkia
Suhteet
Puuttuva/huono
Johtaa

FOI, diversity, and imposter syndrome — an interview with Jenna Corderoy

Leave a reply

Lyra McKee Memorial Lecture - Tuesday March 28 5.30-7pm at Birmingham City University Curzon Building. Jenna Corderoy on holding power to account and getting into journalism.

On Tuesday I will be hosting the award-winning investigative journalist and FOI campaigner Jenna Corderoy at the Lyra McKee Memorial Lecture. Ahead of the event, I asked Jenna about her tips on investigations, FOI, confidence, and the challenges facing the industry.

What’s the story you have learned the most from?

The story that I learned the most from was definitely our Clearing House investigation. Back in November 2020, we revealed the existence of a unit within the heart of government, which screened Freedom of Information (FOI) requests and instructed government departments on how to respond to requests. The unit circulated the names of requesters across Whitehall, notably the names of journalists.

Continue reading →

VIDEO PLAYLIST: An introduction to Python for data journalism and scraping

2 Replies

Python is an extremely powerful language for journalists who want to scrape information from online sources. This series of videos, made for students on the MA in Data Journalism at Birmingham City University, explains some core concepts to get started in Python, how to use Colab notebooks within Google Drive, and introduces some code to get started with scraping.

Continue reading →

Availability bias: a guide for journalists

Leave a reply

Diagram showing a large circle labelled 'All the information about a subject' and a smaller circle within that labelled 'The information that is easiest to recall'

I’ve written previously about the role that cognitive biases play in journalism, how to avoid confirmation bias, and anticipate criticism based on fallacies — but one cognitive bias I haven’t written about yet is the availability heuristic — or availability bias.

Availability bias is the tendency to reach for the most available reason, event, or tool, when confronted with a problem or decision.

Continue reading →

Here’s how the ‘8 data story angles’ can help you get stories from company accounts

1 Reply

8 common angles for accounts stories
Scale: of profit/loss, of bonuses, payoffs, cuts
Change/stasis: profit/loss/bonuses going up/down
Outliers/ranking: based on any metric
Variation: within a sector
Exploration: a company structure; a director; payments
Relationships: mapping a corporate network or director’s interests
Bad data: Undeclared interests
Leads: Background, conflicts of interest, factchecks

A couple of years ago I mapped out eight common angles for identifying stories in data. It turns out that the same framework is useful for finding stories in company accounts, too — but not only that: the angles also map neatly onto three broad techniques.

In this post I’ll go through each of the three techniques — looking at cash flow statements; compiling data from multiple accounts; and tracing people and connections — and explain how they can be used to get stories, with examples of articles that have used those techniques successfully.

We start, naturally, with the money…

Continue reading →

9 способов найти историю в финансовых отчётах компаний

Leave a reply

Моя статья на русском здесь.

🔦 9 способов найти историю в финансовых отчётах компаний@paulbradshaw вместе со своими студентами собрал примеры, в которых чтение нужной страницы отчёта помогло быстро подготовить действительно интересный расследовательский материал👇 https://t.co/YKwvdkJxzh
— GIJN – Глобальная сеть журналистов-расследователей (@gijnRu) February 1, 2023

This is a masterclass in writing a story about company directors’ pay — so I reverse-engineered it

Leave a reply

Owner of UK care home group paid himself £21m despite safety concerns

Company directors’ pay regularly provides material for stories — and this front page story by The Guardian’s Robert Booth was such a masterclass in the genre (as well as other open source intelligence techniques) that I decided to reverse-engineer it for a Twitter thread.

I’ve embedded the thread below, or you can read it on Threadreader here.

🧵 It’s time for another roller-coaster thread digging into how one journalist has used company accounts* to get a great story.
This time it's a front page story by @Robert_Booth https://t.co/yFi4qH5IBJ
*Featuring: other useful open sources
— Paul Bradshaw (@paulbradshaw) January 23, 2023

Using company accounts in journalism

You can find other posts about using company accounts at the following links:

How ‘triangulating’ can help you identify more sources

Leave a reply

Triangulating journalistic sources: diagram showing that each of the three types of sources (people, data and documents) leads to the other two types of sources

In this edited extract from the forthcoming third edition of the Online Journalism Handbook I look at how a ‘triangulation’ approach to sourcing can help broaden story research and improve reporting.

Two centuries ago journalists were called reporters because they drew their information from official reports — documents.

Then in the late 19th century a new source became part of journalistic practice: people, as interviews and eyewitness accounts were added to news articles.

The late 20th century saw reporting undergo another expansion in sourcing, as digital data was added to the journalist’s toolkit.

Although reports had included tables and other sources of data, the properties of digital data — filterable, sortable and searchable — have been significant, and make data a qualitatively different type of source.

How documents, people and data all lead to each other

Considering sourcing along those three dimensions — people, documents, and data — can be particularly useful when planning sourcing.

Continue reading →

Defending an investigation — and planning one: lessons from ProPublica’s Black Snow

1 Reply

Sugar Companies Said Our Investigation Is Flawed and Biased. Let’s Dive Into Why That’s Not the Case.

In the summer of last year ProPublica published a major investigation into air pollution in Florida, and its connection to the sugar industry. The story itself, Black Snow, is an inspiring example of scrollytelling — but equally instructive is the methodology article which accompanies it, responding to criticisms from the sugar industry.

Not only does it demonstrate how to respond when large organisations attack a piece of journalism — it also provides a great lesson on the tactics that are adopted by organisations when attacking data-driven stories.

In this post I want to break down the three most common attack tactics, how ProPublica deal with two of those, and how to use the same tactics during planning to ensure your project design isn’t flawed.

Continue reading →

Online Journalism Blog

Comment, analysis and links covering online journalism and online news, citizen journalism, blogging, vlogging, photoblogging, podcasts, vodcasts, interactive storytelling, publishing, Computer Assisted Reporting, User Generated Content, searching and all things internet.

Author Archives: Paul Bradshaw

About Paul Bradshaw

What is dirty data and how do I clean it? A great big guide for data journalists

Four categories of dirty data problem

Angles for data stories — in Finnish (yleistä näkökulmaa datatarinoihin)

FOI, diversity, and imposter syndrome — an interview with Jenna Corderoy

What’s the story you have learned the most from?

VIDEO PLAYLIST: An introduction to Python for data journalism and scraping

Availability bias: a guide for journalists

Here’s how the ‘8 data story angles’ can help you get stories from company accounts

9 способов найти историю в финансовых отчётах компаний

This is a masterclass in writing a story about company directors’ pay — so I reverse-engineered it

Using company accounts in journalism

How ‘triangulating’ can help you identify more sources

How documents, people and data all lead to each other

Defending an investigation — and planning one: lessons from ProPublica’s Black Snow