Tag Archives: AI

What is dirty data and how do I clean it? A great big guide for data journalists

Image: George Hodan

If you’re working with data as a journalist it won’t be long before you come across the phrases “dirty data” or “cleaning data”. The phrases cover a wide range of problems, and a variety of techniques for tackling them, so in this post I’m going to break down exactly what it is that makes data “dirty”, and the different cleaning strategies that a journalist might adopt.

Four categories of dirty data problem

Look around for definitions of dirty data and the same three words will crop up: inaccurate, incomplete, or inconsistent.

Dirty data problems:
Inaccurate: data stored as the wrong type; misentered data; duplicate data; abbreviations and symbols.
Incomplete: uncategorised data; missing data.
Inconsistent: inconsistent naming of entities; mixed data.
Incompatible: wrong shape; ‘dirty’ characters (e.g. unescaped HTML).

Inaccurate data includes duplicate or misentered information, or data which is stored as the wrong data type.

Incomplete data might only cover particular periods of time, specific areas, or categories — or be lacking categorisation entirely.

Inconsistent data might name the same entities in different ways or mix different types of data together.
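
To make those three categories concrete, here is a minimal sketch in Python of what cleaning them might look like. The rows, council names and figures are invented for illustration, and the helper functions (clean_name, clean_number) are hypothetical names rather than part of any library. It assumes that trimming whitespace and standardising case is enough to match names, which will not hold for every dataset.

```python
# Hypothetical rows, invented for illustration only.
raw_rows = [
    {"council": "Northtown Council", "spend": "1,200"},
    {"council": "northtown  council ", "spend": "1,200"},  # same entity, named inconsistently
    {"council": "Southtown Council", "spend": "n/a"},      # incomplete: no usable figure
    {"council": "Southtown Council", "spend": "850"},
]


def clean_name(name):
    """Collapse stray whitespace and standardise case so one entity has one name."""
    return " ".join(name.split()).title()


def clean_number(text):
    """Convert text such as '1,200' to an integer; return None if it is not a whole number."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


cleaned = []
seen = set()
for row in raw_rows:
    record = (clean_name(row["council"]), clean_number(row["spend"]))
    if record in seen:
        # Skip rows that are exact duplicates once names and types are consistent.
        continue
    seen.add(record)
    cleaned.append({"council": record[0], "spend": record[1]})

for row in cleaned:
    print(row)
```

Note that the missing figure is kept as an explicit empty value rather than being dropped or turned into a zero: whether to exclude it, estimate it or chase it up is an editorial decision, not a technical one.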

To those three common terms I would also add a fourth: data that is simply incompatible with the questions we want to ask of it, or the visualisation we want to create with it. One of the most common cleaning tasks in data journalism, for example, is ‘reshaping’ data from long to wide, or vice versa, so that we can aggregate or filter along particular dimensions. (More on this later.)
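
As a rough illustration of what reshaping means, the sketch below converts a small ‘long’ table (one row per area per year) into a ‘wide’ one (one row per area, one column per year) and back again. The data is invented, and long_to_wide and wide_to_long are hypothetical helper names written for this example using only Python’s standard library.

```python
from collections import defaultdict

# Hypothetical 'long' data: one row per area per year (values invented).
long_rows = [
    {"area": "North", "year": 2021, "incidents": 10},
    {"area": "North", "year": 2022, "incidents": 14},
    {"area": "South", "year": 2021, "incidents": 7},
    {"area": "South", "year": 2022, "incidents": 5},
]


def long_to_wide(rows, index, column, value):
    """Turn each distinct entry in `column` into its own column, one row per `index`."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[column]] = row[value]
    return [{index: key, **cells} for key, cells in wide.items()]


def wide_to_long(rows, index, column, value):
    """Reverse the reshape: one output row per (index, column) pair."""
    return [
        {index: row[index], column: key, value: cell}
        for row in rows
        for key, cell in row.items()
        if key != index
    ]


wide_rows = long_to_wide(long_rows, "area", "year", "incidents")
for row in wide_rows:
    print(row)

# Going back to long format recovers the original rows.
print(wide_to_long(wide_rows, "area", "year", "incidents") == long_rows)
```

The wide version makes it easy to compare years side by side for each area; the long version is usually easier to filter and aggregate. In practice most people would do this with a spreadsheet pivot table or a library such as pandas (its pivot and melt functions) rather than by hand.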

Continue reading

Here are some great examples of how to use AI and satellite imagery in journalism

False colour image of the Paraná River near its mouth at the Rio de La Plata, Argentina. Image: Copernicus Sentinel data [2022] processed by Sentinel Hub.

In a guest post for OJB, first published on ML Satellites, MA Data Journalism student Federico Acosta Rainis explains what can be learned from some examples of the format.

Satellite imagery is increasingly a key asset for journalists. Looking from above often allows us to put a story into context, take a more interesting perspective or show what those in power prefer to keep hidden.

But with hundreds of satellites taking thousands of images of the Earth every day, it is difficult to separate the wheat from the chaff. How can we find relevant stories in this ocean of data?

Continue reading

What stories can you tell using AI and satellite imagery? Here are some ideas

In the second of two guest posts for OJB, first published on the ML Satellites blog, MA Data Journalism student Federico Acosta Rainis uses the ‘8 angles used by data journalists’ framework to explore satellite image-driven journalism.

Satellite-driven stories don’t have to use artificial intelligence (AI): many can be told using satellite data alone. The main advantages of AI include quantifying phenomena, identifying patterns, showing changes or finding a “needle in a haystack” across large territories or different time periods.

AI algorithms can also be used to automate a process: since satellites produce recurring data, you can build, for example, a platform that automatically detects changes in the size of forests.

Paul Bradshaw’s framework for data journalism angles recognises eight types of stories: scale, change, ranking, variation, exploration, relationships, stories about data and stories through data. The same framework can be adopted to generate ideas for satellite journalism, too.

Continue reading

Journalism, AI and satellite imagery: how to get started

Satellite image of the Amazon. Tocantins, Brazil. Source: Copernicus Sentinel data [2022] processed by Sentinel Hub, using Highlight Optimized Natural Color.

In the first of two guest posts for OJB, first published on ML Satellites, MA Data Journalism student Federico Acosta Rainis explains how to get started with satellite journalism — and avoid common pitfalls.

Working with satellite imagery and AI models takes time and patience. There is no general rule: you have to find the right model for each case, in a process of trial and error, while crunching large amounts of data.

That is why the advice of Anatoly Bondarenko, data editor of Texty, is crucial:

Continue reading

GEN 2019 round-up: 4 videos to watch on the potential of data and AI

Krishna Bharat

This year’s Global Editors Network (GEN) Summit, in Athens, Greece, had a big focus on the use of verification and automation. BBC News data scientist and PGCert Data Journalism student Alison Benjamin went along to see what was being said about artificial intelligence (AI), data and technology in the news industry. Here are her highlights…
Continue reading

If we are using AI in journalism we need better guidelines on reporting uncertainty

Chart: women speak 27% of the time in Game of Thrones

The BBC’s chart mentions a margin of error

There’s a story out this week on the BBC website about dialogue and gender in Game of Thrones. It uses data generated by artificial intelligence (AI), specifically machine learning, and it’s a good example of some of the challenges that journalists are increasingly going to face as they come to deal with more and more algorithmically-generated data.

Information and decisions generated by AI are qualitatively different from the sort of data you might find in an official report, but journalists may fall back on treating data as inherently factual.

Here, then, are some of the ways the article dealt with that — and what else we can do as journalists to adapt.

Margins of error: journalism doesn’t like vagueness

The story draws on data from an external organisation, Ceretai, which “uses machine learning to analyse diversity in popular culture.” The organisation claims to have created an algorithm which “has learned to identify the difference between male and female voices in video and provides the speaking time lengths in seconds and percentages per gender.”

Crucially, the piece notes that:

“Like most automatic systems, it doesn’t make the right decision every time. The accuracy of this algorithm is about 85%, so figures could be slightly higher or lower than reported.”

And this is the first problem.

Continue reading

GEN Summit: AI’s breakthrough year in publishing

This week’s GEN Summit marked a breakthrough moment for artificial intelligence (AI) in the media industry. The topic dominated the agenda of the first two days of the conference, from the opening keynote by Facebook’s Antoine Bordes to voice AI, bots, monetisation and verification – and it dominated my timeline too.

At times it felt like being at a conference in the 1980s discussing how ‘computers’ could be used in the newsroom, or listening to people talking about the use of mobile phones for journalism in the noughties. In other words, it felt very much like early days. But important days nonetheless.

Ludovic Blecher’s slide on the AI-related projects that received Google Digital News Initiative funding illustrated the problem best, with proposals counted in categories as specific as ‘personalisation’ and as vague as ‘hyperlocal’.

Digging deeper, then, here are some of the most concrete points I took away from Lisbon — and what journalists and publishers can take from those.

Continue reading

This is what I learned after teaching chatbots to journalists: 3 takeaways for newsrooms

In a guest post for OJB Maria Crosas points out three main takeaways that newsrooms should consider when aiming for a complete chatbot experience. 

Over the past year I’ve been frequently invited to share ideas on how bots can help newsrooms to deliver news, and advice on how to build engaging chatbot experiences. And throughout these classes, I’ve also had challenging questions on how these technologies are pushing the boundaries of ethics, artificial intelligence and storytelling.

I’ve boiled down these experiences into 3 takeaways for newsrooms that want to begin the chatbot journey. Here they are…

Continue reading

Data journalism’s AI opportunity: the 3 different types of machine learning & how they have already been used

I understand that you want me to explain how Ava works (from Ex Machina)

This week I’m rounding off the first semester of classes on the new MA in Data Journalism with a session on artificial intelligence (AI) and machine learning. Machine learning is a subset of AI — and an area which holds enormous potential for journalism, both as a tool and as a subject for journalistic scrutiny.

So I thought I would share part of the class here, showing some examples of how the 3 types of machine learning (supervised, unsupervised, and reinforcement) have already been used for journalistic purposes, and using those examples to explain what each type is along the way.

Continue reading

Data journalism in broadcast news and video: 27+ examples to inspire and educate


This network diagram comes from a Channel 4 News story

The best-known examples of data journalism tend to be based around text and visuals — but it’s harder to find data journalism in video and audio. Ahead of the launch of my new MA in Data Journalism I thought I would share my list of the examples of video data journalism that I use with students in exploring data storytelling across multiple platforms. If you have others, I’d love to hear about them.

FOI stories in broadcast journalism


Freedom of Information stories are one of the most common situations in which broadcasters have to deal with more in-depth data. These are often brought to life through case studies and interviews with experts.

Continue reading