Category Archives: computer aided reporting

How to ask AI to perform data analysis

Consider the model: Some models are better for analysis than others — check that it has actually run code

Name specific columns and functions: Be explicit to avoid ‘guesses’ based on your most probable meaning

Design answers that include context: Ask for a top/bottom 10 instead of just one answer

'Ground' the analysis with other docs: Methodologies, data dictionaries, and other context

Map out a method using CoT: Outline the steps that need to be taken to reduce risk

Use prompt design techniques to avoid gullibility and other risks: N-shot prompting (examples), role prompting, negative prompting and meta prompting can all reduce risk

Anticipate conversation limits: Regularly ask for summaries you can carry into a new conversation

Export data to check: Download analysed data to check against the original

Ask to be challenged: Use adversarial prompting to identify potential blind spots or assumptions

In a previous post I explored how AI performed on data analysis tasks — and the importance of understanding the code that it used to do so. If you do understand code, here are some tips for using large language models (LLMs) for analysis — and addressing the risks of doing so.
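As a rough sketch of the “export data to check” tip above — assuming the AI tool lets you download its analysed table, you can line it up against your original figures with pandas (the column names and values here are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical stand-ins: your original data and the table exported from the AI tool
original = pd.DataFrame({"region": ["North", "South"], "spend": [100, 250]})
exported = pd.DataFrame({"region": ["North", "South"], "spend": [100, 260]})

# Align the two tables on a shared key and flag rows where the figures disagree
merged = original.merge(exported, on="region", suffixes=("_orig", "_ai"))
merged["mismatch"] = merged["spend_orig"] != merged["spend_ai"]

# Any flagged row is a calculation worth investigating
print(merged[merged["mismatch"]])
```

Even a quick comparison like this can surface silent errors — dropped rows, rounding, or a miscalculated column — that you would never spot in the AI’s written summary alone.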

Continue reading

I tested AI tools on data analysis — here’s how they did (and what to look out for)

Mug with 'Data or it didn't happen' on it
Photo: Jakub T. Jankiewicz | CC BY-SA 2.0

TL;DR: If you understand code, or would like to understand code, genAI tools can be useful for data analysis — but results depend heavily on the context you provide, and the likelihood of flawed calculations means code needs checking. If you don’t understand code (and don’t want to) — don’t do data analysis with AI.

ChatGPT used to be notoriously bad at maths. Then it got worse at maths. And the recent launch of its newest model, GPT-5, showed that it’s still bad at maths. So when it comes to using AI for data analysis, it’s going to mess up, right?

Well, it turns out that the answer isn’t that simple. And the reason why it’s not simple is important to explain up front.

Generative AI tools like ChatGPT are not calculators. They use language models to predict a sequence of words based on examples from their training data.

But over the last two years AI platforms have added the ability to generate and run code (mainly Python) in response to a question. This means that, for some questions, they will try to predict the code that a human would probably write to answer your question — and then run that code.

When it comes to data analysis, this has two major implications:

  1. Responses to data analysis questions are often (but not always) the result of calculations, rather than a predicted sequence of words. The algorithm generates code, runs that code to calculate a result, then incorporates that result into a sentence.
  2. Because we can see the code that performed the calculations, it is possible to check how those results were arrived at.
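To make that concrete: asked something like “which areas spent the most?”, a tool with code execution will typically generate and run pandas code along these lines (the data and column names here are illustrative, not from any real query):

```python
import pandas as pd

# Illustrative data standing in for an uploaded spreadsheet
df = pd.DataFrame({
    "authority": ["A", "B", "C", "D"],
    "spend": [120, 340, 90, 210],
})

# The kind of code a model might predict for a "top spenders" question:
# sort by the spend column, highest first, and take the top rows
top = df.sort_values("spend", ascending=False).head(10)
print(top)
```

Because code like this is shown alongside the answer, you can check which column was sorted, whether rows were dropped, and whether the logic actually matches the question you asked.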
Continue reading

The inverted pyramid of data journalism: from dataset to story

The inverted pyramid of data journalism
Develop ideas
Compile data
Clean
Contextualise
Combine
Question
Communicate

Data journalism projects can be broken down into individual steps — and each step brings its own challenges. To help you navigate them, I developed the “Inverted Pyramid of Data Journalism”. It shows how to turn an idea into a focused data story. I explain step by step what to look out for, and give tips on how to avoid typical stumbling blocks.

(Also available in English, Spanish, Finnish, Russian and Ukrainian.)

Continue reading

9 takeaways from the Data Journalism UK conference

Attendees in a lecture theatre with 'data and investigative journalism conference 2025 BBC Shared Data Unit' on the screen.

Last month the BBC’s Shared Data Unit held its annual Data and Investigative Journalism UK conference at the home of my MA in Data Journalism, Birmingham City University. Here are some of the highlights…

Continue reading

How do I get data if my country doesn’t publish any?

Spotlight photo by Paul Green on Unsplash

In many countries public data is limited, and access to data is either restricted, or information provided by the authorities is not credible. So how do you obtain data for a story? Here are some techniques used by reporters around the world.

Continue reading

Making video and audio interviews searchable: how Pinpoint helped with one investigation

Pinpoint creates a ranking of people, organisations, and locations with the number of times they are mentioned in your uploaded documents.

MA Data Journalism student Tony Jarne spent eight months investigating exempt accommodation, collecting hundreds of documents, audio and video recordings along the way. To manage all this information, he turned to Google’s free tool Pinpoint. In a special guest post for OJB, he explains how it should be an essential part of any journalist’s toolkit.

The use of exempt accommodation — a type of housing for vulnerable people — has rocketed in recent years.

At the end of December, a select committee was set up in Parliament to look into the issue. The committee opened a call for written evidence, setting a deadline by which anyone who wished to could make a submission.

Organisations, local authorities and citizens submitted more than 125 pieces of written evidence to be taken into account by the committee. Some are only one page — others are 25 pages long.

In addition to the written evidence, I had various reports, news articles, Land Registry titles and company accounts downloaded from Companies House.

I needed a tool to organise all the documentation. I needed Pinpoint.

Continue reading

Investigating the World Cup: tips on making FOIA requests to create a data-driven news story

Image by Ambernectar 13

Beatriz Farrugia used Brazil’s freedom of information laws to investigate the country’s hosting of the World Cup. In a special guest post for OJB, the Brazilian journalist and former MA Data Journalism student passes on some of her tips for using FOIA.

I am from Brazil, a country well-known for football and FIFA World Cup titles — and the host of the World Cup in 2014. Being a sceptical journalist, in 2019 I tried to discover the real impacts of that 2014 World Cup on the 213 million residents of Brazil: tracking the 121 infrastructure projects that the Brazilian government carried out for the competition and which were considered the “major social legacy” of the tournament.  

In 2018 the Brazilian government had taken the website and official database on the 2014 FIFA World Cup infrastructure projects offline — so I had to make Freedom of Information (FOIA) requests to get data.

The investigation took three months and more than 230 FOIA requests to 33 different public bodies in Brazil. On August 23, my story was published.

Here is everything that I have learned from making those hundreds of FOIA requests:

Continue reading

How to: clean a converted PDF using Open Refine

Our initial table

This spreadsheet sent in response to an FOI request appeared to have been converted from PDF format

In a guest post for OJB, Ion Mates explains how he used OpenRefine to clean up a spreadsheet which had been converted from PDF format. An earlier version of this post was published on his blog.

Journalists rarely get their hands on nice, tidy data: public bodies don’t have an interest in providing information in a structured form. So it is increasingly part of a journalist’s job to get that information into the right state before extracting patterns and stories.

A few months ago I sent a Freedom of Information request asking for the locations of all litter bins in Birmingham. But instead of sending a spreadsheet downloaded directly from their database, the spreadsheet they sent appeared to have been converted from a multiple-page PDF.

This meant all sorts of problems, from rows containing page numbers and repeated header rows, to information split across multiple rows and even pages.

In this post I’ll be taking you through how I used the free data cleaning tool OpenRefine (formerly Google Refine) to tackle all these problems and create a clean version of the data. Continue reading