In a previous post I explored how AI performed on data analysis tasks — and the importance of understanding the code that it used to do so. If you do understand code, here are some tips for using large language models (LLMs) for analysis — and addressing the risks of doing so.
TL;DR: If you understand code, or would like to understand code, genAI tools can be useful for data analysis — but results depend heavily on the context you provide, and the likelihood of flawed calculations means the code needs checking. If you don’t understand code (and don’t want to) — don’t do data analysis with AI.
ChatGPT used to be notoriously bad at maths. Then it got worse at maths. And the recent launch of its newest model, GPT-5, showed that it’s still bad at maths. So when it comes to using AI for data analysis, it’s going to mess up, right?
Well, it turns out that the answer isn’t that simple. And the reason why it’s not simple is important to explain up front.
Large language models, on their own, don’t calculate: they predict sequences of words. But over the last two years AI platforms have added the ability to generate and run code (mainly Python) in response to a question. This means that, for some questions, they will try to predict the code that a human would probably write to answer your question — and then run that code.
When it comes to data analysis, this has two major implications:
Responses to data analysis questions are often (but not always) the result of calculations, rather than a predicted sequence of words. The algorithm generates code, runs that code to calculate a result, then incorporates that result into a sentence.
Because we can see the code that performed the calculations, it is possible to check how those results were arrived at (see the sketch below).
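To make that concrete, here is a minimal sketch of the kind of Python such a tool might generate behind the scenes. The file and column names are hypothetical; the point is that the number in the answer comes from a calculation you can inspect.

```python
# A sketch of the kind of code an AI tool might generate when asked
# "what is the average fine in this data?" (hypothetical file/columns)
import pandas as pd

df = pd.read_csv("fines.csv")       # the uploaded data (hypothetical)
average = df["fine_amount"].mean()  # the actual calculation
print(f"The average fine is £{average:,.2f}")
```

If code like this picks the wrong column, or quietly ignores missing values (as `.mean()` does by default), the confident sentence built around the result will still read as fact. That is why the code needs checking.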
Data journalism projects can be broken down into individual steps, and each step brings its own challenges. To help, I’ve developed the “inverted pyramid of data journalism”. It shows how to turn an idea into a focused data story. I’ll explain, step by step, what to look out for, and give you tips on how to avoid typical stumbling blocks.
Last month the BBC’s Shared Data Unit held its annual Data and Investigative Journalism UK conference at the home of my MA in Data Journalism, Birmingham City University. Here are some of the highlights…
In many countries public data is limited, and access to data is either restricted, or information provided by the authorities is not credible. So how do you obtain data for a story? Here are some techniques used by reporters around the world.
Pinpoint creates a ranking of people, organisations and locations, showing the number of times each is mentioned in your uploaded documents.
MA Data Journalism student Tony Jarne spent eight months investigating exempt accommodation, collecting hundreds of documents, audio and video recordings along the way. To manage all this information, he turned to Google’s free tool Pinpoint. In a special guest post for OJB, he explains why it should be an essential part of any journalist’s toolkit.
The use of exempt accommodation — a type of housing for vulnerable people — has rocketed in recent years.
At the end of December, a select committee was set up in Parliament to look into the issue. The committee opened a call for written evidence with a deadline, and anyone who wished to do so could make a submission.
Organisations, local authorities and citizens submitted more than 125 pieces of written evidence to be taken into account by the committee. Some are only one page — others are 25 pages long.
In addition to the written evidence, I had various reports, news articles, Land Registry titles and company accounts downloaded from Companies House.
I needed a tool to organise all the documentation. I needed Pinpoint.
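In essence, the ranking Pinpoint produces is entity extraction plus counting. This is not Pinpoint’s internal method, just a rough sketch of the idea in Python using spaCy’s free English model, with a hypothetical file name:

```python
# A rough sketch of entity-mention ranking (not Pinpoint's internals).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
text = open("written_evidence_001.txt").read()  # hypothetical document

counts = Counter(
    (ent.label_, ent.text)
    for ent in nlp(text).ents
    if ent.label_ in {"PERSON", "ORG", "GPE"}  # people, organisations, places
)
for (label, name), n in counts.most_common(10):
    print(f"{name} ({label}): mentioned {n} times")
```

Pinpoint does this across hundreds of documents at once, which is exactly what makes it useful at the scale of a select committee inquiry.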
I am from Brazil, a country well-known for football and FIFA World Cup titles — and the host of the World Cup in 2014. Being a sceptical journalist, in 2019 I tried to discover the real impacts of that 2014 World Cup on the 213 million residents of Brazil: tracking the 121 infrastructure projects that the Brazilian government carried out for the competition and which were considered the “major social legacy” of the tournament.
In 2018 the Brazilian government had taken the website and official database on the 2014 FIFA World Cup infrastructure projects offline — so I had to make Freedom of Information (FOIA) requests to get data.
The investigation took 3 months and more than 230 FOIA requests to 33 different public bodies in Brazil. On August 23, my story was published.
Here is everything that I have learned from making those hundreds of FOIA requests:
Journalists rarely get their hands on nice, tidy data: public bodies don’t have an interest in providing information in a structured form. So it is increasingly part of a journalist’s job to get that information into the right state before extracting patterns and stories.
A few months ago I sent a Freedom of Information request asking for the locations of all litter bins in Birmingham. But instead of a spreadsheet exported directly from their database, what they sent appeared to have been converted from a multiple-page PDF.
This meant all sorts of problems, from rows containing page numbers and repeated header rows, to information split across multiple rows and even pages.
In this post I’ll be taking you through how I used the free data cleaning tool OpenRefine (formerly Google Refine) to tackle all these problems and create a clean version of the data.
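For those who prefer code to OpenRefine’s interface, the same cleanup can be sketched in pandas. This is a rough analogue under assumed column names (“Location”, “Ward”), not the steps from the post:

```python
# A rough pandas analogue of the cleanup described above; the column
# names ("Location", "Ward") are hypothetical assumptions.
import pandas as pd

df = pd.read_csv("litter_bins.csv", dtype=str)

# Drop rows that are just page numbers left behind by the PDF
# conversion, e.g. a Location cell reading "Page 3 of 58".
df = df[~df["Location"].str.match(r"Page \d+", na=False)]

# Drop the header row that repeats at the top of every PDF page.
df = df[df["Location"] != "Location"]

# Rows with a blank Ward are continuations of the row above: give
# each real record an id, then stitch the split Location text together.
record_id = df["Ward"].notna().cumsum()
df = (
    df.groupby(record_id)
      .agg({"Ward": "first", "Location": lambda s: " ".join(s.dropna())})
      .reset_index(drop=True)
)
```

OpenRefine does the same jobs interactively (facets, fill down), which makes it easier to eyeball each decision before committing to it.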