Managing a mass FOI project? Here’s an AI-assisted methodology for that

Sending FOIs to multiple bodies across the country to get the big picture on an issue sounds like a great idea — until the responses start to trickle in. Differences between responses often make mass FOI projects extremely time-consuming as you try to get everything into a format that allows you to ask journalistic questions and compare different authorities. Can AI help?

On one recent project I decided to put together a methodology that made the process less stressful, faster and more accurate. Here’s how it works.

Audit & prioritise

Data structure

Extract & reshape

Check & verify

Combine

Audit responses to identify the level of detail in each response and identify edge cases. Include a caveats column.
Augment manual audit with NotebookLM audit.
Identify a priority order for data, e.g. totals by outcome, hospital, category or year where these are provided separately


Design a data structure that can accommodate all responses
Structure should follow ‘tidy’ data principles, i.e. one row per combination of features (force, category, hospital, outcome, year)
Structure should include source details, e.g. filename, sheet name, name of person entering data
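A minimal sketch of what such a structure could look like in Python, using only the standard library. The column names are assumptions drawn from the features and source details listed above, and `write_tidy_rows` is a hypothetical helper rather than part of the original methodology. The example row is invented purely for illustration.

```python
import csv
import io

# Hypothetical column names: the features and source details listed above,
# plus a single "value" column holding the figure itself.
TIDY_COLUMNS = [
    "force", "category", "hospital", "outcome", "year", "value",
    "source_file", "source_sheet", "entered_by",
]

def write_tidy_rows(rows, handle):
    """Write dict rows to CSV, rejecting rows with missing or unexpected columns."""
    # DictWriter raises ValueError by default if a row has extra keys.
    writer = csv.DictWriter(handle, fieldnames=TIDY_COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = set(TIDY_COLUMNS) - set(row)
        if missing:
            raise ValueError(f"Row is missing columns: {sorted(missing)}")
        writer.writerow(row)

# Invented example row: one row per combination of features.
example_row = {
    "force": "Force A", "category": "Category 1", "hospital": "Hospital A",
    "outcome": "Outcome X", "year": "2022", "value": "12",
    "source_file": "response_01.xlsx", "source_sheet": "Sheet1",
    "entered_by": "Reporter 1",
}

buffer = io.StringIO()
write_tidy_rows([example_row], buffer)
```

Fixing the column list in one place means every response ends up with the same headers, which makes the later combining step much simpler.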


PDFs: use Tabula or vibe coding (design a prompt template to generate code to attempt to extract data)
Multi-sheet XLS files: use OpenRefine to import and combine sheets
Design a prompt template for generating code to reshape CSV responses
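As an illustration of the kind of code a reshaping prompt might produce, here is a rough sketch that turns a "wide" CSV (one column per year) into tidy rows. The function name `wide_to_tidy`, its parameters and the sample data are all assumptions made for this example. Real responses will need their own column lists.

```python
import csv
import io

def wide_to_tidy(handle, id_columns, variable_name, source_file):
    """Turn each non-identifier column of a wide CSV into its own row."""
    reader = csv.DictReader(handle)
    tidy = []
    for row in reader:
        for column, value in row.items():
            if column in id_columns:
                continue
            # Copy the identifying columns, then record which column the value came from.
            tidy_row = {name: row[name] for name in id_columns}
            tidy_row[variable_name] = column
            tidy_row["value"] = value
            tidy_row["source_file"] = source_file
            tidy.append(tidy_row)
    return tidy

# Invented sample response with one column per year.
wide_csv = io.StringIO("hospital,2021,2022\nHospital A,10,12\nHospital B,7,9\n")
tidy_rows = wide_to_tidy(wide_csv, ["hospital"], "year", "response_01.csv")
```

Each of the two sample rows becomes two tidy rows, one per year, with the source filename carried along so every figure can be traced back to its response.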


Manual checks (e.g. compare entries, check page-ending rows)
Analysis-based checks (e.g. pivots, totals)
AI-based checks using a prompt template (e.g. compare files)
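One way an analysis-based check on totals could be sketched: add up the extracted rows for each body and compare the result with the total stated in the response. The function name, the `hospital` grouping key and the figures below are hypothetical. This only flags a mismatch and does not say which side is wrong.

```python
from collections import defaultdict

def find_total_mismatches(rows, stated_totals, group_key="hospital"):
    """Return {group: (extracted_sum, stated_total)} wherever the two differ."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[group_key]] += int(row["value"])
    return {
        group: (sums.get(group, 0), stated)
        for group, stated in stated_totals.items()
        if sums.get(group, 0) != stated
    }

# Invented tidy rows and invented totals as stated in the responses.
rows = [
    {"hospital": "Hospital A", "year": "2021", "value": "10"},
    {"hospital": "Hospital A", "year": "2022", "value": "12"},
    {"hospital": "Hospital B", "year": "2021", "value": "7"},
    {"hospital": "Hospital B", "year": "2022", "value": "9"},
]
stated_totals = {"Hospital A": 22, "Hospital B": 17}

mismatches = find_total_mismatches(rows, stated_totals)
```

Here only Hospital B is flagged, because its extracted rows sum to 16 against a stated total of 17. That is the cue to go back to the original document, for example to check for a row lost at a page ending.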


Use OpenRefine, or design a prompt template for generating code to combine the resulting CSV files.
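For the code route, a minimal sketch of a combining script might look like the following. It assumes every CSV has already been reshaped to the same columns, and it stops with an error if one has not. The function name `combine_csvs`, the `source_file` column and the file names are assumptions for illustration, and the sample files are created in a temporary folder only so that the example is self-contained.

```python
import csv
import tempfile
from pathlib import Path

def combine_csvs(paths, output_path):
    """Stack CSV files with identical headers into one file, adding the source filename."""
    fieldnames = None
    writer = None
    row_count = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in sorted(paths):
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.DictReader(handle)
                if writer is None:
                    # The first file sets the expected columns.
                    fieldnames = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=fieldnames + ["source_file"])
                    writer.writeheader()
                elif reader.fieldnames != fieldnames:
                    raise ValueError(f"{path} has different columns: {reader.fieldnames}")
                for row in reader:
                    row["source_file"] = Path(path).name
                    writer.writerow(row)
                    row_count += 1
    return row_count

# Invented sample files, written to a temporary folder for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    (folder / "force_a.csv").write_text("force,year,value\nForce A,2022,5\n", encoding="utf-8")
    (folder / "force_b.csv").write_text("force,year,value\nForce B,2022,8\n", encoding="utf-8")
    combined_path = folder / "combined.csv"
    total_rows = combine_csvs(folder.glob("force_*.csv"), combined_path)
    combined_text = combined_path.read_text(encoding="utf-8")
```

Failing loudly on mismatched headers is a deliberate choice here: a response that slipped through reshaping with different columns is better caught at this point than discovered during analysis.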