
In the fourth of a series of posts from a workshop at the Centre for Investigative Journalism Summer School (the first part covered idea generation; the second research; the third spreadsheets), I look at using generative AI tools such as ChatGPT and Google Gemini to help with scraping.
One of the most common reasons a journalist might need to learn to code is scraping: compiling information from across multiple webpages, or from one page across a period of time.
But scraping is tricky: it requires time learning some coding basics, and then further time learning how to tackle the particular problems that a specific scraping task involves. If the scraping challenge is anything but simple, you will need help to overcome trickier obstacles.
Large language models (LLMs) like ChatGPT are especially good at providing this help because writing code is a language challenge, and writing about code makes up a significant share of the material that these models have been trained on.
This can make a big difference in learning to code: in the first year that I incorporated ChatGPT into my data journalism Master's at Birmingham City University, I noticed that students were able to write more advanced scrapers earlier than previous cohorts, and that they were less likely to abandon their attempts at coding.
You can also start scraping pretty quickly with the right prompts (Google Colab allows you to run Python code within Google Drive). Here are some tips on how to do so…
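To give a flavour of the kind of code involved, here is a minimal sketch of a scraper that could be pasted into a Colab cell. It is an illustration rather than a recommended approach: the sample HTML, the `LinkCollector` class and the `scrape_links` function are all hypothetical names made up for this example, and it uses only Python's built-in `html.parser` module so that it runs without installing anything. A real scraper would download the page from a URL first, and many tutorials use third-party libraries for that instead.

```python
from html.parser import HTMLParser

# Hypothetical sample page, standing in for HTML downloaded from a real URL.
SAMPLE_HTML = """
<html><body>
  <h1>Council meetings</h1>
  <ul>
    <li><a href="/meetings/2024-01">January 2024</a></li>
    <li><a href="/meetings/2024-02">February 2024</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # pieces of text found inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape_links(html):
    """Return a list of (text, href) tuples for the links in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = scrape_links(SAMPLE_HTML)
for text, href in links:
    print(text, href)
```

Compiling information "across multiple webpages" is then a matter of repeating this step for each page and gathering the results into one list, which is exactly the sort of extension an LLM can help draft.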