This was originally published on Journalism.co.uk – cross-posted here for convenience.
Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.
Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).
In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.
ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports. In Tampa Bay they run an automatically updated page on mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”
The problem is that most people imagine that you need to learn a programming language to start scraping – but that’s not true. It can help – especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.
I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).
And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.
What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerfulscraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function) then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub could scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, then you can adapt someone else’s scraper.
I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.