An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms.

It’s pretty good.

Why screen scraping is useful for journalists

Screen scraping can cover a range of things but for journalists it, initially, boils down to a few things:

Getting information from somewhere
Storing it somewhere that you can get to it later
And in a form that makes it easy (or easier) to analyse and interrogate

So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.

But those are just the initial aspects of screen scraping. Screen scraping tools like Scraperwiki or scripts you might write yourself offer further benefits that are also worth outlining:

Scheduling a scraper to run at regular intervals (Adrian Holovaty compares this to making regular virtual trips to the local police station)
Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to metres)
Visualising data (for example as a chart, or on a map)
Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)

If you can think of any more, let me know.

How Scraperwiki works

Scraperwiki is not the only screen scraping tool out there. In fact, you can do simple scraping with Google Spreadsheets, the OutWit Firefox extension, or Yahoo! Pipes, to name just a few. And if you’ve never done scraping before, those are probably better places to start.

Scraperwiki is probably the next step up from those – giving you extra functionality and flexibility above and beyond merely scraping data to a spreadsheet.

The catch is that you will need to understand programming – not necessarily to be able to write it from scratch, but to be able to look at programming and make some educated guesses about ways to edit it to bring about a different result.

But then, as a journalist, you should be more than used to rewriting material to suit a particular objective – the skill is the same, right? Think of it as programming churnalism.

Even if you don’t understand programming, the site provides a range of tutorials to show you how it works – and it’s a good place to learn some basic programming even if you never use it to write a scraper, particularly as you can look at and adapt other scrapers, or find others to talk to about the process.

The biggest attraction for me of the site is the fact that you don’t have to fiddle around with setting up the programming environment that makes your code work – a particularly big hurdle to get over if you’re programming from scratch.

Of course, the more you understand about programming, the more you will be able to do – even to the extent of writing code from scratch. But remember that part of the skill of programming is being able to find code from elsewhere instead of having to write it all yourself. It’s about standing on the shoulders of giants as much as being a great Romantic original. Journalists could learn a lot from that ethos.

What else Scraperwiki does

If you want some data scraped, and don’t have the time or desire to learn how to write one, you can set a bounty for someone else to do it. You can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

It is also a data repository – so even if you never scrape anything yourself, it’s worth subscribing to the RSS feed of the latest scrapers.

In a future post I’ll try to pick apart the code of a web scraper written in Python. But for now, if you have a free evening, have a play with the tutorials yourself.