An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms.

It’s pretty good.

Why screen scraping is useful for journalists

Screen scraping can cover a range of techniques, but for journalists it initially boils down to three things:

  • Getting information from somewhere
  • Storing it somewhere that you can get to it later
  • And in a form that makes it easy (or easier) to analyse and interrogate

So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
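Those three steps – fetch, extract, store – can be sketched in a few lines of Python. This is a toy illustration using only the standard library: the table of police data below is an invented stand-in for a page you would really fetch with something like urllib.request.urlopen(), and Scraperwiki itself gives you helper libraries that do much of this for you.

```python
import csv
import io
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows
        self._row = None       # cells of the row being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Invented stand-in for a page you would fetch from a real site
html_page = """
<table>
  <tr><th>Force</th><th>Recorded crimes</th></tr>
  <tr><td>Northton</td><td>1204</td></tr>
  <tr><td>Southby</td><td>987</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(html_page)

# Store it somewhere you can interrogate later - here, CSV text
out = io.StringIO()
csv.writer(out).writerows(scraper.rows)
print(out.getvalue())
```

Once the data is in CSV (or a database), the sorting, totalling and filtering happens in your spreadsheet rather than with highlighter pens.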

But those are just the initial aspects of screen scraping. Screen scraping tools like Scraperwiki or scripts you might write yourself offer further benefits that are also worth outlining:

  • Scheduling a scraper to run at regular intervals (Adrian Holovaty compares this to making regular virtual trips to the local police station)
  • Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to metres)
  • Visualising data (for example as a chart, or on a map)
  • Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)

If you can think of any more, let me know.

How Scraperwiki works

Scraperwiki is not the only screen scraping tool out there. In fact, you can do simple scraping with Google Spreadsheets, the OutWit Firefox extension, or Yahoo! Pipes, to name just a few. And if you’ve never done scraping before, those are probably better places to start.

Scraperwiki is probably the next step up from those – giving you extra functionality and flexibility above and beyond merely scraping data to a spreadsheet.

The catch is that you will need to understand programming – not necessarily enough to write code from scratch, but enough to look at existing code and make some educated guesses about how to edit it to bring about a different result.

But then, as a journalist, you should be more than used to rewriting material to suit a particular objective – the skill is the same, right? Think of it as programming churnalism.

Even if you don’t understand programming, the site provides a range of tutorials to show you how it works. It’s also a good place to learn some basic programming even if you never write a scraper yourself, particularly as you can look at and adapt other people’s scrapers, or find others to talk to about the process.

The biggest attraction for me of the site is the fact that you don’t have to fiddle around with setting up the programming environment that makes your code work – a particularly big hurdle to get over if you’re programming from scratch.

Of course, the more you understand about programming, the more you will be able to do – even to the extent of writing code from scratch. But remember that part of the skill of programming is being able to find code from elsewhere instead of having to write it all yourself. It’s about standing on the shoulders of giants as much as being a great Romantic original. Journalists could learn a lot from that ethos.

What else Scraperwiki does

If you want some data scraped, but don’t have the time or desire to learn how to write a scraper yourself, you can set a bounty for someone else to do it. You can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

It is also a data repository – so even if you never scrape anything yourself, it’s worth subscribing to the RSS feed of the latest scrapers.

In a future post I’ll try to pick apart the code of a web scraper written in Python. But for now, if you have a free evening, have a play with the tutorials yourself.


2 thoughts on “An introduction to data scraping with Scraperwiki”

  1. Ed Walker

    This was a really useful post, thanks Paul. I’m going to the ScraperWiki event in Liverpool next week and really looking forward to learning lots.

  2. Matthew Watts

    I agree, scraping data has so many uses. It’s especially useful if you know of a few places you want to keep a watch on, but only want pieces of the data. You can then merge the data from multiple places and store it in a database to output later. Like you said – this type of sorting is especially useful for journalism.

    @csharpp: XPath is hardly useless and actually speeds up scraping data while also allowing you many more options. If you aren’t the programming type, then maybe it isn’t “useful” to you, but it is far from “useless”.

    I wrote a tutorial on advanced data scraping using cURL and XPath; you’ll see that it’s quite simple to sort multiple bits of information that can be stored in a database.