Last week I spent a day playing with the screen-scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I’d take the opportunity to explain what screen scraping is, through the functionality of Scraperwiki, in journalistic terms.
It’s pretty good.
Why screen scraping is useful for journalists
Screen scraping can cover a range of techniques, but for journalists it initially boils down to a few things:
- Getting information from somewhere
- Storing it somewhere that you can get to it later
- And in a form that makes it easy (or easier) to analyse and interrogate
So, for instance, you might use a screen scraper to gather information from a local police authority website, and store it in a lovely spreadsheet that you can then sort through, average, total up, filter and so on – when the alternative may have been to print off 80 PDFs and get out the highlighter pens, Post-Its and back-of-a-fag-packet calculations.
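Those three steps can be sketched in a few lines of Python. This is a minimal, stdlib-only illustration, not how Scraperwiki itself works: the HTML and the crime figures are made up, standing in for a page you would really fetch with `urllib`. It gets the information (parses a table), and stores it in a form you can interrogate (CSV rows).

```python
import csv
import io
from html.parser import HTMLParser

# A sample of the kind of HTML a police authority site might publish.
# Hypothetical data -- in practice you would fetch the page itself.
SAMPLE_HTML = """
<table>
  <tr><th>Area</th><th>Burglaries</th></tr>
  <tr><td>Northfield</td><td>34</td></tr>
  <tr><td>Selly Oak</td><td>21</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each table row into a list of lists."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self.current = None   # row being built
        self.in_cell = False  # are we inside a <td>/<th>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.current is not None:
            self.rows.append(self.current)
            self.current = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.current.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Store it somewhere you can get at it later -- here, CSV text
# you could save to a file or load into a spreadsheet.
out = io.StringIO()
csv.writer(out).writerows(scraper.rows)
print(out.getvalue())
```

Once the data is in that shape, the sorting, totalling and filtering happens in your spreadsheet or database rather than with highlighter pens.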
But those are just the initial aspects of screen scraping. Screen scraping tools like Scraperwiki or scripts you might write yourself offer further benefits that are also worth outlining:
- Scheduling a scraper to run at regular intervals (Adrian Holovaty compares this to making regular virtual trips to the local police station)
- Re-formatting data to clarify it, filter it, or make it compatible with other sets of data (for example, converting lat-long coordinates to postcodes, or feet to metres)
- Visualising data (for example as a chart, or on a map)
- Combining data from more than one source (for example, scraping a list of company directors and comparing that against a list of donors)
If you can think of any more, let me know.
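The last benefit above, combining data from more than one source, is often just a set operation once both lists have been scraped. A toy sketch with invented names:

```python
# Hypothetical names, standing in for two scraped lists.
directors = {"A. Smith", "B. Jones", "C. Patel"}
donors = {"C. Patel", "D. Nguyen", "A. Smith"}

# Names appearing on both lists -- the potential story.
overlap = sorted(directors & donors)
print(overlap)
```

In practice the hard part is matching names that are spelled inconsistently across sources, but the principle is the same.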
How Scraperwiki works
Scraperwiki is not the only screen scraping tool out there. In fact, you can do simple scraping with Google Spreadsheets, the OutWit Firefox extension, or Yahoo! Pipes, to name just a few. And if you’ve never done scraping before, those are probably better places to start.
Scraperwiki is probably the next step up from those – giving you extra functionality and flexibility above and beyond merely scraping data to a spreadsheet.
The catch is that you will need to understand programming – not necessarily to be able to write it from scratch, but to be able to look at code and make some educated guesses about ways to edit it to bring about a different result.
But then, as a journalist, you should be more than used to rewriting material to suit a particular objective – the skill is the same, right? Think of it as programming churnalism.
Even if you don’t understand programming, the site provides a range of tutorials to show you how it works – and it’s a good place to learn some basic programming even if you never use it to write a scraper, particularly as you can look at and adapt other scrapers, or find others to talk to about the process.
The biggest attraction for me of the site is the fact that you don’t have to fiddle around with setting up the programming environment that makes your code work – a particularly big hurdle to get over if you’re programming from scratch.
Of course, the more you understand about programming, the more you will be able to do – even to the extent of writing code from scratch. But remember that part of the skill of programming is being able to find code from elsewhere instead of having to write it all yourself. It’s about standing on the shoulders of giants as much as being a great Romantic original. Journalists could learn a lot from that ethos.
What else Scraperwiki does
If you want some data scraped, and don’t have the time or desire to learn how to write a scraper yourself, you can set a bounty for someone else to do it. You can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.
It is also a data repository – so even if you never scrape anything yourself, it’s worth subscribing to the RSS feed of the latest scrapers.
In a future post I’ll try to pick apart the code of a web scraper written in Python. But for now, if you have a free evening, have a play with the tutorials yourself.
This was a really useful post, thanks Paul. I’m going to the ScraperWiki event in Liverpool next week and really looking forward to learning lots.
Try ScrapePro Web Scraper Designer instead of the useless XPath method.
I agree, scraping data has so many uses. It’s especially useful if you know of a few places you want to keep a watch on, but only want pieces of the data. You can then merge the data from multiple places and store in a database to output later. Like you said – this type of sorting is especially useful for journalism.
@csharpp: XPath is hardly useless and actually speeds up scraping data while also allowing you many more options. If you aren’t the programming type, then maybe it isn’t “useful” to you, but it is far from “useless”.
I wrote a tutorial for Advanced Data Scraping Using cURL and XPath; you’ll see that it’s quite simple to sort multiple bits of information that can be stored in a database.
You could also try using GrabzIt’s online screen scraper (http://grabz.it/scraper/), it’s easy to use, flexible and free.