Archive for the scraping Tag

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpages: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:
http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521
http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629
http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823
In this instance, you can see that
Read more…
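
The URL pattern in the excerpt above can also be sketched outside Google Refine. What follows is a minimal Python illustration, not the method from the full post: it assumes you already have a list of school IDs, and the function names build_school_urls and extract_school_id are hypothetical helpers invented for this example. It only builds and parses the URLs; it does not fetch any pages.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address taken from the example URLs in the excerpt above.
BASE_URL = (
    "http://www.ltscotland.org.uk/scottishschoolsonline/"
    "schools/freemealentitlement.asp"
)


def build_school_urls(school_ids, base_url=BASE_URL):
    """Return one URL per school ID by filling in the iSchoolID parameter.

    Hypothetical helper: the name and signature are assumptions for
    illustration only.
    """
    return [f"{base_url}?{urlencode({'iSchoolID': sid})}" for sid in school_ids]


def extract_school_id(url):
    """Pull the iSchoolID value back out of a URL (returned as a string)."""
    return parse_qs(urlsplit(url).query)["iSchoolID"][0]


# The three IDs that appear in the excerpt.
school_ids = [5237521, 5237629, 5237823]
urls = build_school_urls(school_ids)

for url in urls:
    print(url)
```

Because only the iSchoolID value changes from page to page, a list of IDs is enough to generate every address you want to scrape.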

Scraping data from a list of webpages using Google Docs

Quite often when you’re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one – or even scrape the data individually – would take time. Here I explain a way to use Google Docs to grab the data for
Read more…

SFTW: Asking questions of a webpage – and finding out when those answers change

Previously I wrote on how to use the =importXML formula in Google Docs to pull information from an XML page into a conventional spreadsheet. In this Something For The Weekend post I’ll show how to take that formula further to grab information from webpages – and get updates when that information changes. Asking questions of a webpage – or find
Read more…

All the news that’s fit to scrape

There have been quite a few scraping-related stories that I’ve been meaning to blog about – so many I’ve decided to write a round up instead. It demonstrates the increasing role that scraping is playing in journalism – and the possibilities for those who don’t yet know about them:
Scraping company information
Chris Taggart explains how he built a database of
Read more…

Solving buggy behaviour when scraping data into Google spreadsheets

Tony Hirst has identified some bugs in the way Google spreadsheets ‘scrapes’ tables from other sources, in particular when the original data is of mixed types (e.g. numbers and text). The solution is summed up as follows: “When using the =QUERY() formula, make sure that you’re importing data of the same datatype in each cell; and when using the =ImportData() formula, cast the
Read more…
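
The first half of that advice, keeping one datatype per column, can be illustrated in Python rather than in a spreadsheet formula. This is only an analogy under stated assumptions: it is not Tony Hirst's solution or anything Google spreadsheets does internally, the sample values are made up, and cast_column_to_text is a hypothetical helper name.

```python
def cast_column_to_text(values):
    """Convert every cell in a column to a string so the column has one type.

    Hypothetical helper for illustration; empty cells (None) become "".
    """
    return ["" if value is None else str(value) for value in values]


# A made-up column mixing numbers, text and an empty cell.
mixed_column = [12, "n/a", 7.5, None]
uniform_column = cast_column_to_text(mixed_column)

print(uniform_column)
```

Casting everything to text is the bluntest option: it avoids mixed types, but the numbers then have to be converted back before doing any arithmetic on them.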

An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms. It’s pretty good.

Data journalism pt2: Interrogating data

This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid. UPDATE: It has now been published in The Online Journalism Handbook. “One of the most important (and least technical)
Read more…