Over the last few months there’s been something of a roadshow making its way around the country giving journalists, et al., hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame:-(
So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.
And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.
Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.
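To give a flavour of the pattern before diving into the real thing, here’s a minimal sketch of what a Scraperwiki Python scraper looks like – fetch a page with scraperwiki.scrape(), parse it with BeautifulSoup, save records to the datastore. (The URL and record fields below are just placeholders, not anything from the actual scraper.)
import scraperwiki
from BeautifulSoup import BeautifulSoup

# Placeholder URL - point this at whatever page you want to scrape
html = scraperwiki.scrape('http://www.example.com/somepage.htm')
soup = BeautifulSoup(html)

# Grab every link on the page and pop it into the datastore
for link in soup.findAll('a'):
    if link.get('href'):
        record = {"id": link['href'], "linktext": link.text}
        # "id" is declared as the unique key for the record
        scraperwiki.datastore.save(["id"], record)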
The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

Secondly, pages for each group, which are published according to a common template:

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – which were the subject-based ones rather than the country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the Scraperwiki datastore of the form (uniqueID, memberName, groupName)
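That last step maps directly onto the datastore call used later in the scraper – each record is just a Python dict saved against a declared unique key. Something like this (the member name here is made up purely for illustration):
record = {"id": "Accident Prevention:Joe Bloggs MP",   # uniqueID
          "mp": "Joe Bloggs MP",                       # memberName
          "apg": "Accident Prevention"}                # groupName
scraperwiki.datastore.save(["id"], record)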
So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>
…
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>
– there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>
Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:
import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)
    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start = soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags with class "contentsLink"
    links = start.findAllNext('p', "contentsLink")
    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text, link.a['href']
        apgPageScrape(link.a.text, link.a['href'])
So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.
This is the bit of the page we want to scrape:

The 20 qualifying members’ names are actually contained in a single table row:

def apgPageScrape(apg, page):
    print "Trying", apg
    url = "http://www.publications.parliament.uk/pa/cm/cmallparty/register/" + page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    # Get into the table
    start = soup.find(text='Main Opposition Party')
    # Get to the table
    table = start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table = table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table = table.parent.parent.parent
    print table

    lines = table.findAll('p')
    members = []

    for line in lines:
        if not line.get('style'):
            m = line.text.encode('utf-8')
            m = m.strip()
            # Strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m = m.replace('-', '–')
            m = m.split('–')
            # I was getting unicode errors on apostrophe-like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0] != '' and len(m[0].split()) > 1:
                print '...' + m[0] + '++++'
                members.append(m[0])

    if len(members) > 20:
        members = members[:20]

    for m in members:
        #print m
        record = {"id": apg + ":" + m, "mp": m, "apg": apg}
        scraperwiki.datastore.save(["id"], record)

    print "....done", apg

# Kick off the scrape
fullscrape()
So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

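As an aside, one way of tackling the <br /> problem mentioned above might be to split each p element on its br tags before extracting the text. This is just an untested sketch (split_on_br is my own made-up helper name, working on the same BeautifulSoup Tag objects the scraper already handles):
from BeautifulSoup import Tag

def split_on_br(p):
    # Split the contents of a p element into separate strings,
    # treating each <br /> as a separator between names
    names, current = [], ''
    for item in p.contents:
        if isinstance(item, Tag) and item.name == 'br':
            if current.strip():
                names.append(current.strip())
            current = ''
        else:
            # NavigableStrings contribute themselves, inline tags their text
            current += unicode(item) if isinstance(item, basestring) else item.text
    if current.strip():
        names.append(current.strip())
    return names
Each name coming out of that could then be fed through the same party-stripping code as before.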
That’s the first step then – scraping the data… But so what?
My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
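If you’d rather avoid the manual spreadsheet step, a few lines of Python can turn the CSV output into a simple two-column edge list that Gephi imports directly. (The CSV URL below is a placeholder, and I’m assuming the exported column names match the datastore fields, mp and apg.)
import csv
import urllib

# Placeholder - substitute the CSV export URL for your own scraper
CSV_URL = 'http://scraperwiki.com/path/to/your/scraper/csv'

reader = csv.DictReader(urllib.urlopen(CSV_URL))
writer = csv.writer(open('apg_edges.csv', 'wb'))
writer.writerow(['Source', 'Target'])  # headers Gephi recognises for an edge list
for row in reader:
    # One edge per record: member name to group name (the unique id column is simply ignored)
    writer.writerow([row['mp'], row['apg']])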
We can also use an ego filter, depth 2, to see which people an individual is connected to by virtue of common group membership – so, for example (assuming the scraper worked correctly, and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of groups each member has in common with John Stevenson):

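Gephi isn’t the only way to play with ego networks, of course; just as a sketch, the same sort of filter can be run locally with the networkx library (again, the CSV URL is a placeholder, and the exact name string for a member will depend on how it was stored by the scraper):
import csv
import urllib
import networkx as nx

# Placeholder - the CSV export URL for the scraper's datastore
CSV_URL = 'http://scraperwiki.com/path/to/your/scraper/csv'

# Build a bipartite member-group graph from the scraped records
G = nx.Graph()
for row in csv.DictReader(urllib.urlopen(CSV_URL)):
    G.add_edge(row['mp'], row['apg'])

# A radius 2 ego network around a member pulls in their groups (radius 1)
# and everyone else in those groups (radius 2) - the same idea as Gephi's
# ego filter at depth 2
ego = nx.ego_graph(G, 'John Stevenson', radius=2)
print len(ego), 'nodes in the ego network'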
So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).
I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….
PS Here are some immediate thoughts on “nice to haves”… Scheduled runs currently seem to append the newly collected data to the original database, but sometimes you may want to overwrite the database instead? (This may be possible from the programme code, using something like a hypothetical scraperwiki.datastore.empty() call to empty the database before running the rest of the script?) Adding support for YQL queries, by adding e.g. Python-YQL to the supported libraries, might also be handy?
