First Dabblings With Scraperwiki – All Party Groups

Over the last few months there’s been something of a roadshow making its way around the country, giving journalists et al. hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame :-( ).

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – which was subject based ones rather than country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName)

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>

<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>

– there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:

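import scraperwiki
# Assumes the BeautifulSoup 3 module, to match the findAllNext/findAll calls used below
from BeautifulSoup import BeautifulSoup

def fullscrape():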
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    # Find a piece of text that sits inside the qualifying members table
    start=soup.find(text='Main Opposition Party')
    # Climb back up the tree from that text to the table itself
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table

    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])

    if len(members)>20:
        members=members[:20]

    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record)
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:
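One way the <br /> problem might be handled is to split each cell's markup on that tag before stripping out whatever tags remain. Here's a minimal sketch using only the standard library (written for Python 3); the function name split_members and the sample cell are made up for illustration, and the real page markup may well differ:

```python
import re

def split_members(cell_html):
    """Split a chunk of cell HTML into one name per <br>-separated piece."""
    # Break the markup wherever a <br>, <br/> or <br /> tag appears
    chunks = re.split(r'<br\s*/?>', cell_html, flags=re.I)
    names = []
    for chunk in chunks:
        # Drop any remaining tags, then collapse runs of whitespace
        text = re.sub(r'<[^>]+>', '', chunk)
        text = ' '.join(text.split())
        if text:
            names.append(text)
    return names

# Hypothetical cell in which two members share a single <p> element
sample = '<p>Jane Example<br />John Sample</p>'
print(split_members(sample))
```

Each name that comes back could then go through the same party-identifier clean-up as the other members.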

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
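The spreadsheet step could also be done in code. A rough sketch (Python 3, standard library only), assuming the CSV export carries the id, mp and apg columns used in the scraper's records; the function names and sample rows are invented for illustration:

```python
import csv
import io
from collections import Counter

def member_group_edges(csv_text):
    """Return (member, group) pairs from the CSV, ignoring the id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row['mp'], row['apg']) for row in reader]

def group_counts(edges):
    """Count how many groups each member appears in (the node size)."""
    return Counter(member for member, _ in edges)

# Invented sample data in the same shape as the scraper's records
sample_csv = """id,mp,apg
Group A:Jane Example,Jane Example,Group A
Group B:Jane Example,Jane Example,Group B
Group A:John Sample,John Sample,Group A
"""

edges = member_group_edges(sample_csv)
sizes = group_counts(edges)
print(edges)
print(sizes)
```

The list of pairs is effectively an edge list, which is the sort of thing a tool like Gephi can take as input once it is written back out as a two-column CSV.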

We can also use an ego filter of depth 2 to see which people an individual is connected to by virtue of common group membership – so for example (if the scraper worked correctly, and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections
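The same co-membership view can be approximated without Gephi. A minimal sketch (Python 3), assuming a list of (member, group) pairs; the names are invented, and this counts shared groups directly rather than reproducing Gephi's ego filter:

```python
from collections import Counter, defaultdict

def shared_group_counts(edges, person):
    """Count, for every other member, the groups they share with person."""
    # Build a lookup from each group to the set of its members
    members_of = defaultdict(set)
    for member, group in edges:
        members_of[group].add(member)

    counts = Counter()
    for members in members_of.values():
        if person in members:
            for other in members:
                if other != person:
                    counts[other] += 1
    return counts

# Invented (member, group) pairs
sample_edges = [
    ('A', 'G1'), ('B', 'G1'),
    ('A', 'G2'), ('B', 'G2'), ('C', 'G2'),
    ('D', 'G3'),
]
print(shared_group_counts(sample_edges, 'A'))
```

Members who share no group with the chosen person simply don't appear in the result, which mirrors the way the ego filter drops unconnected nodes.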

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append data collected according to the schedule to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code using something like fauxscraperwiki.datastore.empty() to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are colocated in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names.

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result caching… in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)
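If you wanted to build those KML URLs in code, cache-busting trick included, something like the following might do. It's a sketch only (Python 3): the function name pipe_kml_url and the extra_zeros parameter are made up here, and it assumes a whole-number distance:

```python
from urllib.parse import urlencode

PIPE_RUN_URL = 'http://pipes.yahoo.com/pipes/pipe.run'
PIPE_ID = 'f21fb52dc7deb31f5fffc400c780c38d'

def pipe_kml_url(location, distance=1, extra_zeros=0):
    """Build the KML URL for the pipe; extra_zeros pads the distance
    (1.0, 1.00, ...) so the URL text changes but the value does not."""
    if extra_zeros:
        distance_text = '%d.%s' % (distance, '0' * extra_zeros)
    else:
        distance_text = str(distance)
    params = [
        ('_id', PIPE_ID),
        ('_render', 'kml'),
        ('distance', distance_text),
        ('location', location),
    ]
    # urlencode escapes the location string (spaces become +)
    return PIPE_RUN_URL + '?' + urlencode(params)

print(pipe_kml_url('Milton Keynes, UK'))
print(pipe_kml_url('Milton Keynes, UK', extra_zeros=2))
```

Bumping extra_zeros each time you refresh gives a different URL for what is really the same query.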

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the reoccurrences, here’s a Python recipe:

import re, simplejson, urllib

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
      if not i['text'].startswith('RT @'):
        u=i['from_user'].strip()
        if u in tweeters:
          tweeters[u]['count']+=1
        else:
          tweeters[u]={}
          tweeters[u]['count']=1
        ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
        for tag in ttags:
          if tag not in tags:
            tags[tag]=1
          else:
            tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm… people there who shouldn’t be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error….? Or is there maybe a UKOLN event on today I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old; (it would be so nice if Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
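The time window idea in point 1 might look something like this. It's a sketch (Python 3) that assumes each search result is a dict with a created_at string in the RFC 2822 style (e.g. "Mon, 15 Nov 2010 12:00:00 +0000"); the function name recent_only and the sample results are invented:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_only(results, max_age, now=None):
    """Keep only results whose created_at falls within max_age of now."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - max_age
    return [r for r in results
            if parsedate_to_datetime(r['created_at']) >= cutoff]

# Invented results: one an hour old, one two days old
sample_results = [
    {'from_user': 'fresh', 'created_at': 'Mon, 15 Nov 2010 12:00:00 +0000'},
    {'from_user': 'stale', 'created_at': 'Sat, 13 Nov 2010 12:00:00 +0000'},
]
sample_now = datetime(2010, 11, 15, 13, 0, tzinfo=timezone.utc)
print(recent_only(sample_results, timedelta(hours=2), now=sample_now))
```

Passing now explicitly is just there to make the example repeatable; left out, the filter works relative to the current time.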

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter retweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…
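A filter for dumping retweets might be sketched as below (Python 3). It leans on two assumptions: that old-style retweets start with "RT @", and that official retweets carry a retweeted_status key, which is how later versions of Twitter's API mark them and may not be present in the search results used here. The function names and sample tweets are made up:

```python
def is_retweet(tweet):
    """Guess whether a tweet dict is a retweet (see assumptions above)."""
    # Assumed marker for official retweets; may be absent from search results
    if 'retweeted_status' in tweet:
        return True
    # Old-style manual retweets
    return tweet.get('text', '').startswith('RT @')

def drop_retweets(results):
    """Return only the results that do not look like retweets."""
    return [t for t in results if not is_retweet(t)]

# Invented sample tweets
sample_tweets = [
    {'from_user': 'a', 'text': 'Lunch on campus'},
    {'from_user': 'b', 'text': 'RT @a: Lunch on campus'},
    {'from_user': 'c', 'text': 'Lunch on campus', 'retweeted_status': {}},
]
print(drop_retweets(sample_tweets))
```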

Hyperlocal voices: Will Perrin, Kings Cross Environment

hyperlocal blogger Will Perrin

Will Perrin has spoken widely about his experiences with www.kingscrossenvironment.com, a site he set up four years ago “as a desperate measure to help with local civic activism”. In the latest in the Hyperlocal Voices series, he explains how news comes far down their list of priorities, and the importance of real world networks.

Who were the people behind the blog, and what were their backgrounds?

I set it up solo in 2006, local campaigner Stephan joined late in 2006 and Sophie shortly thereafter. The three of us write regularly – me a civil servant for most of my time on the site, Sophie an actor, Stephan a retired media executive.

We had all been active in our communities for many years on a range of issues with very different perspectives. There are four or five others who contribute occasionally and a network of 20 or more folk who send us stuff for the site.

What made you decide to set up the blog?

The site was simply a tool to help co-ordinate civic action on the ground. The site was set up in 2006 as a desperate measure to help with local civic activism.

I was totally overwhelmed with reports, documents, minutes of meetings and was generating a lot of photos of broken things on the street. The council had just created a new resident-led committee for me and the burden was going to increase. Also I kept bumping into loads of other people who were active in the community but no one knew what the others were doing. I knew that the internet was a good way of organising information but wasn’t sure how to do it.

Open data in Spain – guest post by Ricard Espelt

Ahead of speaking this week in Barcelona, I spoke to a few people in Spain about the situation regarding open data in the country. One of those people is Ricard Espelt, a member of Nuestracausa, “a group of people who wanted to work on projects like MySociety [in Spain]”. The group broke up and Ricard now runs Redall Comunicacao. Among Ricard’s projects is Copons 2.0: an “approach to consensus decision making”.

This is what Ricard had to say about the problems around open data, e-democracy and bottom-up projects in Spain:

I think there are three points to bear in mind when we try to analyse how the tools are changing politics & public administration:

  • The process of the governments to review data, so it will be easier to use data for all the citizens. Open data.
  • The process of the governments to involve the citizens in the decisions. E-democracy.
  • The action of the citizens (individuals or groups) to engage other citizens to work for the community. It is a good way to lobby and to influence the decisions of the governments.

Spain, like other countries, has been developing all these points with different levels of success.

Hyperlocal voices: the Worst of Perth

Having already interviewed hyperlocal bloggers in the US and the Netherlands, this week’s Hyperlocal Voices profiles an Australian blogger: The Worst of Perth. Launched 3 years ago to criticise a local newspaper, the blog is approaching a million views this year and has an impact on the local political scene.

Who were the people behind the blog, and what were their backgrounds?

Just me. I have a background in stand-up comedy and photography amongst many things, with a bit of dabbling in graphic design and art too.

I used to work for quite a while in video production, (as well as a few occasions as best boy/lighting assistant in a tax write-off kung fu/zombie movie or two). I currently work for Curtin University and am also a student of Mandarin.

What made you decide to set up the blog?

Heh. Well, amusingly from an online journalism point of view, my very first motivation was to label a senior print journo “Australia’s worst journalist”!

Perth has a single daily newspaper, The West Australian, (circ I think about 250 000 daily) which has in many people’s opinion not been best served by being the monopoly daily provider. The paper and its journalists used to be a frequent target of TWOP, but not so much anymore.

The reason for this is at the heart of what’s happening to journalism around the world. Because The West was the only daily paper, in pre-news blog times, people used to be passionate about its faults.

Now no-one really cares how bad it is, because they can get their real news elsewhere. The paper hasn’t got any better, in fact it’s consistently worse, but the difference now is that nobody really cares that much.

Guest post – launching hyperlocal startups: Opinion 250 and Locally Informed

In a guest post for the Online Journalism Blog, Shane Redlick shares his experiences of launching two hyperlocal startups – one, launched 5 years ago, based on a traditional advertising model. The second – launched this year – seeking to innovate with a broker-based model and crowdsourcing technologies.

2005: Opinion 250 News

In 2005, I, along with 2 partners, launched the hyperlocal startup Opinion 250 News in Prince George, British Columbia (Canada). My company and I performed technical development, admin and financial tasks, while the other 2 partners (long time media industry people/semi-retired) did all the reporting and managed a small team of topical/weekly writers.

All content is original for local news. We had a lot going for us and we managed to make some good gains in the first year. To date the company is profitable and can pay modest salaries for those involved. But it has taken the better part of 4 years to reach that point.

The effect we were having locally was significant (read comments to story here, for instance). The biggest challenge for us was building monthly ad revenue.

We did not sell on CPC or CPM basis. It was a flat monthly cost. We had a couple of people selling the ads and we had quite a bit of local good will and resulting support via ads. Even with a lot going for us however, this was a big challenge. In fact in the first month, when we launched, we’d sold nearly $10,000 CAD (monthly recurring) in ads.

Help Me Investigate – anatomy of an investigation

Earlier this year Andy Brightwell and I conducted some research into one of the successful investigations on my crowdsourcing platform Help Me Investigate. I wanted to know what had made the investigation successful – and how (or if) we might replicate those conditions for other investigations.

I presented the findings (presentation embedded above) at the Journalism’s Next Top Model conference in June. This post sums up those findings.

The investigation in question was ‘What do you know about The London Weekly?‘ – an investigation into a free newspaper that was (they claimed – part of the investigation was to establish if this was a hoax) about to launch in London.

The people behind the paper had made a number of claims about planned circulation, staffing and investment that most of the media reported uncritically. Martin Stabe, James Ball and Judith Townend, however, wanted to dig deeper. So, after an exchange on Twitter, Judith logged onto Help Me Investigate and started an investigation.

A month later members of the investigation had unearthed a wealth of detail about the people behind The London Weekly and the facts behind their claims. Some of the information was reported in MediaWeek and The Media Guardian podcast Media Talk; some formed the basis for posts on James Ball’s blog, Journalism.co.uk and the Online Journalism Blog. Some has, for legal reasons, remained unpublished.

Manchester Police tweets and the MEN – local data journalism part 2

Manchester Evening News visualisation of Police incident tweets

A week ago I blogged about how the Manchester Evening News were using data visualisation to provide a deeper analysis of the local police force’s experiment in tweeting incidents for 24 hours. In that post Head of Online Content Paul Gallagher said he thought the real benefit would “come afterwards when we can also plot the data over time”.

Now that data has been plotted, and you can see the results here.

In addition, you can filter the results by area, type (crime or ‘social work’) and category (specific sort of crime or social issue). To give the technical background: Carl Johnstone put the data into a mysql database, wrote some code in Perl for the filters and used a Flash applet for the graphs.

Creating an emergency notification system in 15 hours

I’ve written a post on the Scraperwiki blog about a hackathon I attended where a small group of developers and people with experience of crowdsourcing in emergencies created a fantastic tool to inform populations in an emergency.

The primary application is non-journalistic, but the subject matter has obvious journalistic potential for any event that requires exchanges of information. Here are just some that spring to mind:

  • A protest where protestors and local residents can find out where it is at that moment and what streets are closed.
  • A football match with potential for violence (e.g. a local derby) where supporters can be alerted to any trouble and what routes to use to avoid it.
  • A music festival where you could text the name of the bands you want to see and receive alerts of scheduled appearances and any delays
  • A conference where you could receive all the above – as well as text updates on presentations that you’re missing (taken from hashtagged tweets, even)

There are obvious commercial applications for some of the above too – you might have to register your mobile ahead of the event and pay a fee to ensure you receive the texts.

Not bad for 15 hours’ work.

You can read the blog post in full here.

A template for '100 percent reporting'

progress bar for 100 percent reporting

Last night Jay Rosen blogged about a wonderful framework for networked journalism – what he calls the ‘100 percent solution‘:

“First, you set a goal to cover 100 percent of… well, of something. In trying to reach the goal you immediately run into problems. To solve those problems you often have to improvise or innovate. And that’s the payoff, even if you don’t meet your goal”

In the first example, he mentions a spreadsheet. So I thought I’d create a template for that spreadsheet that tells you just how far you are in achieving your 100% goal, makes it easier to organise newsgathering across a network of actors, and introduces game mechanics to make the process more pleasurable.