Tagged: tony hirst

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google visualisation? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah - forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such 4,500-6,000) in the Fees column and replacing them with minumum fee values. Sorting the fees column as a numerical type with errors at the top made the columns that needed tweaking easy to find:

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?) so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs); rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0].geometry.location.lat).

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for you course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can;t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE - apparently, HESA codes are in the spreadsheet.... ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)


Getting Started With Local Council Spending Data

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB large, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and “clumped” transaction dates use? Also note that each row, column and cell is commentable upon

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this maybe something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and released the OnTheWIght Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
- compare the offerings with each other in more detail, e.g. the columns each view provides;
- compare the offerings with the guidance on release of council spending data;
- see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
- explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard


Where do I get that data? New Q&A site launched

Get the Data

Well here’s another gap in the data journalism process ever-so-slightly plugged: Tony Hirst blogs about a new Q&A site that Rufus Pollock has built. Get the Data allows you to “ask your data related questions, including, but not limited to, the following:

  • “where to find data relating to a particular issue;
  • “how to query Linked Data sources to get just the data set you require;
  • “what tools to use to explore a data set in a visual way;
  • “how to cleanse data or get it into a format you can work with using third party visualisation or analysis tools.”

As Tony explains (the site came out of a conversation between him and Rufus):

“In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of.”

He also invites people to help populate the site:

“If you publish data via some sort of API or queryable interface, why not considering posting self-answered questions using examples from your FAQ?

“If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.”

Off you go then.

Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers

Where can I find a list of hospitals in the UK along with their location data? Or historical weather data for the UK? Or how do I find the county from a postcode, or a book title from its ISBN? And is there any way you can give me RDF Linked Data in a format I can actually use?!

With increasing amounts of data available, it can still be hard to:

- find the data you you want;
- query a datasource to return just the data you want;
- get the data from a datasource in a particular format;
- convert data from one format to another (Excel to RDF, for example, or CSV to JSON);
- get data into a representation that means it can be easily visualised using a pre-existing tool.

In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of. Or maybe you know the data is in Linked Data store on data.gov.uk, but you just can’t figure how to get it out?

This is where GetTheData.org comes in. Get The Data arose out of a conversation between myself and Rufus Pollock at the end of last year, which resulted with Rufus setting up the site now known as getTheData.org.

getTheData.org

The idea behind the site is to field questions and answers relating to the practicalities of working with public open data: from discovering data sets, to combining data from different sources in appropriate ways, getting data into formats you can happily work with, or that will play nicely with visualisation or analysis tools you already have, and so on.

At the moment, the site is in its startup/bootstrapping phase, although there is already some handy information up there. What we need now are your questions and answers…

So, if you publish data via some sort of API or queryable interface, why not considering posting self-answered questions using examples from your FAQ?

If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

If you’re looking for data as part of a research project, but can’t find it or can’t get it in an appropriate form that lets you link it to another data set, post a question to GetTheData.

If you want to do some graphical analysis on a data set, but don’t know what tool to use, or how to get the data in the right format for a particular tool, that’d be a good question to ask too.

Which is to say: if you want to GetTheData, but can’t do so for whatever reason, just ask… GetTheData.org


Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,0000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put some of query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, a reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and use these feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datstrore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why its interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)


Making magazine awards more user-friendly

Given I’ve already linked to Tony Hirst twice this week I thought I’d make it a hat-trick. Last month Tony wrote two blog posts which I thought were particularly instructive for magazine publishers organising blog awards.

In the first post Tony complained after seeing Computer Weekly’s shortlist:

“Why, oh why, don’t publishers of blog award nomination lists see them as potentially useful collections on a particular subject that can be put to work for the benefit of that community?

“… There are umpteen categories – each category has it’s own web page – and umpteen nominations per award. To my mind, lists of nominations for an award are lists of items on a related topic. Where the items relate to blogs, presumably with an RSS feed associated with each, the lists should be published as an OPML file, so you can at-a-click subscribe to all the blogs on a list in a reader such as Google Reader, or via a dashboard such as netvibes. Where there are multiple awards, I’d provide an OPML file for each award, and a meta-bundle that collects nominations for all the awards together in a single OPML file, though with each category in its own nested outline element.”

I’d suggest something even more simple: an aggregator widget pulling together the RSS feeds for each category, or a new Twitter account, or a Google Reader bundle.

In a second post the following day Tony finds a further way to extract value from the list: use Google Custom Search to create a custom search engine limited to those sites you have shortlisted as award-worthy. His post explains exactly how to do that.

Tony’s approach demonstrates the difference between story-centred and data-centred approaches to journalism. Computer Weekly are approaching the awards as a story (largely because of limitations of platform and skills – see comments), with the ultimate ending ‘Blog publisher wins award’. Tony, however, is looking at the resources being gathered along the way: a list of blogs, each of which has an RSS feed, and each of which will be useful to readers and journalists. Both are valid, but ignoring either is to miss something valuable in your journalism.

Using Yahoo! Clues to target your headlines by demographic

Yahoo! Search Clues - Emma Watson hair

Tony Hirst points my attention (again) to Yahoo! Clues, a tool that, like Google’s Insights For Search, allows you to see what search terms are most popular. However, unlike Insights, Yahoo! Clues gives much deeper demographic information about who is searching for particular terms.

Tony’s interest is in how libraries might use it. I’m obviously interested in the publishing side – and search engine optimisation (SEO). And here’s where the tool is really interesting.

Until now SEO has generally taken a broad brush approach. You use tools like Insights to get an idea – based on the subject of your journalism – of what terms people are using, related terms, and rising terms. But what if your publication is specifically aimed at women – or men? Or under-25s? Or over-40s? Or the wealthy?

With Yahoo! Clues, if the search term is popular enough you can drill down to those groups with a bit more accuracy (US-only at the moment, though). Taking “Emma Watson haircut”, for example, you can see that a girls’ magazine and one aimed at boys may take different SEO approaches based on what they find from Yahoo! Clues.

Apart from anything else, it demonstrates just what an immature discipline web writing and SEO is. As more and more user data is available, processed at faster speeds, we should see this area develop considerably in the next decade.

UPDATE: After reading this post, Tony has written a follow-up post on other tools for seeing demographics around search behaviour.

Yahoo! Search Clues - Emma Watson haircut - oops/katie leung

Solving buggy behaviour when scraping data into Google spreadsheets

Tony Hirst has identified some bugs in the way Google spreadsheets ‘scrapes’ tables from other sources. In particular, when the original data is of mixed types (e.g. numbers and text). The solution is summed up as follows:

“When using the =QUERY() formula, make sure that you’re importing data of the same datatype in each cell; and when using the =ImportData()formula, cast the type of the columns yourself… (I’m assuming this persists, and doesn’t get reset each time the spreadsheet resynchs the imported data from the original URL?)”

First Dabblings With Scraperwiki – All Party Groups

Over the last few months there’s been something of a roadshow making its way around the country giving journalists, et al. hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is shame:-(

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Escel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
- grab the list of links to the All Party Groups I was interested in – which was subject based ones rather than country groups;
- for each group, grab it’s individual record page and extract the list of 20 qualifying members
- add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName)

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
- the links to the country groups and the subject groups look the same:
<p xmlns=”http://www.w3.org/1999/xhtml” class=”contentsLink”>
<a href=”zimbabwe.htm”>Zimbabwe</a>
</p>

<p xmlns=”http://www.w3.org/1999/xhtml” class=”contentsLink”>
<a href=”accident-prevention.htm”>Accident Prevention</a>
</p>

- there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns=”http://www.w3.org/1999/xhtml”>Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
- starting my scrape proper after the Section 2 header:

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to <em>Subject Groups</em>, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table

    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])

    if len(members)>20:
        members=members[:20]

    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record)
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)

We can also use an ego filter depth 2 to see which people an individual is connected to by virtue of common group membership – so for example (if the scraper worked correctly (and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didnlt appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with…(but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay to much happening to keep up (and try out) with at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append data collected according to the schedule to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code using something like fauxscraperwiki.datastore.empty() to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?


Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are colocated in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I've tweaked the pipe to make it a little more robust...]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result cacheing.. in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the reoccurrences, here’s a Python recipe:

import simplejson, urllib

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      u=i['from_user'].strip()
      if u in tweeters:
        tweeters[u]['count']+=1
      else:
        tweeters[u]={}
        tweeters[u]['count']=1
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      for tag in ttags:
        if tag not in tags:
    	  tags[tag]=1
    	else:
    	  tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,searchTerm)
'''

What this code does is:
- use Yahoo placemaker to geocode the address provided;
- search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
- identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
- identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U.. Hmm,,, people there who shouldnlt be?!

And wondering why the likes @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error….? Or is there maybe a UKOLN event on today I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there arenlt many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old; (it would be so nice if Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter reweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…