Monthly Archives: October 2010

John Rentoul, Media Oops Number 1 : You cannot close the door once a blog post has bolted

John Rentoul of the Independent has the blog with the longest running single-blog meme in the known world. “Questions to which the answer is no” is now up to number 411 (“Will Barclays carry out its threat to leave UK?”).

I can’t compete with that, so I thought I’d start a list of Media Oops-es, i.e., cockups. This is all in the interest of media transparency, you understand. Shooting from the hip is just as big a problem for blogging journalists as it is for rednecks and Harriet Harman – though I suspect her invective was planned.

(Update: since this is about educating student journalists, I thought I would cross-post to the Online Journalism Blog in addition to the Wardman Wire).

The first one comes via Justin McKeating, who’s doing something slightly similar, though I suspect we’ll be tracking different bits of media silliness.

Rentoul came up with a slightly unflattering comparison:

A friend draws my attention to a resemblance I had not noticed.

Ed Miliband, he says, reminds him of Watto, the hovering, scuzzy garage owner on Tatooine who enslaves little boys in Star Wars Episode I: The Phantom Menace, my favourite film of the six.

Miliband spoke in his speech to Labour conference of his being compared to “Wallace out of Wallace and Gromit” – although he departed from the text issued, “I can see the resemblance”, to say: “I gather some people can see the resemblance.”

But I thought he looked more like Gromit – the dog who is cleverer than his master who expresses himself mainly by his eyebrows.

If he’d just left it there none of us would have made a fuss. But he thought better of it and deleted the piece. As Justin says:

It looks like the mighty John Rentoul thought better of comparing Ed Miliband to the Watto character from The Phantom Menace and pulled the post without comment. You now get a ‘page not found’ error when you click on the link. Particularly piquant was when Rentoul noted Watto is ‘scuzzy’ and ‘enslaves little boys’. And he deleted his tweet advertising his insightful blog post (we know it was there because somebody replied to it). What a shame, denying future students of journalism this exemplary example of the craft.

Who am I to deny an education to students of journalism? I love computer networks with memories; and also search engines with caches.


For the record, here’s the Milliman, who Rentoul (and everybody else) has previously compared to a panda:


The best bit is that the next Rentoul blog post was all about “tasteless metaphors”.

Pot. Kettle. White and black.



Do bloggers devalue journalism?

Science journalist Angela Saini has written an interesting post on ‘devaluing journalism’ that I felt I had to respond to. “The profession [of journalism] is being devalued,” she argues.

“Firstly, by magazines and newspapers that are turning to bloggers for content instead of experienced journalists. And secondly, by people who are willing to work for free or for very little (interns, bloggers, cut-price freelancers). Now this is fine if you’re just running your own site in your spare time, but the media is always going to suffer if journalists don’t demand fair pay for doing real stories. Editors will get away with undercutting their writers. Plus, they’ll be much keener to employ legions of churnalists on the cheap. In the long-run, the quality of stories will fall.”

Firstly let me say that I broadly agree with most of what Angela is saying: that full time journalists offer something that other participants in journalism do not; and that publishers and editors see interns and bloggers as sources of cheap content. I also strongly support interns being paid.

But I think Angela mixes economic value with editorial value, and that undermines the general thrust of the argument.

What reduces the value of something economically? Angela’s argument seems to rest on the idea of increased supply. And indeed, entry-level journalism wages have been consistently depressed partly as a result of increasing numbers of people who want to be journalists and who will work for free, or for low wages – but also partly because of the demands of and pressures on the industry itself.

UPDATE: Ben Mazzotta fleshes out the subtleties of the economics above nicely, although I think he misinterprets the point I’m trying to make.

“Although entry-level journalists are badly paid, that doesn’t necessarily have anything to do with the economics of Nicholas Kristof’s salary. Kristof’s pay goes up or down based on what the papers can afford, which is driven by subscriptions and advertising. In fact, the more liars and bad writers are out there with healthy audiences, the bigger is the pie for the best journalists to fight over. Effectively, that’s just a few million more hacks that Kristof is better than. The best columns in journalism are a classic positional good: their worth is determined by how much better they are than their competitors.”

Editorial value, not economic

Angela’s point, however, is not about the economic value of professional journalism but the editorial value – the quality, not the quantity.

There’s an obvious link between the two. Pay people very little, and they won’t stick around to become better reporters (witness how many journalists leave the profession for PR as soon as they have families to feed). Rely on interns and you not only have a more unskilled workforce but the skilled part of your workforce has to spend part of its time doing informal ‘training’ of those interns.

So where do bloggers come in? Angela mentions them in two senses: firstly as being chosen over experienced journalists, and second as part of a list of people willing to work for little or for free.

‘Blogger’ is meaningless

But, unlike the labels ‘intern’ and ‘freelance journalist’, ‘blogger’ is a definition by platform not by occupation, and takes in a vast range of people, some of whom are very experienced journalists themselves (with high rates), and some of whom have more specialist expertise than journalists. It also includes aspiring journalists and “cut price freelancers”.

Does their existence ‘devalue’ journalism? Economically, you might argue that it increases the supply of journalism and so drives down its price (I wouldn’t, but you might. That’s not my point).

But editorially? Well, here we have to take in a new factor: bloggers don’t have to write about what publishers tell them to. And most of them don’t. So while the increase in bloggers has expanded the potential market for contributors – it’s also expanded the content competing with your own. Competition – in strictly economic terms – is supposed to drive quality up. But I’m not going to argue that that’s happening, because this is not a market economy we’re looking at, but a mixed one.

I guess my point is that this isn’t a simple either/or calculation any more. The drive to reduce costs and increase profits has always led to the ‘devaluing of journalism’ as a profession. Blogging and the broader ability for anyone to publish does little to change that. What it does do, however, is introduce different dynamics into the picture. When you divorce ‘journalism’ from its commercial face, ‘publishing’, as the internet has done, then you also break down the relationship between economic devaluation and editorial devaluation when it comes to journalism in aggregate.

First Dabblings With Scraperwiki – All Party Groups

Over the last few months there’s been something of a roadshow making its way around the country, giving journalists et al. hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame :-( ).

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – which was subject based ones rather than country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName)

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="…" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>

<p xmlns="…" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>

– there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:

import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = ''
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.
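For anyone without the Scraperwiki environment to hand, the same "start after the Section 2 header" trick can be sketched in current Python 3 using only the standard library. This is a rough illustration rather than the scraper itself: the class name `SubjectLinkParser`, the helper `subject_links` and the sample HTML (including its "Section 1" header) are made up for the example, modelled on the snippets above.

```python
from html.parser import HTMLParser

class SubjectLinkParser(HTMLParser):
    """Collect (text, href) pairs for links in p.contentsLink that follow a given header."""

    def __init__(self, header_text):
        super().__init__()
        self.header_text = header_text
        self.started = False   # True once the header text has been seen
        self.in_para = False   # True while inside a <p class="contentsLink">
        self.href = None       # href of the <a> currently open, if any
        self.text = []         # text fragments of the current link
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.in_para = attrs.get("class") == "contentsLink"
        elif tag == "a" and self.started and self.in_para:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if data.strip() == self.header_text:
            self.started = True
        elif self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "p":
            self.in_para = False

def subject_links(html, header_text="Section 2: Subject Groups"):
    parser = SubjectLinkParser(header_text)
    parser.feed(html)
    return parser.links

# Made-up sample page, shaped like the directory snippets above
SAMPLE = """
<h2>Section 1: Country Groups</h2>
<p class="contentsLink"><a href="zimbabwe.htm">Zimbabwe</a></p>
<h2>Section 2: Subject Groups</h2>
<p class="contentsLink"><a href="accident-prevention.htm">Accident Prevention</a></p>
"""

print(subject_links(SAMPLE))
```

Only links that appear after the header text are kept, so the country groups are ignored even though their markup looks identical.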

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    # url: the full address of the group page, built from page (not shown)
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    # (the navigation from start to table, and the collection of its lines, is not shown)
    # The elements in the number column are irrelevant
    # Hackery...:-( There must be a better way...!
    print table

    members=[]
    for line in lines:
        if not line.get('style'):
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            # (m, the list of text fragments pulled out of the line, is built here; not shown)
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])

    # assumed: anything beyond the 20 qualifying members is dropped
    if len(members)>20:
        members=members[:20]

    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record)
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

Scraped data
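The filtering and record-building part of that scraper can be sketched on its own, away from the HTML navigation. This is a rough Python 3 sketch rather than the scraper's actual code: the helper names `clean_members` and `make_records`, the made-up names in `cells`, and the idea that a <br /> inside a cell arrives as a newline are all assumptions.

```python
def clean_members(fragments, limit=20):
    """Return up to `limit` member names from raw table-cell text fragments."""
    members = []
    for fragment in fragments:
        # Assumption: a <br /> inside a cell has been turned into a newline
        for name in fragment.split("\n"):
            name = " ".join(name.split())  # collapse stray whitespace
            # Same hack as the scraper: single-word fragments are party identifiers
            if len(name.split()) > 1:
                members.append(name)
    return members[:limit]

def make_records(apg, members):
    """Build one record per (group, member), keyed on a combined id."""
    return [{"id": apg + ":" + m, "mp": m, "apg": apg} for m in members]

# Made-up cell contents: names, party identifiers and one <br />-joined pair
cells = ["Jane Example", "Lab", "John Sample\nAnn Other", "", "Con"]
members = clean_members(cells)
records = make_records("Accident Prevention", members)
print(members)
print(records[0])
```

Splitting each fragment before applying the word-count test is one way the conjoined members might be separated; it still relies on the same "more than one word" hack to throw away the party identifiers.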

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
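The spreadsheet step (dropping the key column to leave member and group pairs) could also be scripted. The following is a minimal Python 3 sketch under a few assumptions: that the CSV has the `id`, `mp` and `apg` columns used in the scraper's records, that a `Source,Target` header is what the graph tool expects for an edge list, and that the sample rows (which are invented) stand in for the real data.

```python
import csv
import io
from collections import Counter

def edges_from_csv(csv_text):
    """Return (member, group) pairs, ignoring the unique-key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["mp"], row["apg"]) for row in reader]

def write_edge_list(edges):
    """Serialise the pairs as a Source,Target CSV edge list."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return out.getvalue()

# Invented sample rows in the shape of the scraped records
SAMPLE_CSV = """id,mp,apg
Cycling:Jane Example,Jane Example,Cycling
Beer:Jane Example,Jane Example,Beer
Beer:John Sample,John Sample,Beer
"""

edges = edges_from_csv(SAMPLE_CSV)
# Number of groups per member: the quantity used for node size in the graph
groups_per_member = Counter(member for member, _ in edges)
print(write_edge_list(edges))
print(groups_per_member)
```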

We can also use an ego filter of depth 2 to see which people an individual is connected to by virtue of common group membership – so, for example (if the scraper worked correctly, and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections
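In a member-to-group network, everything within two steps of a person is their groups plus the other members of those groups, so the "number of common groups" can be computed directly from the member and group pairs. A small Python 3 sketch, with the function name `common_group_counts` and the sample pairs invented for illustration:

```python
from collections import Counter, defaultdict

def common_group_counts(edges, person):
    """Count, for every other member, the groups they share with `person`."""
    members_of = defaultdict(set)  # group -> members
    groups_of = defaultdict(set)   # member -> groups
    for member, group in edges:
        members_of[group].add(member)
        groups_of[member].add(group)
    shared = Counter()
    for group in groups_of[person]:
        for other in members_of[group]:
            if other != person:
                shared[other] += 1
    return shared

# Invented (member, group) pairs
edges = [
    ("Jane Example", "Cycling"), ("Jane Example", "Beer"),
    ("John Sample", "Beer"), ("John Sample", "Cycling"),
    ("Ann Other", "Beer"), ("Sam Else", "Chess"),
]
shared = common_group_counts(edges, "Jane Example")
print(shared)
```

Members who share no group with the chosen person never appear in the result, which matches what the depth 2 filter leaves out of the picture.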

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append data collected according to the schedule to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code using something like a faux scraperwiki.datastore.empty() call to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are colocated in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names.

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result caching… in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the reoccurrences, here’s a Python recipe:

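import re
import simplejson, urllib

def getYahooAppID():
  # appid: your own Yahoo application ID (not shown)
  return appid

def placemakerGeocodeLatLon(address):
  # url: the geocoding request for address, using the app ID (not shown)
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      # the first result is good enough
      return details['latitude'],details['longitude']
  return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  # page, t (the number of result pages needed for num results) and the search url are set up here (not shown)
  while page<=t:
    if term!='':
      pass # the search term is added to the url here (not shown)

    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      # u: the name of the user who sent the tweet (assignment not shown)
      # count how many times each twitterer appears in the results
      if u in tweeters:
        tweeters[u]+=1
      else:
        tweeters[u]=1
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      # count how many times each hashtag appears in the results
      for tag in ttags:
        if tag not in tags:
          tags[tag]=1
        else:
          tags[tag]+=1
    page+=1

  return tweeters,tags

''' Usage:
num=100 #number of search results, best as a multiple of 100 up to max 1500
'''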

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results.
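The counting steps in that list can be tried out without touching any search API. Here is a minimal Python 3 sketch working on a list of tweets already in hand; the function name `count_tweeters_and_tags`, the `user` key and the sample tweets are assumptions for the sake of the example, while the hashtag pattern and the "RT @" test are the ones from the recipe.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#([a-z0-9]+)", re.I)

def count_tweeters_and_tags(tweets):
    """Count tweets per user and occurrences per hashtag, skipping 'RT @' retweets."""
    tweeters, tags = Counter(), Counter()
    for tweet in tweets:
        if tweet["text"].startswith("RT @"):
            continue  # old-style retweets are ignored, as in the recipe
        tweeters[tweet["user"]] += 1
        # Tags are counted exactly as written, so #bath and #Bath stay separate
        tags.update(HASHTAG.findall(tweet["text"]))
    return tweeters, tags

# Made-up tweets
tweets = [
    {"user": "alice", "text": "Coffee on campus #bath #coffee"},
    {"user": "bob", "text": "RT @alice: Coffee on campus #bath"},
    {"user": "alice", "text": "Lecture over #Bath"},
]
tweeters, tags = count_tweeters_and_tags(tweets)
print(tweeters)
print(tags)
```

Lower-casing each tag before counting would merge the differently capitalised variants, which may or may not be what you want when looking for popular local hashtags.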

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U.. Hmm… people there who shouldn’t be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error….? Or is there maybe a UKOLN event on today I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old; (it would be so nice if Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
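The time window in point 1 might look something like the following Python 3 sketch. It assumes each search result carries a `created_at` string in the email-style date format (e.g. "Fri, 29 Oct 2010 11:30:00 +0000"); that format, the function name `recent_tweets` and the sample tweets are assumptions rather than anything confirmed above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_tweets(tweets, max_age, now=None):
    """Keep only the tweets whose created_at is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [t for t in tweets
            if now - parsedate_to_datetime(t["created_at"]) <= max_age]

# Made-up tweets; created_at format is an assumption
tweets = [
    {"text": "just arrived", "created_at": "Fri, 29 Oct 2010 11:30:00 +0000"},
    {"text": "last week", "created_at": "Fri, 22 Oct 2010 09:00:00 +0000"},
]
# A fixed "now" keeps the example repeatable
now = datetime(2010, 10, 29, 12, 0, tzinfo=timezone.utc)
print(recent_tweets(tweets, timedelta(hours=1), now=now))
```

Widening `max_age` to days or weeks gives the profile described in point 2: the people who passed through a location over a longer period.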

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter retweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

Hyperlocal voices: Will Perrin, Kings Cross Environment

hyperlocal blogger Will Perrin

Will Perrin has spoken widely about his experiences with Kings Cross Environment, a site he set up four years ago “as a desperate measure to help with local civic activism”. In the latest in the Hyperlocal Voices series, he explains how news comes far down their list of priorities, and the importance of real world networks.

Who were the people behind the blog, and what were their backgrounds?

I set it up solo in 2006, local campaigner Stephan joined late in 2006 and Sophie shortly thereafter. The three of us write regularly – me a civil servant for most of my time on the site, Sophie an actor, Stephan a retired media executive.

We had all been active in our communities for many years on a range of issues with very different perspectives. There are four or five others who contribute occasionally and a network of 20 or more folk who send us stuff for the site.

What made you decide to set up the blog?

The site was simply a tool to help co-ordinate civic action on the ground. The site was set up in 2006 as a desperate measure to help with local civic activism.

I was totally overwhelmed with reports, documents, minutes of meetings and was generating a lot of photos of broken things on the street. The council had just created a new resident-led committee for me and the burden was going to increase. Also I kept bumping into loads of other people who were active in the community but no one knew what the others were doing. I knew that the internet was a good way of organising information but wasn’t sure how to do it.

Open data in Spain – guest post by Ricard Espelt

Ahead of speaking this week in Barcelona, I spoke to a few people in Spain about the situation regarding open data in the country. One of those people is Ricard Espelt, a member of Nuestracausa, “a group of people who wanted to work on projects like MySociety [in Spain]”. The group broke up and Ricard now runs Redall Comunicacao. Among Ricard’s projects is Copons 2.0: an “approach to consensus decision making”.

This is what Ricard had to say about the problems around open data, e-democracy and bottom-up projects in Spain:

I think there are three points to bear in mind when we try to analyse how the tools are changing politics & public administration:

  • The process of the governments to review data, so it will be easier to use data for all the citizens. Open data.
  • The process of the governments to involve the citizens in the decisions. E-democracy.
  • The action of the citizens (individuals or groups) to engage other citizens to work for the community. It is a good way to lobby and to influence the decisions of the governments.

Spain, like other countries, has been developing all these points with different levels of success.

Hyperlocal voices: the Worst of Perth

Having already interviewed hyperlocal bloggers in the US and the Netherlands, this week’s Hyperlocal Voices profiles an Australian blogger: The Worst of Perth. Launched 3 years ago to criticise a local newspaper, the blog is approaching a million views this year and has an impact on the local political scene.

Who were the people behind the blog, and what were their backgrounds?

Just me. I have a background in stand-up comedy and photography amongst many things, with a bit of dabbling in graphic design and art too.

I used to work for quite a while in video production, (as well as a few occasions as best boy/lighting assistant in a tax write-off kung fu/zombie movie or two). I currently work for Curtin University and am also a student of Mandarin.

What made you decide to set up the blog?

Heh. Well, amusingly from an online journalism point of view, my very first motivation was to label a senior print journo “Australia’s worst journalist”!

Perth has a single daily newspaper, The West Australian, (circ I think about 250 000 daily) which has in many people’s opinion not been best served by being the monopoly daily provider. The paper and its journalists used to be a frequent target of TWOP, but not so much anymore.

The reason for this is at the heart of what’s happening to journalism around the world. Because The West was the only daily paper, in pre-news blog times, people used to be passionate about its faults.

Now no-one really cares how bad it is, because they can get their real news elsewhere. The paper hasn’t got any better, in fact it’s consistently worse, but the difference now is that nobody really cares that much.