Cooks Source: What should Judith Griggs have done?

It’s barely 24 hours since the Cooks Source/Judith Griggs saga blew up, but so much has happened in that time that I thought it worth reflecting on how other publishers might handle a similar situation.

Although it’s an extreme example, the story has particular relevance to those publications that rely on Facebook or another web presence to publish material online and communicate with readers, and might at some point face a backlash on that platform.

In the case of Cooks Source, the magazine's Facebook page went from 100 'likes' to over 3,000, as people 'liked' the page in order to post a critical comment (given the huge number of comments, it's fair to say there were many more people who un-'liked' the page as soon as their comment was posted). The first question that many publishers looking at this might ask is defensive:

Should you have a Facebook page at all?

It would be easy to take the Cooks Source case as an indication that you shouldn’t have a Facebook page at all – on the basis that it might become hijacked by your critics or enemies. Or that if you do create a page you should do so in a way that does not allow postings to the wall.

The problem with this approach is that it misunderstands the fundamental shift in power between publisher and reader. Just as Monica Gaudio was able to tell the world about Judith's cavalier attitude to copyright, not having a Facebook page (or blog, etc.) for your publication doesn't prevent one from existing.

In fact, if you don’t set up a space where your readers can communicate with you and each other, it’s likely that they’ll set one up themselves – and that introduces further problems.

If you don’t have a presence online, someone else will create a fake one to attack you with

After people heard about the Cooks Source story, it wasn't long before some took the opportunity to set up fake Twitter accounts and a Facebook user account in Judith's name. (UPDATE: Someone has registered JudithGriggs.com and pointed it at the Wikipedia entry for 'public domain', while a further Cooks Source Facebook page has been set up claiming that the original was "hacked".)

These were used in various ways: to make information available (the Twitter account biography featured Judith’s phone number and email); to satirise Judith’s actions through mock-updates; and to tease easily-annoyed Facebook posters into angry responses.

Some people’s responses on Facebook to the ‘fake’ Judith suggested they did not realise that she was not the real thing, which leads to the next point.

A passive presence isn’t enough – be active

Judith obviously did have a Facebook account, but it was her slowness to respond to the critics that allowed others to impersonate her.

Indeed, it was several hours before Judith Griggs made any response on the Facebook page, and when she did (assuming it is genuine – see comments below) it was through the page’s welcoming message – in other words, it was a broadcast.

This might be understandable given the unmanageable volume of comments that had been posted by this time – but her message was also therefore easily missed in the depths of the conversation, and it meant that the ‘fake’ Judith was able to continue to impersonate her in responses to those messages.

One way to focus her actions in a meaningful way might have been to do a ‘Find’ on “Griggs” and respond there to clarify that this person was an imposter.

Instead, by being passive Judith created a vacuum. The activity that filled that vacuum led in all directions, including investigating the magazine more broadly and contacting advertisers and stockists.

Climb down quickly and unreservedly

While being passive can create a vacuum, being active can – if not done in a considered way – also simply add fuel to the fire.

The message that Judith eventually posted did just that. “I did apologise to Monica via email, but aparently [sic] it wasnt enough for her,” she wrote, before saying “You did find a way to get your “pound of flesh…””.

This “blaming the victim”, as one wall poster described it, compounded the situation and merely confirmed Judith’s misunderstanding of the anger directed at her.

An apology clearly wasn’t what people wanted – or at least, not this sort of reserved apology.

A quicker, fuller response that demonstrated an understanding of her community would have made an enormous difference in channelling the energy that people poured into what became an increasingly aggressive campaign.

UPDATE (Nov 9): As of a few hours ago Cooks Source appear to have published an official statement which includes a more fulsome apology. The statement doesn't help, however, partly because it doesn't address the key issues raised by critics about where it gets content and images from, partly because its sense of priorities doesn't match those of its audience (the apology comes quite late in the statement), and partly because it is internally inconsistent. Commenters on the Facebook page and blogs have already picked these apart.

There’s also a wonderful ‘corrected’ version of the statement which does an impeccable job of illustrating how they should have phrased it.

Engage with criticism elsewhere

The Cooks Source Facebook page wasn’t the only place where people were gathering to criticise and investigate the magazine. On Reddit hundreds of users collaborated to find other breaches of copyright, put up contact details for the copyright holders, and list advertisers that people could contact. Someone also created a Wikipedia entry to document Griggs’ instant notoriety.

Even if Judith had shut down the Facebook page (not a good idea – it would have merely added further fuel to the fire), the discussion – which had now become a campaign and investigation – was taking place elsewhere. Engaging in that in a positive way might have helped.

A magazine is not just content

One of the key principles demonstrated by the whole affair is that a magazine is about much more than just the content inside it: it is also about the community around it, and its values. This is what advertisers are buying into. When I asked one of Cooks Source's advertisers why they decided to withdraw their support, this is what they said:

“I would estimate that between the emails, [Facebook] messages, calls, and people following us on Twitter, we’ve been contacted by more than 100 people since I first heard of this about 5 hours ago. That doesn’t include many many people who commented on fb to our posts stating that we had requested to pull our ads from the publication. We are just simply trying to run our small business, which by most standards is still in its infancy, and being associated with publications like this that don’t respect its readers (who are all our potential customers) is unacceptable to us in light of their practices. What angers me even more is the fact that it is being made light if by the Editor herself. It is disrupting our business and linking us to something we do not support.”

Postscript: How it unfolded, piece by piece

Kathy E Gill has a wonderfully detailed timeline of how the story broke and developed which offers further lessons in how a situation like this develops.

Cooks Source magazine gets Facebook backlash for copying material without permission

UPDATE 7: The official Cooks Source webpage now features a rather confusing statement on the saga, apologising to Monica Gaudio and saying they have made the donation asked for. The page claims that their Facebook page was “cancelled” and “since hacked”. It’s not clear what they mean by these terms as the original Facebook page is still up and, clearly, could not be hacked if it had been “cancelled”. They may be referring to the duplicate Facebook page which also claims (falsely) the original was “hacked”. In addition the statement says they have “cancelled” their website – but as the statement is published on their website it may be that by “cancelled” they mean all previous content has been removed. This discussion thread picks out further inconsistencies and omissions.

UPDATE 5: The magazine’s Facebook page has now been updated with a message from editor Judith saying she “did apologise” but “apparently it wasn’t enough for her”, shown below:

Well, here I am with egg on my face! I did apologise to Monica via email, but aparently it wasnt enough for her. To all of you, thank you for your interest in Cooks Source and Again, to Monica, I am sorry -- my bad! You did find a way to get your

UPDATE 2: Reddit users have been digging further into the magazine’s use of copyrighted content. They’ve also identified a planned sister magazine, whose Facebook page has also been the recipient of a few comments.

UPDATE 6: Edward Champion has chased down the copyright holders of both text and images found in Cooks Source which appear to have been used without permission.

UPDATE 4: A list of mainstream media reports on the story is also being maintained on the magazine’s Facebook page.

***ORIGINAL BLOG POST STARTS HERE***

For much of today people have been tweeting and blogging about the magazine editor with 30 years’ experience demonstrating a by now familiar misunderstanding of copyright law and the ‘public domain’.

The blog post on Tweetmeme - shared over 1500 times

Reddit: Website article gets copied without permission by print magazine - website complains - magazine claims website should pay them for the publicity

To the writer whose material they used without permission she apparently responded that “the web is considered “public domain” and you should be happy we just didn’t “lift” your whole article and put someone else’s name on it!”

What makes this of particular interest is how the affair has blown up not just across Twitter and Reddit but on the magazine’s own Facebook page, demonstrating how this sort of mistake can impact very directly on your own readers – and stockists and advertisers:

As an advertiser, we are disappointed in Cook's Source and we are pulling our ads from this publication. Many of us (as is the case with our business) paid several months in advance for advertising and are unlikely to get any compensation back. We ask that you please stop emailing our business, we agree that the publication made a grave error, but the blame should be placed with them. Please do not make small businesses like mine pay for their error in judgment.

Facebook comment

Jim Cobb Perhaps someone should obtain a recent copy of the magazine and begin contacting any paid advertisers. Y'know, to clue them in on the business practices of Cooks Source Magazine. They might be interested in hearing about it.

Jon F. Merz If I could draw everyone's attention to the photos down below which contain reprints of magazine pages, that include all of their advertisers. Let's start calling these places up and letting these advertisers know that the money they pay goes to keep a rag like this in business. Hurt 'em where it counts!

Kristine Weil In light of your blatant theft of Monica Gaudio's article and the dismissive response of editor Judith Griggs when called on it by the author, I will be personally speaking to the manager of our local grocery store to encourage them to stop carrying your magazine, and I will continue to speak to them every week until

Meanwhile, others were suggesting investigating the magazine further:

It all adds up to a perfect lesson for magazine editors – not just in copyright, but in PR and community management.

UPDATE 1: It seems that users are going through the latest issue and suggesting where the content may have been taken from.

UPDATE 3: On a separate topics page within the Facebook page, the details are being collated.

Open data from the inside: Lichfield Council’s Stuart Harrison

I’m trying to get a feel for what some of the most innovative government departments and local authorities are doing around releasing data. I spoke to Stuart Harrison of Lichfield Council, which is leading the way at a local level.

What has been your involvement with open data so far?

I’ve been interested in open data for a few years now. It all started when I was building a site for food safety inspections in Staffordshire (http://www.ratemyplace.org.uk/), and after seeing the open APIs offered by sites such as Fixmystreet, Theyworkforyou etc, was inspired to add an API (http://www.ratemyplace.org.uk/api). This then got me thinking about all the data we publish on our website, and whether we could publish this in an open format. A trickle quickly turned into a flood and we now have over 50 individual items of open data at http://www.lichfielddc.gov.uk/data.

I think the main thing I’ve learnt is that APIs are great, but they’re not always necessary. My early work was on APIs that link directly into databases, but, as I’ve moved forward, I’ve found that this isn’t always necessary. While an API is nice to have, it’s sometimes much better to just get the data out there in a raw format.

What have people done with the data so far?

As we’re quite a small council, we haven’t had a lot of people doing work (that I know of) with much of our data. The biggest user of our data is probably Chris Taggart at Openly Local – I actually built an API (and extended the functionality of our existing councillor and committees system) to make it easier to republish. To be honest, unless I know the person and they actually told me, I doubt I’d actually know what was going on!

What do you plan to do next – and why?

Because of the problems stated before, we’ve got together with ScraperWiki to organise a Hacks and Hackers day on the 11th November, which will hopefully encourage developers and journalists to do something with our data, and also put the wheels in motion for organising a data-based community, which means that once someone does something with our data, we’re more likely to know about it!

Online journalism and the promises of new technology PART 6: Conclusion

I totally forgot to wrap up this series – but here it is: the conclusion. Sorry about the delay. And by the way, the whole series is now published (in a slightly different version) as an article in the journal Journalism Studies (restricted access).

Here are the previous posts:

  • The revolution that never happened (part 1)
  • The three main assets of new technology to online journalism — interactivity, hypertext and multimedia (part 2)
  • Online journalism as hypertext (part 3)
  • Online journalism and interactivity (part 4)
  • Online journalism and multimedia (part 5)

The previous posts in this series may have left the impression that online journalism lags behind the technological developments in new media. Linear text is preferred over hypertext and multimedia (hypermedia). Traditional norms of gatekeeping are preferred over participatory journalism and alternative flows of information, although interactivity seems to play a larger role in how major breaking news events, such as crises, are researched and covered.

Journalists and editors seem, at least to some extent, eager to embrace change brought about by new technology, while users don't seem to care. All in all, it seems that technology may not be the main driving force behind developments in online journalism. The question is therefore: how can research on online journalism better grasp why online journalism develops as it does?

Some researchers suggest that ethnography and a closer look at the practices and routines of online news production is the answer. Pablo Boczkowski is a premier example of this trend, first with his 2004 “classic” Digitizing the News, and now with the newly released News at Work — a book in which he investigates online journalism from multiple perspectives (from inside the newsroom to audience perception).

The case studies presented in Domingo and Paterson (eds) Making Online News (2008) — a book which comes in a new edition with new studies next year — are also examples of this ethnographic trend. However, both Boczkowski's work and the studies in Making Online News are quite dominated by the technological discourse.

Some other studies also utilize ethnographic methodology, but from a broader, albeit still technology-oriented, approach that aims to find out how the convergence of newsrooms affects the production of journalism (Dupagne and Garrison, 2006 (pdf); Erdal, 2009 (PhD dissertation as pdf); Klinenberg, 2005 (restricted access); Lawson-Borders, 2006 (preview in Google books)).

These studies provide valuable insights into the complexity of online journalism production and put forward findings that shed light on why technology is not utilized to the degree that has been previously postulated.

Notwithstanding the significant contributions of these studies, there are still many shortcomings of the research on online journalism. I will conclude this series with six suggestions for further research.

First, studies of online journalism could benefit from a broader contextualization. Mitchelstein and Boczkowski (2009) (restricted access) argue that the research on online journalism lacks historical dimensions. Relating online journalism to developments in journalism prior to the Internet boom could therefore be one suggestion. Viewing online journalism in relation to media theory and to how media and media products transform over time could be another. Mitchelstein and Boczkowski (2009) also identify a need for more cross-national studies, and for online journalism researchers to look beyond the newsroom and the news industry and take into account structural factors – for instance the labor market and comparable processes in other industries – in order to better understand "who gets to produce online news, how that production takes place, and what stories result from these dynamics" (2009, 576). It should however be noted that Mark Deuze's 2007 book Media Work (preview in Google books) and a special issue of the journal Journalism on newswork to some extent address these shortcomings.

Second, the research on online journalism is flooded by a range of theoretical concepts that are either interchangeable or are interpreted differently by different researchers. Concepts like interactivity, hypertext and multimedia are understood in different ways, and other concepts, like genre and innovation, are generally used without any theoretical discussion of what they represent and how they might inform the research on online journalism. A stronger emphasis on conceptualization is therefore needed.

Third, most of the research on online journalism is limited to a focus on the presentation and to some degree the production and reception of hard/breaking news and the rhetoric of online news sites’ front-pages. The development of other genres therefore seems to have been downplayed in the research, even though some studies have been conducted on online feature journalism (Boczkowski 2009 (restricted access); and some of my own research (my PhD as pdf)). Furthermore, sections and stories that are reached by other means than via links from the front-page (e.g. traffic to stories and sections generated from search engines) seem to be under-represented in the research. A stronger emphasis on the diversification of online journalism is therefore needed.

Fourth, research on online journalism could benefit from a greater recognition of and reflection on the text as a research unit. Although most research on online journalism deals with text in one way or the other, there is a striking neglect of theoretical and methodological reflections on what texts are, how they facilitate communication, how they relate to media, and how they connect media with society. Genre theory and discourse analysis could for instance be valuable tools to establish research approaches that aim at investigating online journalism as communication. Lüders et al. (2010) (restricted access), for instance, show how the concept of genre provides vital insights into the emergence of new media like the personal weblog.

Fifth, although some of the research mentioned in this series makes longitudinal claims, the empirical material is seldom of longitudinal character. This seems to be a flaw considering the swift development of online journalism and the lack of common theoretical and methodological approaches, which makes comparisons between findings difficult.

And finally, sixth, research on online journalism suffers from a methodological deficiency. The research is dominated by content analysis, surveys and interviews. Qualitative approaches are rarely utilized, even though ethnographic news production studies seem to be gaining popularity. However, given the limited number of cases that can be investigated with such a methodology, more ethnographic research is needed. Furthermore, content analysis should to a greater extent be combined with qualitative textual analysis of online journalism texts – all in order to uncover the complexity of online journalism.

Hyperlocal voices: Daniel Ionescu, The Lincolnite

hyperlocal voices - The Lincolnite

The latest in the Hyperlocal Voices series looks at new hyperlocal blog The Lincolnite, launched by recent Lincoln University graduates, who also managed to secure funding for their venture.

Who were the people behind The Lincolnite, and what were their backgrounds?

The people behind The Lincolnite are Daniel Ionescu (Managing Editor), Elizabeth Fish (Associate Editor), and Chris Brandrick (Senior Editor). Daniel and Elizabeth are journalism graduates from the University of Lincoln, while Chris is a Web Technology graduate from the same institution.

Besides our journalism and web technology training, all of us are also freelance writers for several publications, and have run the award-winning student newspaper at the University of Lincoln for two years.

We also have several contributors and freelancers on board.

What made you decide to set up The Lincolnite?

The idea was something I had at the back of my mind for a couple of years. I believe hyperlocal can be one of the strengths of small independent media outlets, and Lincoln was missing such a publication.

The small city (approx. 100,000 people) is served by county-wide media (one newspaper and two radio stations), yet no dedicated local news source existed. So The Lincolnite came to fill a gap in the market in the city — a news website dedicated to covering only Lincoln.

John Rentoul, Media Oops Number 1 : You cannot close the door once a blog post has bolted

John Rentoul of the Independent has the blog with the longest-running single-blog meme in the known world. "Questions to which the answer is no" is now up to number 411 ("Will Barclays carry out its threat to leave UK?").

I can’t compete with that, so I thought I’d start a list of Media Oops-es, i.e., cockups. This is all in the interest of media transparency, you understand. Shooting from the hip is just as big a problem for blogging journalists as it is for rednecks and Harriet Harman – though I suspect her invective was planned.

(Update: since this is about educating student journalists, I thought I would cross-post to the Online Journalism Blog in addition to the Wardman Wire).

The first one comes via Justin McKeating, who’s doing something slightly similar, though I suspect we’ll be tracking different bits of media silliness.

Rentoul came up with a slightly unflattering comparison:

A friend draws my attention to a resemblance I had not noticed.

Ed Miliband, he says, reminds him of Watto, the hovering, scuzzy garage owner on Tatooine who enslaves little boys in Star Wars Episode I: The Phantom Menace, my favourite film of the six.

Miliband spoke in his speech to Labour conference of his being compared to "Wallace out of Wallace and Gromit" – although he departed from the text issued, "I can see the resemblance", to say: "I gather some people can see the resemblance."

But I thought he looked more like Gromit – the dog who is cleverer than his master who expresses himself mainly by his eyebrows.

If he’d just left it there none of us would have made a fuss. But he thought better of it and deleted the piece. As Justin says:

It looks like the mighty John Rentoul thought better of comparing Ed Miliband to the Watto character from The Phantom Menace and pulled the post without comment. You now get a ‘page not found’ error when you click on the link. Particularly piquant was when Rentoul noted Watto is ‘scuzzy’ and ‘enslaves little boys’. And he deleted his tweet advertising his insightful blog post (we know it was there because somebody replied to it). What a shame, denying future students of journalism this exemplary example of the craft.

Who am I to deny an education to students of journalism? I love computer networks with memories; and also search engines with caches.

20101030-media-oops-john-rentoul-ed-milliband-watto

For the record, here’s the Milliman, who Rentoul (and everybody else) has previously compared to a panda:

q-photo-ed-miliband

The best bit is that the next Rentoul blog post was all about “tasteless metaphors“.

Pot. Kettle. White and black.


Do bloggers devalue journalism?

Science journalist Angela Saini has written an interesting post on ‘devaluing journalism’ that I felt I had to respond to. “The profession [of journalism] is being devalued,” she argues.

“Firstly, by magazines and newspapers that are turning to bloggers for content instead of experienced journalists. And secondly, by people who are willing to work for free or for very little (interns, bloggers, cut-price freelancers). Now this is fine if you’re just running your own site in your spare time, but the media is always going to suffer if journalists don’t demand fair pay for doing real stories. Editors will get away with undercutting their writers. Plus, they’ll be much keener to employ legions of churnalists on the cheap. In the long-run, the quality of stories will fall.”

Firstly let me say that I broadly agree with most of what Angela is saying: that full time journalists offer something that other participants in journalism do not; and that publishers and editors see interns and bloggers as sources of cheap content. I also strongly support interns being paid.

But I think Angela mixes economic value with editorial value, and that undermines the general thrust of the argument.

What reduces the value of something economically? Angela’s argument seems to rest on the idea of increased supply. And indeed, entry-level journalism wages have been consistently depressed partly as a result of increasing numbers of people who want to be journalists and who will work for free, or for low wages – but also partly because of the demands of and pressures on the industry itself.

UPDATE: Ben Mazzotta fleshes out the subtleties of the economics above nicely, although I think he misinterprets the point I'm trying to make.

“Although entry-level journalists are badly paid, that doesn’t necessarily have anything to do with the economics of Nicholas Kristof’s salary. Kristof’s pay goes up or down based on what the papers can afford, which is driven by subscriptions and advertising. In fact, the more liars and bad writers are out there with healthy audiences, the bigger is the pie for the best journalists to fight over. Effectively, that’s just a few million more hacks that Kristof is better than. The best columns in journalism are a classic positional good: their worth is determined by how much better they are than their competitors.”

Editorial value, not economic

Angela’s point, however, is not about the economic value of professional journalism but the editorial value – the quality, not the quantity.

There's an obvious link between the two. Pay people very little, and they won't stick around to become better reporters (witness how many journalists leave the profession for PR as soon as they have families to feed). Rely on interns and not only do you have a less skilled workforce, but the skilled part of your workforce has to spend part of its time doing informal 'training' of those interns.

So where do bloggers come in? Angela mentions them in two senses: firstly as being chosen over experienced journalists, and second as part of a list of people willing to work for little or for free.

‘Blogger’ is meaningless

But, unlike the labels ‘intern’ and ‘freelance journalist’, ‘blogger’ is a definition by platform not by occupation, and takes in a vast range of people, some of whom are very experienced journalists themselves (with high rates), and some of whom have more specialist expertise than journalists. It also includes aspiring journalists and “cut price freelancers”.

Does their existence ‘devalue’ journalism? Economically, you might argue that it increases the supply of journalism and so drives down its price (I wouldn’t, but you might. That’s not my point).

But editorially? Well, here we have to take in a new factor: bloggers don't have to write about what publishers tell them to. And most of them don't. So while the increase in bloggers has expanded the potential pool of contributors, it has also expanded the content competing with your own. Competition – in strictly economic terms – is supposed to drive quality up. But I'm not going to argue that that's happening, because this is not a market economy we're looking at, but a mixed one.

I guess my point is that this isn’t a simple either/or calculation any more. The drive to reduce costs and increase profits has always led to the ‘devaluing of journalism’ as a profession. Blogging and the broader ability for anyone to publish does little to change that. What it does do, however, is introduce different dynamics into the picture. When you divorce ‘journalism’ from its commercial face, ‘publishing’, as the internet has done, then you also break down the relationship between economic devaluation and editorial devaluation when it comes to journalism in aggregate.

First Dabblings With Scraperwiki – All Party Groups

Over the last few months there's been something of a roadshow making its way around the country giving journalists, among others, hands-on experience of using Scraperwiki – I haven't been able to make any of the events, which is a shame :-(

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – the subject-based ones rather than the country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName).

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>

<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>

– there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:

# Imports needed for the scraper to run (scraperwiki for fetching/saving, BeautifulSoup for parsing)
import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to <em>Subject Groups</em>, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table

    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])

    if len(members)>20:
        members=members[:20]

    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record)
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
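As an aside, the spreadsheet step can be scripted too. Here's a minimal sketch – my own addition, not part of the original scraper – which turns the CSV export into a Source/Target edge list that Gephi's "Import Spreadsheet" option will accept; the filenames, and the assumption that the export keeps the mp and apg column names used in the datastore records above, are mine:

import csv

# Read the CSV export of the scraper's datastore (filename is hypothetical) and
# build (member, group) pairs, ignoring the unique id column
edges = []
f = open('apg_members.csv')
for row in csv.DictReader(f):
  edges.append((row['mp'], row['apg']))
f.close()

# Write a Source/Target edge list that Gephi can import as an edge table
out = open('apg_edges.csv', 'wb')
writer = csv.writer(out)
writer.writerow(['Source', 'Target'])
writer.writerows(edges)
out.close()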

We can also use an ego filter (depth 2) to see which people an individual is connected to by virtue of common group membership – so for example (assuming the scraper worked correctly, and I haven't checked that it did!), here are John Stevenson's APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn't appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there's a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append data collected according to the schedule to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code using something like fauxscraperwiki.datastore.empty() to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are co-located in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I've been playing with lately is "community detection", particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names.

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.
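By way of illustration – and this is just how I understand the operator, with a made-up place and radius rather than anything definitive – typing something like the following into the Twitter search box:

near:"milton keynes" within:10km

…should restrict the results to tweets located around that place, which gives us a pool of tweets (and hence user names) to work with.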

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result cacheing.. in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the recurrences, here's a Python recipe:

import simplejson, urllib
import re  # needed for the hashtag extraction in twSearchNear

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      u=i['from_user'].strip()
      if u in tweeters:
        tweeters[u]['count']+=1
      else:
        tweeters[u]={}
        tweeters[u]['count']=1
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      for tag in ttags:
        if tag not in tags:
          tags[tag]=1
        else:
          tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm, people there who shouldn't be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error…? Or is there maybe a UKOLN event on today, I wonder..?

PS Looking at a search near "University of Bath" in the web-based Twitter search, it seems that: a) there aren't many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old – see the sketch after this list (it would be so nice if the Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…);
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
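On the first of those points, here's a minimal sketch of what such a time filter might look like – my addition, and it assumes each search result carries a created_at field in the usual RFC 2822-style date format, which is how I read the search API's JSON:

import email.utils, time

def isRecent(result, maxAgeSecs=3600):
  # Drop tweets older than maxAgeSecs (default: one hour); parsedate_tz copes with
  # RFC 2822-style dates of the sort the search API appears to return
  created = email.utils.mktime_tz(email.utils.parsedate_tz(result['created_at']))
  return (time.time() - created) <= maxAgeSecs

# e.g. inside twSearchNear(), at the top of the results loop:
#   if not isRecent(i):
#     continue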

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter reweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

Hyperlocal voices: Will Perrin, Kings Cross Environment

hyperlocal blogger Will Perrin

Will Perrin has spoken widely about his experiences with www.kingscrossenvironment.com, a site he set up four years ago “as a desperate measure to help with local civic activism”. In the latest in the Hyperlocal Voices series, he explains how news comes far down their list of priorities, and the importance of real world networks.

Who were the people behind the blog, and what were their backgrounds?

I set it up solo in 2006, local campaigner Stephan joined late in 2006 and Sophie shortly thereafter. The three of us write regularly – me a civil servant for most of my time on the site, Sophie an actor, Stephan a retired media executive.

We had all been active in our communities for many years on a range of issues with very different perspectives. There are four or five others who contribute occasionally and a network of 20 or more folk who send us stuff for the site.

What made you decide to set up the blog?

The site was simply a tool to help co-ordinate civic action on the ground. The site was set up in 2006 as a desperate measure to help with local civic activism.

I was totally overwhelmed with reports, documents, minutes of meetings and was generating a lot of photos of broken things on the street. The council had just created a new resident-led committee for me and the burden was going to increase. Also I kept bumping into loads of other people who were active in the community but no one knew what the others were doing. I knew that the internet was a good way of organising information but wasn't sure how to do it.