Category Archives: online journalism

Mapping the New Year Honours List – Where Did the Honours Go?

When I get a chance, I’ll post a (not totally unsympathetic) response to Milo Yiannopoulos’ post The pitiful cult of ‘data journalism’, but in the meantime, here’s a view over some data that was released a couple of days ago – a map of where the New Year Honours went [link]

New Year Honours map

[Hmm… so WordPress.com doesn’t seem to want to let me embed a Google Fusion Table map iframe, and Google Maps (which is embeddable) just shows an empty folder when I try to view the Fusion Table KML… (The Fusion Table export KML doesn’t seem to include lat/lng data either?) Maybe I need to explore some hosting elsewhere this year…]

Note that I wouldn’t make the claim that this represents an example of data journalism. It’s a sketch map showing which parts of the country various recipients of honours this time round presumably live in. Just by posting the map, I’m not reporting any particular story. Instead, I’m trying to find a way of looking at the data to see whether or not there may be any interesting stories that are suggested by viewing the data in this way.

There was a small element of work involved in generating the map view, though… Working backwards, when I used Google Fusion tables to geocode the locations of the honoured, some of the points were incorrectly located:

Google Fusion Tables - correcting fault geocoding

(It would be nice to be able to force a locale to the geocoder, maybe telling it to use maps.google.co.uk as the base, rather than (presumably) maps.google.com?)

The approach I took to tidying these was rather clunky, first going into the table view and filtering on the mispositioned locations:

Google Fusion Tables - correcting geocoding errors

Then correcting them:

Google Fusion Table, Correct Geocode errors

What would be really handy would be if Google Fusion Tables let you see a tabular view of data within a particular map view – so for example, if I could zoom in to the US map and then get a tabular view of the records displayed on that particular local map view… (If it does already support this and I just missed it, please let me know via the comments..;-)

So how did I get the data into Google Fusion Tables? The original data was posted as a PDF on the DirectGov website (New Year Honours List 2012 – in detail)…:

New Year Honours data

…so I used Scraperwiki to preview and read through the PDF and extract the honours list data (my scraper is a little clunky and doesn’t pull out 100% of the data, missing the occasional name and contribution details when they’re split over several lines; but I think it does a reasonable enough job for now, particularly as I am currently more interested in focussing on the possible high level process for extracting and manipulating the data, rather than the correctness of it…!;-)

Here’s the scraper (feel free to improve upon it….:-): Scraperwiki: New Year Honours 2012

I then did a little bit of tweaking in Google Refine, normalising some of the facets and crudely attempting to separate out each person’s role and the contribution for which the award was made.

For example, in the case of Dr Glenis Carole Basiro DAVEY, given column data of the form “The Open University, Science Faculty and Health Education and Training Programme, Africa. For services to Higher and Health Education.“, we can use the following expressions to generate new sub-columns:

value.match(/.*(For .*)/)[0] to pull out things like “For services to Higher and Health Education.”
value.match(/(.*)For .*/)[0] to pull out things like “The Open University, Science Faculty and Health Education and Training Programme, Africa.”
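
For anyone who wants to try the same split outside Google Refine, here is a minimal Python sketch of the idea. The function name split_role_and_contribution is made up for illustration, and the sketch assumes the contribution starts at the last occurrence of “For ” in the text, which is what the greedy .* in the expressions above implies.

```python
import re

def split_role_and_contribution(text):
    """Split a citation into (role, contribution) at the last 'For ...' clause.

    Hypothetical helper for illustration; not part of the original workflow.
    """
    # The greedy (.*) means the split happens at the LAST "For " in the text,
    # mirroring the behaviour of the GREL expressions above.
    m = re.match(r"(.*)(For .*)", text, flags=re.DOTALL)
    if m is None:
        # No "For ..." clause found: treat the whole string as the role.
        return text.strip(), ""
    return m.group(1).strip(), m.group(2).strip()

citation = ("The Open University, Science Faculty and Health Education and "
            "Training Programme, Africa. For services to Higher and Health "
            "Education.")
role, contribution = split_role_and_contribution(citation)
print(role)
print(contribution)
```

Returning an empty contribution when there is no “For …” clause is a design choice of the sketch; in Refine the equivalent rows would simply need checking by hand.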

I also ran each person’s record through the Reuters Open Calais service using Google Refine’s ability to augment data with data from a URL (“Add column by fetching URLs”), pulling the data back as JSON. Here’s the URL format I used (polling once every 500ms in order to stay within the maximum of 4 calls per second mandated by the API):

"http://api.opencalais.com/enlighten/rest/?licenseID=MY_LICENSE_KEY&content=" + escape(value,'url') + "&paramsXML=%3Cc%3Aparams%20xmlns%3Ac%3D%22http%3A%2F%2Fs.opencalais.com%2F1%2Fpred%2F%22%20xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%3E%20%20%3Cc%3AprocessingDirectives%20c%3AcontentType%3D%22TEXT%2FRAW%22%20c%3AoutputFormat%3D%22Application%2FJSON%22%20%20%3E%20%20%3C%2Fc%3AprocessingDirectives%3E%20%20%3Cc%3AuserDirectives%3E%20%20%3C%2Fc%3AuserDirectives%3E%20%20%3Cc%3AexternalMetadata%3E%20%20%3C%2Fc%3AexternalMetadata%3E%20%20%3C%2Fc%3Aparams%3E"

Unpicking this a little:

licenseID is set to my license key value
content is the URL escaped version of the text I wanted to process (in this case, I created a new column from the name column that also pulled in data from a second column (the contribution column). The GREL formula I used to join the columns took the form: value+', '+cells["contribution"].value)
paramsXML is the URL encoded version of the following parameters, which set the content encoding for the result to be JSON (the default is XML):

<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="Application/JSON"  >
</c:processingDirectives>
<c:userDirectives>
</c:userDirectives>
<c:externalMetadata>
</c:externalMetadata>
</c:params>
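
As a rough illustration of how the three URL parameters fit together, here is a Python sketch that assembles the same kind of request URL. It only builds the string and does not call the service. The helper name build_calais_url is hypothetical, the endpoint is simply the one quoted above (it may no longer respond), and the XML is written without the whitespace between elements, so the encoded string is slightly shorter than the one shown earlier.

```python
from urllib.parse import quote

# Endpoint as quoted in the post; the service may no longer respond there.
CALAIS_ENDPOINT = "http://api.opencalais.com/enlighten/rest/"

# Same directives as the XML above, minus the whitespace between elements.
PARAMS_XML = (
    '<c:params xmlns:c="http://s.opencalais.com/1/pred/" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<c:processingDirectives c:contentType="TEXT/RAW" '
    'c:outputFormat="Application/JSON">'
    '</c:processingDirectives>'
    '<c:userDirectives></c:userDirectives>'
    '<c:externalMetadata></c:externalMetadata>'
    '</c:params>'
)

def build_calais_url(license_key, name, contribution):
    """Assemble the request URL; build_calais_url is a hypothetical helper."""
    # Same join as the GREL formula: value + ', ' + cells["contribution"].value
    content = name + ", " + contribution
    # safe="" percent-encodes every reserved character, including / and :
    return (CALAIS_ENDPOINT
            + "?licenseID=" + quote(license_key, safe="")
            + "&content=" + quote(content, safe="")
            + "&paramsXML=" + quote(PARAMS_XML, safe=""))

url = build_calais_url("MY_LICENSE_KEY", "Dr Glenis Carole Basiro DAVEY",
                       "For services to Higher and Health Education.")
print(url)
```

If you did go on to fetch these URLs from a script, a half-second pause between requests would mirror the 500ms polling interval used in Refine.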

So much for process – now where are the stories? That’s left, for now, as an exercise for the reader. An obvious starting point is just to see who received honours in your locale. Remember, Google Fusion Tables lets you generate all sorts of filtered views, so it’s not too hard to map where the MBEs vs OBEs are based, for example, or have a stab at where awards relating to services to Higher Education went. Some awards also have a high correspondence with a particular location, as for example in the case of Enfield…

If you do generate any interesting views from the New Year Honours 2012 Fusion Table, please post a link in the comments. And if you find a problem with/fix for the data or the scraper, please post that info in a comment too:-)

A Quick Peek at Three Content Analysis Services

A long, long time ago, I tinkered with a hack called Serendipitwitterous (long since rotted, I suspect), that would look through a Twitter stream (personal feed, or hashtagged tweets), use the Yahoo term extraction service to try to identify concepts or key words/phrases in each tweet, and then use these as a search term on Slideshare, Youtube and so on to find content that may or may not be loosely related to each tweet.

The Yahoo Term Extraction is still hanging in there – just – but I think it finally gets deprecated early next year. From my feeds today, however, it seems there may be a replacement in the form of a new content analysis service via YQL – Yahoo! Opens Content Analysis Technology to all Developers:

[The Y! Content Analysis service will] extract key terms from the content, and, more importantly, rank them based on their overall importance to the content. The output you receive contains the keywords and their ranks along with other actionable metadata.
On top of entity extraction and ranking, developers need to know whether key terms correspond to objects with existing rich metadata. Having this entity/object connection allows for the creation of highly engaging user experiences. The Y! Content Analysis output provides related Wikipedia IDs for key terms when they can be confidently identified. This enables interoperability with linked data on the semantic Web.

What this means is that you can push a content feed through the service, and get an annotated version out that includes identifier based hooks into other domains (i.e. little-l, little-d linked data). You can find the documentation here: Content Analysis Documentation for Yahoo! Search

So how does it fare? As I’ve previously explored using the Reuters Open Calais service to annotate OU/BBC programme listings (e.g. Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags), I thought I’d use a programme feed from The Bottom Line again…

To start, we need to open the YQL developer console: http://developer.yahoo.com/yql/console/

We can then pull in an example programme description from the BBC using a YQL query of the form:

select long_synopsis from xml where url='http://www.bbc.co.uk/programmes/b00vy3l1.xml'

Grabbing a BBC programme feed into YQL

For reference, the text looks like this:

The view from the top of business. Presented by Evan Davis, The Bottom Line cuts through confusion, statistics and spin to present a clearer view of the business world, through discussion with people running leading and emerging companies.
In the week that Facebook launched its own new messaging service, Evan and his panel of top business guests discuss the role of email at work, amid the many different ways of messaging and communicating.
And location, location, location. It’s a cliche that location can make or break a business, but how true is it really? And what are the advantages of being next door to the competition?
Evan is joined in the studio by Chris Grigg, chief executive of property company British Land; Andrew Horton, chief executive of insurance company Beazley; Raghav Bahl, founder of Indian television news group Network 18.
Producer: Ben Crighton
Last in the series. The Bottom Line returns in January 2011.

The content analysis query example provided looks like this:

select * from contentanalysis.analyze where text="Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration"

but we can nest queries in order to pass the long_synopsis from the BBC programme feed through the service:

select * from contentanalysis.analyze where text in (select long_synopsis from xml where url='http://www.bbc.co.uk/programmes/b00vy3l1.xml')

Here’s the result:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="2" yahoo:created="2011-12-22T11:03:51Z" yahoo:lang="en-US">
    <diagnostics>
        <publiclyCallable>true</publiclyCallable>
        <url execution-start-time="2" execution-stop-time="370"
            execution-time="368" proxy="DEFAULT"><![CDATA[http://www.bbc.co.uk/programmes/b00vy3l1.xml]]></url>
        <user-time>572</user-time>
        <service-time>565</service-time>
        <build-version>24402</build-version>
    </diagnostics> 
    <results>
        <categories xmlns="urn:yahoo:cap">
            <yct_categories>
                <yct_category score="0.536">Business &amp; Economy</yct_category>
                <yct_category score="0.421652">Finance</yct_category>
                <yct_category score="0.418182">Finance/Investment &amp; Company Information</yct_category>
            </yct_categories>
        </categories>
        <entities xmlns="urn:yahoo:cap">
            <entity score="0.979564">
                <text end="57" endchar="57" start="48" startchar="48">Evan Davis</text>
                <wiki_url>http://en.wikipedia.com/wiki/Evan_Davis</wiki_url>
                <types>
                    <type region="us">/person</type>
                    <type region="us">/place/place_of_interest</type>
                    <type region="us">/place/us/town</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Don%27t_Tell_Mama</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Lenny_Dykstra</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Los_Angeles_Police_Department</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Today_%28BBC_Radio_4%29</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Chrisman,_Illinois</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.734099">
                <text end="265" endchar="265" start="258" startchar="258">Facebook</text>
                <wiki_url>http://en.wikipedia.com/wiki/Facebook</wiki_url>
                <types>
                    <type region="us">/organization</type>
                    <type region="us">/organization/domain</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Mark_Zuckerberg</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Social_network_service</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Twitter</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Social_network</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Digital_Sky_Technologies</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.674621">
                <text end="477" endchar="477" start="450" startchar="450">location, location, location</text>
            </entity>
            <entity score="0.651227">
                <text end="79" endchar="79" start="60" startchar="60">The Bottom Line cuts</text>
                <types>
                    <type region="us">/other/movie/movie_name</type>
                </types>
            </entity>
            <entity score="0.646818">
                <text end="799" endchar="799" start="789" startchar="789">Raghav Bahl</text>
                <wiki_url>http://en.wikipedia.com/wiki/Raghav_Bahl</wiki_url>
                <types>
                    <type region="us">/person</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Network_18</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Superpower</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Deng_Xiaoping</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/The_Amazing_Race</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Hare</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.644349">
                <text end="144" endchar="144" start="133" startchar="133">clearer view</text>
            </entity>
            <entity score="0.54609">
                <text end="675" endchar="675" start="665" startchar="665">Chris Grigg</text>
                <types>
                    <type region="us">/person</type>
                </types>
            </entity>
        </entities>
    </results>
</query>

So, some success in pulling out person names, and limited success on company names. The subject categories look reasonably appropriate too.
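
If you want to work with a response like this programmatically, a sketch along the following lines should do it in Python. It uses a trimmed-down copy of the result above as sample input; the only thing taken from the real response is the urn:yahoo:cap namespace and the element names, and extract_entities is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the response above, kept only to exercise the parser.
SAMPLE = """<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="2">
  <results>
    <entities xmlns="urn:yahoo:cap">
      <entity score="0.979564">
        <text end="57" endchar="57" start="48" startchar="48">Evan Davis</text>
        <wiki_url>http://en.wikipedia.com/wiki/Evan_Davis</wiki_url>
        <types>
          <type region="us">/person</type>
          <type region="us">/place/us/town</type>
        </types>
      </entity>
      <entity score="0.674621">
        <text end="477" endchar="477" start="450" startchar="450">location, location, location</text>
      </entity>
    </entities>
  </results>
</query>"""

# The entity elements live in the urn:yahoo:cap default namespace.
NS = {"cap": "urn:yahoo:cap"}

def extract_entities(xml_text):
    """Return one dict per <entity>: text, score, wiki_url (or None), types."""
    root = ET.fromstring(xml_text)
    entities = []
    for entity in root.iterfind(".//cap:entity", NS):
        entities.append({
            "text": entity.findtext("cap:text", default="", namespaces=NS),
            "score": float(entity.get("score")),
            "wiki_url": entity.findtext("cap:wiki_url", default=None,
                                        namespaces=NS),
            "types": [t.text for t in entity.iterfind("cap:types/cap:type", NS)],
        })
    return entities

entities = extract_entities(SAMPLE)
people = [e["text"] for e in entities if "/person" in e["types"]]
print(people)
```

Filtering on the /person type is one simple way of separating the names from phrases such as “location, location, location”, which come back with no type at all.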

[UPDATE: I should have run the desc contentanalysis.analyze query before publishing this post to pull up the docs/examples… As well as the where text= argument, there is a where url= argument that will pull back semantic information about a URL. Running the query over the OU homepage, for example, using select * from contentanalysis.analyze where url="http://www.open.ac.uk" identifies the OU as an organisation, with links out to Wikipedia, as well as geo-information and a Yahoo woe_id.]

Another related service in this area that I haven’t really explored yet is TSO’s Data Enrichment Service (API).

Here’s how it copes with the same programme synopsis:

TSO Data Enrichment Service

Pretty good… and links in to dbpedia (better for machine readability) compared to the Wikipedia links that the Yahoo service offers.

For completeness, here’s what the Reuters Open Calais service comes up with:

Open Calais - content analysis

The best of the bunch on this sample of one, I think, albeit admittedly in the domain that Reuters focuses on?

But so what…? What are these services good for? Automatic metadata generation/extraction is one thing, as I’ve demonstrated in Visualising OU Academic Participation with the BBC’s “In Our Time”, where I generated a quick visualisation that showed the sorts of topics that OU academics had talked about as guests on Melvyn Bragg’s In Our Time, along with the topics that other universities had been engaged with on that programme.

The rise of local media sales partnerships and 19 other recent hyper-local developments you may have missed

In this guest post Ofcom’s Damian Radcliffe cross-publishes his latest presentation on developments in hyperlocal publishing for September-October, and highlights how partnerships are increasingly important for hyper-local, regional and national media in terms of “making it pay”.

When producing my latest bi-monthly update on hyper-local media, I was struck by the fact that media sales partnerships suddenly seem to be all the rage.

In a challenging economic climate, a number of media providers – both big and small – have recently come together to announce initiatives aimed at maximising economies of scale and potentially reducing overheads.

At a hyperlocal level, the launch on 1st November of the Chicago Independent Advertising Network (CIAN) saw 15 Chicago community news sites coming together to offer a single point of contact for advertisers. These sites “collectively serve more than 1 million page views each month.”

This initiative follows in the footsteps of other small scale advertising alliances including the Seattle Indie Ad Network and Boston Blogs.

These moves – bringing together a range of small scale location based websites – can help address concerns that hyper-local sites are not big enough (on their own) to unlock funding from large advertisers.

CIAN also aims to address a further hyper-local concern: that of sales skills. Rather than having a hyperlocal practitioner add media sales to an ever expanding list of duties, funding from the Chicago Community Trust and the Knight Community Information Challenge allows for a full-time salesperson.

Big Media is also getting in on this act.

In early November Microsoft, Yahoo! and AOL agreed to sell each other’s unsold display ads. The move is a response to Google and Facebook’s increasing clout in this space.

Reuters reported that both Facebook and Google are expected to increase their share of online display advertising in the United States in 2011 by 9.3% and 16.3%.

In contrast, AOL, Microsoft and Yahoo are forecast to lose share, with Facebook expected to surpass Yahoo for the first time.

Similarly in the UK, DMGT’s Northcliffe Media, home to 113 regional newspapers, recently announced it was forging a joint partnership with Trinity Mirror’s regional sales house, AMRA.

This will create a commercial proposition encompassing over 260 titles, including nine of the UK’s 10 biggest regional paid-for titles. Like the Microsoft, Yahoo! and AOL arrangement, this new partnership comes into effect in 2012.

These examples all offer opportunities for economies of scale for media outlets, and potentially greater reach and impact for advertisers. Given these benefits, I wouldn’t be surprised if we saw more of these types of partnership in the coming months and years.

Damian Radcliffe is writing in a personal capacity.

Other topics in his current hyperlocal slides include Sky’s local pilot in NE England and research into the links between tablet use and local news consumption. As ever, feedback and suggestions for future editions are welcome.

More Dabblings With Local Sentencing Data

In Accessing and Visualising Sentencing Data for Local Courts I posted a couple of quick ways in to playing with Ministry of Justice sentencing data for the period July 2010-June 2011 at the local court level. At the end of the post, I wondered about how to wrangle the data in R so that I could look at percentage-wise comparisons between different factors (age, gender) and offence type, and mentioned that I’d posted a related question to the Cross Validated/Stats Exchange site (Casting multidimensional data in R into a data frame).

Courtesy of Chase, I have an answer 🙂 So let’s see how it plays out…

To start, let’s just load the Isle of Wight court sentencing data into RStudio:

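require(ggplot2)
require(reshape2)
#ddply() comes from the plyr package, so load it explicitly in case ggplot2 doesn't attach it for you
require(plyr)
iw = read.csv("http://dl.dropbox.com/u/1156404/wightCrimRecords.csv")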

Now we’re going to shape the data so that we can plot the percentage of each offence type by gender (limited to Male and Female options):

iw.m = melt(iw, id.vars = "sex", measure.vars = "Offence_type")
iw.sex = ddply(iw.m, "sex", function(x) as.data.frame(prop.table(table(x$value))))
ggplot(subset(iw.sex,sex=='Female'|sex=='Male')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~sex)+ opts(axis.text.x=theme_text(angle=-90)) + xlab('Offence Type')

Here’s the result:

Splitting down offences by percentage and gender

We can also process the data over a couple of variables. So for example, we can look to see how female recorded sentences break down by offence type and age range, displaying the results as a percentage of how often each offence type on its own was recorded by age:

iw.m2 = melt(iw, id.vars = c("sex","Offence_type" ), measure.vars = "AGE")
iw.off=ddply(iw.m2, c("sex","Offence_type"), function(x) as.data.frame(prop.table(table(x$value))))

ggplot(subset(iw.off,sex=='Female')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~Offence_type) + opts(axis.text.x=theme_text(angle=-90)) + xlab('Age Range (Female)')

Offence type broken down by age and gender

Note that this graphic may actually be a little misleading because percentage based reports don’t play well with small numbers…: whilst there are multiple Driving Offences recorded, there are only two Burglaries, so the statistical distribution of convicted female burglars is based over a population of size two… A count would be a better way of showing this.
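
To make the small-numbers point concrete, here is a little Python sketch that reports the count alongside each percentage, so a 50% share based on two records is never mistaken for one based on hundreds. The records are made up purely for illustration (they are not the Isle of Wight figures), and breakdown is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Made-up records purely for illustration: (offence_type, age_range).
records = [
    ("Driving", "18-24"), ("Driving", "25-34"), ("Driving", "25-34"),
    ("Driving", "35+"), ("Driving", "35+"), ("Driving", "35+"),
    ("Burglary", "18-24"), ("Burglary", "35+"),
]

def breakdown(rows):
    """Return {offence: [(age, count, share), ...]} so n is never hidden."""
    by_offence = defaultdict(Counter)
    for offence, age in rows:
        by_offence[offence][age] += 1
    result = {}
    for offence, ages in by_offence.items():
        total = sum(ages.values())
        # Keep the raw count next to the share of that offence's total.
        result[offence] = [(age, n, n / total)
                           for age, n in sorted(ages.items())]
    return result

summary = breakdown(records)
for offence, rows in summary.items():
    for age, n, share in rows:
        print(f"{offence:10s} {age:6s} n={n} share={share:.0%}")
```

In this toy data both the 35+ drivers and the 35+ burglars come out at 50%, but one figure rests on three records and the other on one.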

PS I was hoping to be able to just transmute the variables and generate a raft of other charts, but I seem to be getting an error, maybe because some rows are missing? So: anyone know where I’m supposed to post R library bug reports?

Accessing and Visualising Sentencing Data for Local Courts

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

– the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
– the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

head recordlevel.csv

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i reads: “ignoring case”; grep is a command that identifies rows that contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv”.)
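
If you don’t have grep to hand, much the same filtering can be sketched in a few lines of Python. The helper name grep_ignore_case is hypothetical, and the tiny sample file (including its column names) is invented just so the example runs on its own; in practice you would point it at recordlevel.csv.

```python
import os
import tempfile

def grep_ignore_case(src_path, dest_path, term):
    """Copy lines of src_path containing term (any case) to dest_path.

    Hypothetical helper, roughly equivalent to: grep -i term src > dest
    Returns the number of matching lines.
    """
    needle = term.lower()
    matched = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for line in src:  # read line by line, so large files are not a problem
            if needle in line.lower():
                dest.write(line)
                matched += 1
    return matched

# Tiny stand-in for recordlevel.csv; the column names here are invented.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "sample.csv")
dest_file = os.path.join(workdir, "recordsContainingWight.csv")
with open(src_file, "w", encoding="utf-8") as f:
    f.write("court,offence\n"
            "Isle of Wight Magistrates,Theft\n"
            "Leeds,Driving\n"
            "ISLE OF WIGHT Crown,Fraud\n")

count = grep_ignore_case(src_file, dest_file, "wight")
print(count)
```

As with the grep command, the header row is dropped unless it happens to contain the search term, so the column names may need adding back before uploading the smaller file anywhere.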

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as a Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Isle of Wight sentencing data

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

Visualising data in google fusion tables

We can also introduce filters to allow us to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Data exploration in Google Fusion Tables

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?

R/RStudio

If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the recordlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (I apply the regular expression to the contents of the court column):

recordlevel <- read.csv("~/data/recordlevel.csv")
iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

age=table(iw$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

R - bar plot

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw$AGE, iw$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R barplot - offences on IW

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R - stacked bar chart

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

require(ggplot2)
#You may need to install ggplot2 as a library if it isn't already installed
ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + opts(axis.text.x=theme_text(angle=-90))+xlab('Offence Type')

GGPlot2 in R

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

ggplot facet barplot

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

ggplot with stacked factor

One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below 😉

PS Here’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:

yeroon.net/ggplot2

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male'])
iwpFemale=data.frame(iwp['Female'])

We can then plot these percentages using constructions of the form:
ggplot(iwpMale)+geom_bar(aes(x=Male.x,y=Male.Freq))
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)
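
For what it’s worth, the shape I’m after is a “long” table with one row per (sex, offence type) pair and a proportion column. Here is a sketch of that reshaping in Python rather than R, using made-up rows in place of the iw data frame; proportions_long is a hypothetical helper name, and this is not a substitute for a proper R answer.

```python
from collections import Counter, defaultdict

# Made-up rows standing in for the iw data frame: (sex, Offence_type).
rows = [
    ("Male", "Driving"), ("Male", "Driving"), ("Male", "Theft"),
    ("Male", "Burglary"), ("Female", "Driving"), ("Female", "Theft"),
]

def proportions_long(pairs):
    """Return sorted (group, value, share) tuples; shares sum to 1 per group."""
    counts = defaultdict(Counter)
    for group, value in pairs:
        counts[group][value] += 1
    table = []
    for group in sorted(counts):
        total = sum(counts[group].values())
        for value in sorted(counts[group]):
            # One flat row per (group, value) pair instead of a nested array.
            table.append((group, value, counts[group][value] / total))
    return table

long_table = proportions_long(rows)
for sex, offence, share in long_table:
    print(sex, offence, round(share, 3))
```

Because every row carries its own group label, a single table like this can be faceted by sex directly, rather than needing one data frame per sex.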

Finding Common Terms around a Twitter Hashtag

@aendrew sent me a link to a StackExchange question he’s just raised, in a tweet asking: “Anyone know how to find what terms surround a Twitter trend/hashtag?”

I’ve dabbled in this area before, though not addressing this question exactly, using Yahoo Pipes to find what hashtags are being used around a particular search term (Searching for Twitter Hashtags and Finding Hashtag Communities) or by members of a particular list (What’s Happening Now: Hashtags on Twitter Lists; that post also links to a pipe that identifies names of people tweeting around a particular search term).

So what would we need a pipe to do that finds terms surrounding a twitter hashtag?

Firstly, we need to search on the tag to pull back a list of tweets containing that tag. Then we need to split the tweets into atomic elements (i.e. separate words). At this point, it might be useful to count how many times each one occurs, and display the most popular. We might also need to generate a “stop list” containing common words we aren’t really interested in (for example, the or and).

So here’s a quick hack at a pipe that does just that (Popular words round a hashtag).

For a start, I’m going to construct a string tokeniser that just searches for 100 tweets containing a particular search term, and then splits each tweet up into separate words, where words are things that are separated by white space. The pipe output is just a list of all the words from all the tweets that the search returned:

Twitter string tokeniser

You might notice the pipe also allows us to choose which page of results we want…

We can now use the helper pipe in another pipe. Firstly, let’s grab the words from a search that returns 200 tweets on the same search term. The helper pipe is called twice, once for the first page of results, once for the second page of results. The wordlists from each search query are then merged by the union block. The Rename block relabels the .content attribute as the .title attribute of each feed item.

Grab 200 tweets and check we have set the title element

The next thing we’re going to do is identify and count the unique words in the combined wordlist using the Unique block, and then sort the list according to the number of times each word occurs.

Preliminary parsing of a wordlist
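Counting and sorting is the part that maps most directly onto a few lines of code. The sketch below uses Python’s `collections.Counter` as a stand-in for the Unique and Sort blocks; the `count_words` name and the sample wordlist are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple


def count_words(words: List[str]) -> List[Tuple[str, int]]:
    """Count each distinct word and return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()


# Hypothetical combined wordlist
wordlist = ["data", "maps", "data", "#ddj", "data", "maps"]
ranked = count_words(wordlist)
print(ranked)
```

Note that this version is case sensitive, so “Data” and “data” would be counted separately unless the words are lower-cased first.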

The above pipe fragment also filters the wordlist so that only words that contain alphabetic characters and have four or more characters are allowed through. (The regular expression .{4,} reads: allow any string of four or more ({4,}) characters of any type (.). An expression .{5,7} would say – allow words through with length 5 to 7 characters.)
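The same two filters can be tried out in Python’s `re` module, as in the sketch below. The function names and sample words are made up, and Python’s matching rules may not be identical to those of the Pipes Filter block. One difference worth knowing about: in Python an unanchored search for .{5,7} also succeeds on longer words, because it only needs to find five to seven characters somewhere in the string, so the sketch uses `re.fullmatch` to enforce the upper limit.

```python
import re

HAS_LETTER = re.compile(r"[A-Za-z]")   # at least one alphabetic character
FOUR_OR_MORE = re.compile(r".{4,}")    # four or more characters of any type


def keep(word: str) -> bool:
    """Allow a word through only if it passes both filters."""
    return bool(HAS_LETTER.search(word)) and bool(FOUR_OR_MORE.search(word))


def length_between(word: str, shortest: int, longest: int) -> bool:
    """True if the whole word is between `shortest` and `longest` characters long."""
    return re.fullmatch(r".{%d,%d}" % (shortest, longest), word) is not None


# Hypothetical sample wordlist
words = ["data", "the", "2011", "#ddj", "journalism", "!!!!"]
kept = [word for word in words if keep(word)]
print(kept)
```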

I’ve also added a short routine that implements a stop list. The regular expression pattern (?i)\b(word1|word2|word3)\b says: ignoring case ((?i)), try to match any of the words word1, word2, word3. (\b denotes a word boundary.) Note that in the filter below, some of the words in my stop list are redundant (the ones with three or fewer characters. Remember, we have already filtered the word list to show only words of length four or more characters.)

Stop list

I also added a user input that allows additional stop terms to be added (they should be pipe (|) separated, with no spaces between them). You can find the pipe here.
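Here is a hedged Python sketch of the stop list, including the pipe-separated extra terms. The default stop words, `build_stop_pattern` and `remove_stop_words` are all assumptions for illustration; the actual list used in the pipe is the one shown in the screenshot above.

```python
import re
from typing import List, Pattern

# Hypothetical default stop list (not the one used in the pipe)
DEFAULT_STOP_WORDS = ["the", "and", "with", "that", "this", "from"]


def build_stop_pattern(extra_terms: str = "") -> Pattern[str]:
    """Build (?i)\\b(word1|word2|...)\\b from the defaults plus pipe-separated extras."""
    extras = [term for term in extra_terms.split("|") if term]
    # re.escape stops characters such as '.' or '+' being read as regex syntax
    alternatives = "|".join(re.escape(w) for w in DEFAULT_STOP_WORDS + extras)
    return re.compile(r"(?i)\b(" + alternatives + r")\b")


def remove_stop_words(words: List[str], extra_terms: str = "") -> List[str]:
    """Drop every word that matches the stop pattern."""
    pattern = build_stop_pattern(extra_terms)
    return [word for word in words if not pattern.search(word)]


# Hypothetical wordlist and user-supplied extra stop terms
words = ["With", "data", "THIS", "maps", "today"]
print(remove_stop_words(words))
print(remove_stop_words(words, "today|maps"))
```

Escaping the user-supplied terms is a design choice of this sketch: it treats each term as a literal word, which is safer, but it also means users cannot pass in regular expressions of their own.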

Strategies vs tools redux

Yesterday I chaired a panel on ‘UGC and Social Media’ at Birmingham’s Hello Culture event. Determined that it did not descend into the all-too-common obsession with tools that often characterises such discussions, I framed it from the start with the questions “Why should we care? Why should users care?”

The panellists were grateful – and the tactic seemed to work. We talked about the tension between creating content and building relationships; between the urge to ‘get people on our platform’ and going to their platforms instead. We discussed how the experience of designing physical spaces might inform how we approach designing digital ones; and about revisiting strategic priorities as a whole instead of simply trying to ‘find time’ to ‘do the online stuff’.

In other words we talked about people rather than technology, and strategies rather than tools.

So this morning it was good to be brought back down to earth and reminded just how embedded the technology-driven mindset is by Richard Millington.

Richard writes about a ‘State of Branded Online Communities’ report that uses Bravo TV as an example of a “successful” online community. The problem is that by any sensible measure, it isn’t. And I think Richard’s quotes on just how flawed the example is are worth reproducing here at length:

“If simply posting a standardized thread each week and leaving people to their own endeavours is seen as good community management practice, what exactly is bad community management? This is community management by autopilot.

“… You judge a community’s success by it’s stage in the life cycle, the number of interactions it generates, it’s members sense of community and the ROI it offers the organization. ComBlu defines success by what features the platform offers. By that assessment, nearly all of the most successful communities would be considered failures. [They struggle to get more than 10 members participating in a community at any one time.]

“ComBlu credits Bravo with an array of successes which have no impact on the community’s success. Only one suggestion is offered:

“[..] On our Bravo wish list? A better gamification or reputation management system.”

“There are a variety of things the community needs, a better gamification system certainly isn’t one of them.

“How about hiring a community manager to take responsibility for stimulating discussions […]?

“… Content sites branded as communities are still content sites.”

Ah, gamification: I’ll tip that to be next year’s QR code/Facebook page. How about an iPhone app? Everyone else is doing it so why shouldn’t we? Remember when everyone had to have a space in Second Life?

It’s a point I’ve made before in Technology is not a strategy: it’s a tool (and its follow-up), and which is explored at length in my Online Journalism book. Too often in an organisation or in a student project someone decides that they must launch a Facebook page or ‘be on Twitter’.

I recently compared this to someone approaching a TV producer, saying they wanted to make a documentary, and explaining that their strategy would be to “use a camera”.

No producer would accept that, and we need an equally critical attitude to the use of new technology. Otherwise we’re just hammers walking around seeing nails.

A case study in crowdsourcing investigative journalism part 7: Conclusions

In the final part of the research underpinning a new Help Me Investigate project I explore the qualities that successful crowdsourcing investigations shared. Previous parts are linked below:

Conclusions

Looking at the reasons that users of the site as a whole gave for not contributing to an investigation, the majority attributed this to ‘not having enough time’. Although at least one interviewee, in contrast, highlighted the simplicity and ease of contributing, contributing needs to be (or at least appear to be) as easy and simple as possible in order to lower the perception of effort and time needed.

Notably, the second biggest reason for not contributing was a ‘lack of personal connection with an investigation’, demonstrating the importance of the individual and social dimension of crowdsourcing. Likewise, a ‘personal interest in the issue’ was the single largest factor in someone contributing. A ‘Why should I contribute?’ feature on crowdsourcing projects may be worth considering.

Others mentioned the social dimension of crowdsourcing – the “sense of being involved in something together” – what Jenkins (2006, p244) would refer to as “consumption as a networked practice”, a motivation also identified by Yochai Benkler in his work on networks (2006). Looking at non-financial motivations behind people contributing their time to online projects, he refers to “socio-psychological reward”. He also identifies the importance of “hedonic personal gratification”. In other words, fun.

Although positive feedback formed part of the design of the site, no consideration was paid to negative feedback: users being made aware of when they were not succeeding. This element also appears to be absent from game mechanics in other crowdsourcing experiments such as The Guardian’s MPs’ expenses app.

While it is easy to talk about “Failure for free”, more could be done to identify and support failing investigations. A monthly update feature that would remind users of recent activity and – more importantly – the lack of activity might help here. The investigators in a group might be asked whether they wish to terminate the investigation in those cases, emphasising their responsibility for its progress and helping ‘clean up’ the investigations listed on the first page of the site.

However, there is also a danger in interfering too much in reducing failure. This is a natural instinct, and the establishment of a reasonable ‘success rate’ at the outset – based on the literature around crowdsourcing – helps to counter this. That was part of the design of Help Me Investigate: it was the 1-5% of questions that gained traction that would be the focus of the site. One analogy is a news conference where members throw out ideas – only a few are chosen for investment of time and energy, the rest ‘fail’.

It is the management of that tension between interfering to ensure everything succeeds (and so removing the incentive for users to be self-motivated) and not interfering at all (leaving users feeling unsupported and unmotivated) that is likely to be the key to a successful crowdsourcing project. More than a year into the project, this tension was still being negotiated.

In summing up the research into Help Me Investigate it is possible to identify five qualities which successful investigations shared: ‘Alpha users’ (highly active, who drove investigations forward); modularity (the ability to break down a large investigation into smaller discrete elements); public-ness (the ability for others to find out about an investigation); feedback (game mechanics and the pleasure of using the site); and diversity of users.

Relating these findings to other research into crowdsourcing more generally it is possible to make broader generalisations regarding how future projects might be best organised. Leadbeater (2008, p68), for example, identifies five key principles of successful collaborative projects, summed up as ‘Core’ (directly comparable to the need for alpha users identified in this research); ‘Contribute’ (large numbers, comparable to public-ness); ‘Connect’ (diversity); ‘Collaborate’ (self governance – relating indirectly to modularity); and ‘Create’ (creative pleasure – relating indirectly to feedback). Similar qualities are also identified by US investigative reporter and Knight fellow Wendy Norris in her experiments with crowdsourcing (Lavrusik, 2010).

The most notable connections here are the indirect ones. While the technology of Help Me Investigate allowed for modularity, for example, the community structure was rather flat. Leadbeater’s research (2008) and that of Lih (2009) into the development of Wikipedia and Tsui (2010, PDF) into Global Voices indicate that ‘modularity’ may be part of a wider need for ‘structure’. Conversely ‘feedback’ provides a specific, practical way for crowdsourcing projects to address users’ need for creative pleasure.

As Help Me Investigate reached its 18th month a number of changes were made to test these ideas: the code was released as open source, effectively crowdsourcing the technology itself, and a strategy was adopted to recruit niche community managers who could build expertise in particular fields, along with an advisory board that was similarly diverse. The Help Me Investigate design was replicated in a plugin which would allow anyone running a self-hosted WordPress blog to manage their own version of the site.

This separation of technology from community was a key learning outcome of the project. While the site had solved some of the technical challenges of crowdsourcing and identified the qualities of successful crowdsourced investigation, it was clear that the biggest challenge lay in connecting the increasingly networked communities that wanted to investigate public interest issues – and in a way that was both sustainable and scalable beyond the level of individual investigations.

 

References

  1. Arthur, Charles. Forecasting is a notoriously imprecise science – ask any meteorologist, January 29 2010, The Guardian, http://www.guardian.co.uk/technology/2010/jan/29/apple-ipad-crowdsource accessed 14/3/2011
  2. Beckett, Charlie (2008) SuperMedia, Oxford: Blackwell
  3. Belam, Martin. Whatever Paul Waugh thinks, The Guardian’s MPs Expenses crowd-sourcing experiment was no “total failure”, Currybetdotnet, March 10 2010 http://www.currybet.net/cbet_blog/2010/03/whatever-paul-waugh-thinks-the.php accessed 14/3/2011
  4. Belam, Martin. Abort? Retry? Fail? – Judging the success of the Guardian’s MP’s expenses app, Currybetdotnet, March 7 2011, http://www.currybet.net/cbet_blog/2011/03/guardian-mps-expenses-success.php accessed 14/3/2011
  5. Belam, Martin. The Guardian’s Paul Lewis on crowd-sourcing investigative journalism with Twitter, Currybetdotnet, March 10 2011, http://www.currybet.net/cbet_blog/2011/03/paul-lewis-investigative-journalism-twitter.php accessed 14/3/2011
  6. Benkler, Yochai (2006) The Wealth of Networks, New Haven: Yale University Press
  7. Bonomolo, Alessandra. Repubblica.it’s experiment with “Investigative reporting on demand”, Online Journalism Blog, March 21 2011, https://onlinejournalismblog.com/2011/03/21/repubblica-its-experiment-with-investigative-reporting-on-demand/ accessed 23/3/2011
  8. Bradshaw, Paul. Wiki Journalism: Are wikis the new blogs? Paper presented to The Future of Journalism conference, Cardiff University, September 2007, https://onlinejournalismblog.com/wp-content/uploads/2007/09/wiki_journalism.pdf
  9. Bradshaw, Paul. The Guardian’s tool to crowdsource MPs’ expenses data: time to play, Online Journalism Blog, June 19 2009 https://onlinejournalismblog.com/2009/06/19/the-guardian-build-a-platform-to-crowdsource-mps-expenses-data/ accessed 14/3/2011
  10. Brogan, C., & Smith, J. (2009). Trust Agents: Using the Web to Build Influence, Improve Reputation, and Earn Trust (1 ed.), New Jersey: Wiley
  11. Bruns, Axel (2005) Gatewatching, New York: Peter Lang
  12. Bruns, Axel (2008) Blogs, Wikipedia, Second Life and Beyond, New York: Peter Lang
  13. De Burgh, Hugo (2008) Investigative Journalism, London: Routledge
  14. Dondlinger, Mary Jo. Educational Video Game Design: A Review of the Literature, Journal of Applied Educational Technology Volume 4, Number 1, Spring/Summer 2007, http://www.eduquery.com/jaet/JAET4-1_Dondlinger.pdf
  15. Ellis, Justin. A perpetual motion machine for investigative reporting: CPI and PRI partner on state corruption project, Nieman Journalism Lab, March 8 2011 http://www.niemanlab.org/2011/03/a-perpetual-motion-machine-for-investigative-reporting-cpi-and-pri-partner-on-state-corruption-project/ accessed 21/3/2011
  16. Graham, John. Feedback in Game Design, Wolfire Blog, April 21 2010 http://blog.wolfire.com/2010/04/Feedback-In-Game-Design accessed 14/3/2011
  17. Grey, Stephen (2006) Ghost Plane, London: C Hurst & Co
  18. Hickman, Jon. Help Me Investigate: the social practices of investigative journalism, Paper presented to the Media Production Analysis Working Group, IAMCR, Braga, 2010, http://theplan.co.uk/help-me-investigate-the-social-practices-of-i
  19. Howe, Jeff. Gannett to Crowdsource News, Wired, November 3 2006, http://www.wired.com/software/webservices/news/2006/11/72067 accessed 14/3/2011
  20. Jenkins, Henry (2006) Convergence Culture, New York: New York University Press
  21. Lavrusik, Vadim. How Investigative Journalism Is Prospering in the Age of Social Media, Mashable, November 24 2010, http://mashable.com/2010/11/24/investigative-journalism-social-web/ accessed 14/3/2011
  22. Leadbeater, Charles (2008) We-Think, London: Profile Books
  23. Leigh, David. Help us solve the mystery of Blair’s money, The Guardian, December 1 2009, http://www.guardian.co.uk/politics/2009/dec/01/help-us-solve-blair-mystery accessed 14/3/2011
  24. Lih, Andrew (2009) The Wikipedia Revolution, London: Aurum Press
  25. Marshall, Sarah. Snow map developer creates ‘Cutsmap’ for Channel 4’s budget coverage, Journalism.co.uk, 22 March 2011, http://www.journalism.co.uk/news/snow-map-developer-creates-cutsmap-for-channel-4-s-budget-coverage/s2/a543335/ accessed 22/3/2011
  26. Morozov, Evgeny (2011) The Net Delusion, London: Allen Lane
  27. Nielsen, Jakob. Participation Inequality: Encouraging More Users to Contribute, Jakob Nielsen’s Alertbox, October 9, 2006, http://www.useit.com/alertbox/participation_inequality.html accessed 14/3/2011
  28. Paterson and Domingo (2008) Making Online News: The Ethnography of New Media Production, New York: Peter Lang
  29. Porter, Joshua (2008) Designing for the Social Web, Berkeley: New Riders
  30. Raymond, Eric S. (1999) The Cathedral and the Bazaar, New York: O’Reilly
  31. Scotney, Tom. Help Me Investigate: How working collaboratively can benefit journalists, Journalism.co.uk, August 14 2009, http://www.journalism.co.uk/news-features/help-me-investigate-how-working-collaboratively-can-benefit-journalists/s5/a535469/ accessed 21/3/2011
  32. Shirky, Clay (2008) Here Comes Everybody, London: Allen Lane
  33. Snyder, Chris. Spot.Us Launches Crowd-Funded Journalism Project, Wired, November 10, 2008, http://www.wired.com/epicenter/2008/11/spotus-launches/ accessed 21/3/2011
  34. Surowiecki, James (2005) The Wisdom of Crowds, London: Abacus
  35. Tapscott, Don & Williams, Anthony (2006) Wikinomics, London: Atlantic Books
  36. Tsui, Lokman. A Journalism of Hospitality, unpublished thesis, Presented to the Faculties of the University of Pennsylvania, 2010 http://dl.dropbox.com/u/22048/Tsui-Dissertation-Deposit-Final.pdf accessed 14/3/2011
  37. Weinberger, David (2002) Small Pieces, Loosely Joined, New York: Basic Books

What made the crowdsourcing successful? A case study in crowdsourcing investigative journalism part 6

In the penultimate part of the serialisation of research underpinning a new Help Me Investigate project I explore the qualities that successful crowdsourcing investigations shared. Previous parts are linked below:

What made the crowdsourcing successful?

Clearly, a distinction should be made between what made the investigation successful as a series of outcomes, and what made crowdsourcing successful as a method for investigative reporting. This section concerns itself with the latter.

What made the community gather, and continue to return? One hypothesis was that the nature of the investigation provided a natural cue to interested parties – The London Weekly was published on Fridays and Saturdays and there was a build up of expectation to see if a new issue would indeed appear.

The data, however, did not support this hypothesis. There was indeed a rhythm but it did not correlate to the date of publication. Wednesdays were the most popular day for people contributing to the investigation.

Upon further investigation a possible explanation was found: one of the investigation’s ‘alpha’ contributors – James Ball – had set himself a task to blog about the investigation every week. His blog posts appeared on a Wednesday.

That this turned out to be a significant factor in driving activity suggests one important lesson: talking publicly and regularly about the investigation’s progress is key to its activity and success.

This data was backed up by the interviews. One respondent mentioned the “weekly cue” explicitly. And Jon Hickman’s research also identified that investigation activity related to “events and interventions. Leadership, especially by staffers, and tasking appeared to be the main drivers of activity within the investigation.” (2010, p10)

He breaks down activity on the site into three ‘acts’, although their relationship to the success of the investigation is not explored further:

  • ‘Brainstorm’ (an initial flurry of activity, much of which is focused on scoping the investigation and recruiting)
  • ‘Consolidation’ (activity is driven by new information)
  • ‘Long tail’ (intermittent caretaker activity, such as supportive comments or occasional updates)

Networked utility

Hickman describes the site as a “centralised sub-network that suits a specific activity” (2010, p12). Importantly, this sub-network forms part of a larger ‘network of networks’ which involves spaces such as users’ blogs, Twitter, Facebook, email and other platforms and channels.

“And yet Help Me Investigate still provided a useful space for them to work within; investigators and staffers feel that the website facilitates investigation in a way that their other social media tools could not:

““It adds the structure and the knowledge base; the challenges, integration with ‘what do they know’ ability to pose questions allows groups to structure an investigation logically and facilitates collaboration.” (Interview with investigator)” (Hickman, 2010, p12)

In the London Weekly investigation the site also helped keep track of a number of discussions taking place around the web. Having been born from a discussion on Twitter, further conversations on Twitter resulted in further people signing up, along with comments threads and other online discussion. This fit the way the site was designed culturally – to be part of a network rather than asking people to do everything on-site.

The presence of ‘alpha’ users like James and Judith was crucial in driving activity on the site – a pattern observed in other successful investigations. They picked up the threads contributed by others and not only wove them together into a coherent narrative that allowed others to enter more easily, but also set the new challenges that provided ways for people to contribute. The fact that they brought with them a strong social network presence is probably also a factor – but one that needs further research.

The site had been designed to emphasise the role of the user in driving investigations. The agenda is not owned by a central publisher, but by the person posing the question – and therefore the responsibility is theirs as well. This cultural hurdle – towards acknowledging personal power and responsibility – may be the biggest one that the site has to address, and the offer of “failure for free” (Shirky, 2008), allowing users to learn what works and what doesn’t, may support that.

The fact that crowdsourcing worked well for the investigation is worth noting, as it could be broken down into separate parts and paths – most of which could be completed online: “Where does this claim come from?” “Can you find out about this person?” “What can you discover about this company?”. One person, for example, used Google Streetview to establish that the registered address of the company was a postbox. Other investigations that are less easily broken down may be less suitable for crowdsourcing – or require more effort to ensure success.

Momentum and direction

A regular supply of updates provided the investigation with momentum. The accumulation of discoveries provided valuable feedback to users, who then returned for more. In his book on Wikipedia, Andrew Lih (2009 p82) notes a similar pattern – ‘stigmergy’ – that is observed in the natural world: “The situation in which the product of previous work, rather than direct communication [induces and directs] additional labour”. An investigation without these ‘small pieces, loosely joined’ (Weinberger, 2002) might not suit crowdsourcing so well.

Hickman’s interviews with participants in the Birmingham council website investigation found a feeling of the investigation being communally owned and led:

“Certain members were good at driving the investigation forward, helping decide on what to do next, but it did not feel like anyone was in charge as such.”

“I’d say HMI had pivital role in keeping us together and focused but it felt owned by everyone.” (Hickman 2010, p10)

One problem, however, was that the number of diverging paths led to a range of potential avenues of enquiry. In the end, although the core questions were answered (was the publication a hoax and what were the bases for their claims) the investigation raised many more questions. These remained largely unanswered once the majority of users felt that their questions had been answered. As in a traditional investigation, there came a point at which those involved had to make a judgement whether they wished to invest any more time in it.

Finally, the investigation benefited from a diverse group of contributors who contributed specialist knowledge or access. Some physically visited stations where the newspaper was claiming distribution to see how many copies were being handed out. Others used advanced search techniques to track down details on the people involved and the claims being made, or to make contact with people who had had previous experiences with those behind the newspaper. The visibility of the investigation online also led to more than one ‘whistleblower’ approach providing inside information, which was not published on the site but resulted in new challenges being set.

The final part of this series outlines some conclusions to be taken from the project, and where it plans to go next.

What are the characteristics of a crowdsourced investigation? A case study in crowdsourcing investigative journalism part 5

Continuing the serialisation of the research underpinning a new Help Me Investigate project, in this fifth part I explore the characteristics of crowdsourcing outlined in the literature. Previous parts are linked below:

What are the characteristics of a crowdsourced investigation?

Tapscott and Williams (2006, p269) explore a range of new models of collaboration facilitated by online networks across a range of industries. These include:

  • Peer producers creating “products made of bits – from operating systems to encyclopedias”
  • “Ideagoras … a global marketplace of ideas, innovations and uniquely qualified minds”
  • Prosumer – ‘professional consumer’ – communities which can produce value if given the right tools by companies
  • Collaborative science (“The New Alexandrians”)
  • Platforms for participation
  • “Global plant floors” – physical production lines split across countries
  • Wiki workplaces which cut across organisational hierarchies

Most of these innovations have not touched the news industry, and some – such as platforms for participation – are used in publishing, but rarely in news production itself (an exception here can be made for a few magazine communities, such as Reed Business Information’s Farmer’s Weekly).

Examples of explicitly crowdsourced journalism can be broadly classified into two types. The first – closest to the ‘Global plant floors’ described above – can be described as the ‘Mechanical Turk’ model (after the Amazon-owned web service that allows you to offer piecemeal payment for repetitive work). This approach tends to involve large numbers of individuals performing small, similar tasks. Examples from journalism would include The Guardian’s experiment with inviting users to classify MPs’ expenses in order to find possible stories, or the pet food bloggers inviting users to add details of affected pets to their database.

The second type – closest to the ‘peer producers’ model – can be described as the ‘Wisdom of Crowds’ approach (after James Surowiecki’s 2005 book of the same name). This approach tends to involve smaller numbers of users performing discrete tasks that rely on a particular expertise. It follows the creed of open source software development, often referred to as Linus’ Law, which states that: “Given enough eyeballs, all bugs are shallow” (Raymond, 1999). The Florida News Press example given above fits into this category, relying as it did on users with specific knowledge (such as engineering or accounting) or access. Another example – based explicitly on examples in Surowiecki’s book – is that of an experiment by The Guardian’s Charles Arthur to predict the specifications of Apple’s rumoured tablet (Arthur, 2010). Over 10,000 users voted on 13 questions, correctly predicting its name, screen size, colour, network and other specifications – but getting other specifications, such as its price, wrong.

Help Me Investigate fits into the ‘Wisdom of Crowds’ category: rather than requiring users to complete identical tasks, the technology splits investigations into different ‘challenges’. Users are invited to tag themselves so that it is easier to locate users with particular expertise (tagged ‘FOI’ or ‘lawyer’ for example) or in a particular location, and many investigations include a challenge to ‘invite an expert’ from a particular area that is not represented in the group of users.

Some elements of Tapscott and Williams’s list can also be related to Help Me Investigate’s processes: for example, the site itself was a ‘platform for participation’ which allowed users from different professions to collaborate without any organisational hierarchy. There was an ‘ideagora’ for suggesting ways of investigating, and the resulting stories were examples of peer production.

One of the first things the research analysed was whether the investigation data matched up to patterns observed elsewhere in crowdsourcing and online activity. An analysis of the number of actions by each user, for example, showed a clear ‘power law’ distribution, where a minority of users accounted for the majority of activity.

This power law, however, did not translate into a breakdown approaching the 90-9-1 ‘law of participation inequality’ observed by Jakob Nielsen (2006). Instead, the balance between those who made a couple of contributions (normally the 9% of the 90-9-1 split) and those who made none (the 90%) was roughly equal. This may have been because the design of the site meant it was not possible to ‘lurk’ without being a member of the site already, or being invited and signing up. Adding in data on those looking at the investigation page who were not members may shed further light on this.
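To make the comparison with the 90-9-1 split concrete, here is a rough Python sketch of how such a breakdown might be calculated from a count of actions per user. The figures are invented purely for illustration and are not the investigation’s data; the function names and the threshold for counting someone as a heavy contributor are also assumptions, and the functions expect at least one user with at least one action.

```python
from typing import Dict


def participation_breakdown(
    actions_per_user: Dict[str, int], heavy_threshold: int = 10
) -> Dict[str, float]:
    """Share of users who did nothing, a little, or a lot (threshold is an assumption)."""
    total = len(actions_per_user)
    none = sum(1 for n in actions_per_user.values() if n == 0)
    heavy = sum(1 for n in actions_per_user.values() if n >= heavy_threshold)
    occasional = total - none - heavy
    return {
        "none": none / total,
        "occasional": occasional / total,
        "heavy": heavy / total,
    }


def top_share(actions_per_user: Dict[str, int], top_n: int) -> float:
    """Fraction of all actions accounted for by the `top_n` most active users."""
    counts = sorted(actions_per_user.values(), reverse=True)
    return sum(counts[:top_n]) / sum(counts)


# Invented example: ten users, a few of whom account for most of the activity
example = {"a": 40, "b": 12, "c": 3, "d": 2, "e": 1,
           "f": 1, "g": 0, "h": 0, "i": 0, "j": 0}
breakdown = participation_breakdown(example)
print(breakdown)
print(top_share(example, 2))
```

In this invented example the ‘none’ and ‘occasional’ groups are the same size, which is the kind of roughly equal balance described above, while the top two users still account for most of the actions.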

In Jon Hickman’s ethnography of a different investigation (into the project to deliver a new website for Birmingham City Council) he found a similar pattern: of the 32 ‘investigators’, thirteen did nothing more than join the investigation. Others provided “occasional or one-off contributions”, and a few were “prolific” (Hickman, 2010, p10). Rather than being an indication of absence, however, Hickman notes the literature on lurking that suggests it provides an opportunity for informal learning. He identifies support for this in his interviews with lurkers on the site:

“One lurker was a key technical member of the BCC DIY collective: the narrative within Help Me Investigate suggested a low level of engagement with the process and yet this investigator was actually quite prominent in terms of their activism; the lurker was producing pragmatic outcomes and responses to the investigation, although he produced no research for the project. On a similar note, several of the BCC DIY activists were neither active nor lurking within Help Me Investigate. For example, one activist’s account of BCC DIY shows awareness of, and engagement with, the connection between the activist activity and the investigation, even though he is not an active member of the investigation within Help Me Investigate.” (Hickman, 2010, p17)

In the next part I explore what qualities made for successful crowdsourcing in the specific instance of Help Me Investigate.