Monthly Archives: November 2011

What are the characteristics of a crowdsourced investigation? A case study in crowdsourcing investigative journalism part 5

Continuing the serialisation of the research underpinning a new Help Me Investigate project, in this fifth part I explore the characteristics of crowdsourcing outlined in the literature. Previous parts are linked below:

What are the characteristics of a crowdsourced investigation?

Tapscott and Williams (2006, p269) explore a range of new models of collaboration facilitated by online networks across a range of industries. These include:

  • Peer producers creating “products made of bits – from operating systems to encyclopedias”
  • “Ideagoras … a global marketplace of ideas, innovations and uniquely qualified minds”
  • Prosumer – ‘professional consumer’ – communities which can produce value if given the right tools by companies
  • Collaborative science (“The New Alexandrians”)
  • Platforms for participation
  • “Global plant floors” – physical production lines split across countries
  • Wiki workplaces which cut across organisational hierarchies

Most of these innovations have not touched the news industry, and some – such as platforms for participation – are used in publishing, but rarely in news production itself (an exception here can be made for a few magazine communities, such as Reed Business Information’s Farmer’s Weekly).

Examples of explicitly crowdsourced journalism can be broadly classified into two types. The first – closest to the ‘Global plant floors’ described above – can be described as the ‘Mechanical Turk’ model (after the Amazon-owned web service that allows you to offer piecemeal payment for repetitive work). This approach tends to involve large numbers of individuals performing small, similar tasks. Examples from journalism would include The Guardian’s experiment with inviting users to classify MPs’ expenses in order to find possible stories, or the pet food bloggers inviting users to add details of affected pets to their database.

The second type – closest to the ‘peer producers’ model – can be described as the ‘Wisdom of Crowds’ approach (after James Surowiecki’s 2005 book of the same name). This approach tends to involve smaller numbers of users performing discrete tasks that rely on a particular expertise. It follows the creed of open source software development, often referred to as Linus’ Law, which states that: “Given enough eyeballs, all bugs are shallow” (Raymond, 1999). The Florida News Press example given above fits into this category, relying as it did on users with specific knowledge (such as engineering or accounting) or access. Another example – based explicitly on examples in Surowiecki’s book – is that of an experiment by The Guardian’s Charles Arthur to predict the specifications of Apple’s rumoured tablet (Arthur, 2010). Over 10,000 users voted on 13 questions, correctly predicting its name, screen size, colour, network and other specifications – but getting other specifications, such as its price, wrong.

Help Me Investigate fits into the ‘Wisdom of Crowds’ category: rather than requiring users to complete identical tasks, the technology splits investigations into different ‘challenges’. Users are invited to tag themselves so that it is easier to locate users with particular expertise (tagged ‘FOI’ or ‘lawyer’ for example) or in a particular location, and many investigations include a challenge to ‘invite an expert’ from a particular area that is not represented in the group of users.

Some elements of Tapscott and Williams’s list can also be related to Help Me Investigate’s processes: for example, the site itself was a ‘platform for participation’ which allowed users from different professions to collaborate without any organisational hierarchy. There was an ‘ideagora’ for suggesting ways of investigating, and the resulting stories were examples of peer production.

One of the first things the research analysed was whether the investigation data matched up to patterns observed elsewhere in crowdsourcing and online activity. An analysis of the number of actions by each user, for example, showed a clear ‘power law’ distribution, where a minority of users accounted for the majority of activity.
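
For anyone wanting to run a similar check on their own community data, a minimal R sketch along the following lines – using a hypothetical CSV of per-user action counts rather than the original research data – gives a quick visual test:

#Minimal sketch, not the original research code: assumes a hypothetical CSV
#with one row per user and an 'actions' column, e.g.
#user,actions
#alice,120
#bob,3
actions <- read.csv("user_actions.csv")

#Sort users by activity: a 'power law' shows up as a long tail
counts <- sort(actions$actions, decreasing = TRUE)
barplot(counts, ylab = "Actions per user")

#Rank vs count on log-log axes should be roughly linear if a minority
#of users account for the majority of activity
counts <- counts[counts > 0]
plot(seq_along(counts), counts, log = "xy", xlab = "User rank", ylab = "Number of actions")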

This power law, however, did not translate into a breakdown approaching the 90-9-1 ‘law of participation inequality’ observed by Jakob Nielsen (2006). Instead, the balance between those who made a couple of contributions (normally the 9% of the 90-9-1 split) and those who made none (the 90%) was roughly equal. This may have been because the design of the site meant it was not possible to ‘lurk’ without being a member of the site already, or being invited and signing up. Adding in data on those looking at the investigation page who were not members may shed further light on this.

In Jon Hickman’s ethnography of a different investigation (into the project to deliver a new website for Birmingham City Council) he found a similar pattern: of the 32 ‘investigators’, thirteen did nothing more than join the investigation. Others provided “occasional or one-off contributions”, and a few were “prolific” (Hickman, 2010, p10). Rather than being an indication of absence, however, Hickman notes the literature on lurking that suggests it provides an opportunity for informal learning. He identifies support for this in his interviews with lurkers on the site:

“One lurker was a key technical member of the BCC DIY collective: the narrative within Help Me Investigate suggested a low level of engagement with the process and yet this investigator was actually quite prominent in terms of their activism; the lurker was producing pragmatic outcomes and responses to the investigation, although he produced no research for the project. On a similar note, several of the BCC DIY activists were neither active nor lurking within Help Me Investigate. For example, one activist’s account of BCC DIY shows awareness of, and engagement with, the connection between the activist activity and the investigation, even though he is not an active member of the investigation within Help Me Investigate.” (Hickman, 2010, p17)

In the next part I explore what qualities made for successful crowdsourcing in the specific instance of Help Me Investigate.

Following the money: making networks visible with HTML5

Network analysis – the ability to map connections between people and organisations – is one branch of data journalism which has enormous potential. But it is also an area which has not yet been particularly well explored, partly because of the lack of simple tools with which to do it.

One recent example – AngelsOfTheRight.net – is particularly interesting, because of the way that it is experimenting with HTML5.

The site is attempting to map “relationships among institutions due to the exchange of large quantities of money between them as reported to IRS in a decade of Form 990 tax filings.”

But it’s also attempting to “push the limits” of using HTML5 to create network maps. As this blog post explains:

“This project was built using the NodeViz project […] which wraps up a bunch of the functionality needed to squeeze network ties out of a database, through Graphviz, and into a browser with features like zooming, panning, and full DOM and JavaScript interaction with the rest of the page content. This means that we can do fun things like have a tour to mode a viewer through the map, and have list views of related data alongside the map that will open and focus on related nodes when clicked. It is also supposed to degrade gracefully to just display a clickable image on non-SVG browsers like Internet Explorer 7 and 8.”

HTML5 offers some other interesting possibilities, such as improved search engine optimisation compared to a static image or Flash interactive, although I have no idea how much this project explores that (comments invited).

Also interesting is the discussion section of AngelsOfTheRight.net, which outlines some of the holes in the data, methodological flaws, and ways that the project could be improved:

“In this sort of survey, it is always hard to tell if organizations are missing because they really didn’t make contributions, or just because nobody had time to record the data from their financial statements into the database. Several sources mention the Adolph Coors Foundation as an important funder of the conservative agenda, yet they do not appear in this database. Why not?”

via Pete Warden

A case study in crowdsourcing investigative journalism (part 4): The London Weekly

Continuing the serialisation of the research underpinning a new Help Me Investigate project, in this fourth part I describe how one particular investigation took shape. Previous parts are linked below:

Case study: the London Weekly investigation

In early 2010 Andy Brightwell and I conducted some research into one particular successful investigation on the site. The objective was to identify what had made the investigation successful – and how (or if) those conditions might be replicated for other investigations both on the site and elsewhere online.

The investigation chosen for the case study was ‘What do you know about The London Weekly?’ – an investigation into a free newspaper that was, the owners claimed (part of the investigation was to establish if the claim was a hoax), about to launch in London.

The people behind The London Weekly had made a number of claims about planned circulation, staffing and investment which went unchallenged in specialist media. Journalists Martin Stabe, James Ball and Judith Townend, however, wanted to dig deeper. So, after an exchange on Twitter, Judith logged onto Help Me Investigate and started an investigation.

A month later members of the investigation (most of whom were non-journalists) had unearthed a wealth of detail about the people behind The London Weekly and the facts behind their claims. Some of the information was reported in MediaWeek and The Guardian podcast Media Talk; some formed the basis for posts on James Ball’s blog, Journalism.co.uk and the Online Journalism Blog. Some has, for legal reasons, remained unpublished.

Methodology

Andrew Brightwell conducted a number of semi-structured interviews with contributors to the investigation. The sample was randomly selected but representative of the mix of contributors, who were categorised as ‘alpha’ contributors (over 6 contributions), ‘active’ (2-6 contributions) or ‘lurkers’ (whose only contribution was to join the investigation). These interviews formed the qualitative basis for the research.

Complementing this data was quantitative information about users of the site as a whole. This was taken from two user surveys – one conducted when the site was three months old and another at 12 months – and from analysis of analytics taken from the investigation (such as numbers and types of actions, frequency, etc.).

In the next part I explore some of the characteristics of a crowdsourced investigation and how these relate to the wider literature around crowdsourcing in general.

Crowdsourcing investigative journalism: a case study (part 3)

Continuing the serialisation of the research underpinning a new Help Me Investigate project, in this third part I describe how the focus of the site was shaped by the interests of its users and staff, and how site functionality was changed to react to user needs. I also identify some areas where the site could have been further developed and improved. (Part 1 is available here; Part 2 is here)

Reflections on the proof of concept phase

By the end of the 12 week proof of concept phase the site had also completed a number of investigations that were not ‘headline-makers’ but fulfilled the objective of informing users: in particular ‘Why is a new bus company allowed on an existing route with same number, but higher prices?’; ‘What is the tracking process for petitions handed in to Birmingham City Council?’ and ‘The DVLA and misrepresented number plates’.

The site had also unearthed some promising information that could provide the basis for more stories, such as Birmingham City Council receiving over £160,000 in payments for vehicle removals; and ‘Which councils in the UK (that use Civil Enforcement) make the most from parking tickets?’ (as a byproduct, this also unearthed how well different councils responded to Freedom of Information requests).

A number of news organisations expressed an interest in working with the site, but practical contributions to the site took place largely at an individual rather than organisational level. Journalist Tom Scotney, who was involved in one of the investigations, commented: “Get it right and you’re becoming part of an investigative team that’s bigger, more diverse and more skilled than any newsroom could ever be” (Scotney, 2009, n.p.) – but it was becoming clear that most journalists were not culturally prepared – or did not have the time – to engage with the site unless there was a story ‘ready made’ for them to use. Once there were stories to be had, however, they played a valuable role in writing those stories up, obtaining official reactions, and increasing their visibility.

After 12 weeks the site had around 275 users (whose backgrounds ranged from journalism and web development to locally active citizens) and 71 investigations, exceeding project targets. It is difficult to measure ‘success’ or ‘failure’ but at least eight investigations had resulted in coherent stories, representing a success rate of at least 11%: the target figure before launch had been 1-5%. That figure rose to around 21% if other promising investigations were included, and the sample included recently initiated investigations which were yet to get off the ground.

‘Success’ was an interesting metric which deserves further elaboration. In his reflection on The Guardian’s crowdsourcing experiment, for example, developer Martin Belam (2011a, n.p.) noted a tendency to evaluate success “not purely editorially, but with a technology mindset in terms of the ‘100% – Achievement unlocked!’ games mechanic.” In other words, success might be measured in terms of degrees of ‘completion’ rather than results.

In contrast, the newspaper’s journalist Paul Lewis saw success in terms of something other than pure percentages: getting 27,000 people to look at expense claims was, he felt, a successful outcome, regardless of the percentage of claims that those represented. And BBC Special Reports Editor Bella Hurrell – who oversaw a similar but less ambitious crowdsourcing project on the same subject on the broadcaster’s website – felt that they had also succeeded in genuine ‘public service journalism’ in the process (personal interview).

A third measure of success is noted by Belam – that of implementation and iteration (being able to improve the service based on how it is used):

“It demonstrated that as a team our tech guys could, in the space of around a week, get an application deployed into the cloud but appear integrated into our site, using a technology stack that was not our regular infrastructure.

“Secondly, it showed that as a business we could bring people together from editorial, design, technology and QA to deliver a rapid turnaround project in a multi-disciplinary way, based on a topical news story.

“And thirdly, we learned from and improved upon it.” (Belam, 2010, n.p.)

A percentage ‘success’ rate for Help Me Investigate, then, represents a similarly ‘game-oriented’ perspective on the site, and it is important to draw on other frameworks to measure its success.

For example, it was clear that the site did very well in producing raw material for ‘journalism’, but it was less successful in generating more general civic information such as how to find out who owned a piece of land. Returning to the ideas of Actor-Network Theory outlined above, the behaviour of two principal actors – and one investigation – had a particular influence on this, and how the site more generally developed over time. Site user Neil Houston was an early adopter of the site and one of its heaviest contributors. His interest in interrogating data helped shape the path of many of the site’s most active investigations, which in turn set the editorial ‘tone’ of the site. This attracted users with similar interests to Neil, but may have discouraged others who did not share those interests – further research would be needed to establish this.

Likewise, while Birmingham City Council staff contributed to the site in its earliest days, when the council became the subject of an investigation staff’s involvement was actively discouraged (personal interview with contributor). This left the site short of particular expertise in answering civic questions.

At least one user commented that the site was very ‘FOI [Freedom Of Information request]-heavy’ and risked excluding users interested in different types of investigations, or who saw Freedom of Information requests as too difficult for them. This could be traced directly to the appointment of Heather Brooke as the site’s support journalist. Heather is a leading Freedom of Information activist and user of FOI requests: this was an enormous strength in supporting relevant investigations but it should also be recognised how that served to set the editorial tone of the site.

This narrowing of tone was addressed by bringing in a second support journalist with a consumer background: Colin Meek. There was also a strategic shift in community management towards actively engaging users with other investigations. As more users came onto the site these investigations broadened into consumer, property and legal areas.

However, a further ‘actor’ then came into play: the legal and insurance systems. With the end of proof of concept funding and the associated legal insurance, the team had to close investigations unrelated to the public sector, as these left the site most vulnerable legally.

A final example of Actor-Network Theory in action was a difference between the intentions of the site designers and its users. The founders wanted Help Me Investigate to be a place for consensus, not discussion, but it was quickly apparent users did not want to have to go elsewhere to have their discussions. Users needed to – and did – have conversations around the updates that they posted.

The initial challenge-and-result model (breaking investigations down into challenges with entry fields for the subsequent results, which were required to include a link to the source of their information) was therefore changed very early on to challenge-and-update: people could now update without a link, simply to make a point about a previous result, or to explain their efforts in failing to obtain a result.

One of the challenges least likely to be accepted by users was to ‘Write the story up’. It seemed that those who knew the investigation had no need to write it up: the story existed in their heads. Instead it was either site staff or professional journalists who would normally write up the results. Similarly, when an investigation was complete, it required site staff to update the investigation description to include a link to any write-up. There was no evidence of a desire from users to ‘be a journalist’. Indeed, the overriding objective appeared rather to ‘be a citizen’.

In contrast, a challenge to write ‘the story so far’ seemed more appealing in investigations that had gathered data but no resolution as yet. The site founders underestimated the need for narrative in designing a site that allowed users to join investigations while they were in progress.

As was to be expected with a ‘proof of concept’ site (one testing whether an idea could work), there were a number of areas of frustration in the limitations of the site – and identification of areas of opportunity. When looking to crowdfund small amounts for an investigation, for example, there were no third party tools available that would allow this without going through a nonprofit organisation. And when an investigation involved a large crowdsourcing operation the connection to activity conducted on other platforms needed to be stronger so users could more easily see what needed doing (e.g. a live feed of changes to a Google spreadsheet, or documents bookmarked using Delicious).

Finally, investigations often evolved into new questions but had to stay with an old title or risk losing the team and resources that had been built up. The option to ‘export’ an investigation team and resources into a fresh question/investigation was one possible future solution.

‘Failure for free’ was part of the design of the site in order to allow investigations to succeed on the efforts of its members rather than as a result of any top-down editorial agenda – although naturally journalist users would concentrate their efforts on the most newsworthy investigations. In practice it was hard to ‘let failure happen’, especially when almost all investigations had some public interest value.

Although the failure itself was not an issue (and indeed the failure rate was lower than expected), a ‘safety net’ was needed that would more proactively suggest ways investigators could make their investigation a success, including features such as investigation ‘mentors’ who could pass on their experience; ‘expiry dates’ on challenges with reminders; improved ability to find other investigators with relevant skills or experience; a ‘sandbox’ investigation for new users to find their feet; and developing a metric to identify successful and failing investigations.

Communication was central to successful investigations and two areas required more attention: staff time in pursuing communication with users; and technical infrastructure to automate and facilitate communication (such as alerts to new updates or the ability to mail all investigation members).

The much-feared legal issues threatened by the site did not particularly materialise. Out of over 70 investigations in the first 12 weeks, only four needed rephrasing to avoid being potentially libellous. Two involved minor tweaks; the other two were more significant, partly because of a related need for clarity in the question.

Individual updates within investigations, which were post-moderated, presented even less of a legal problem. Only two updates were referred for legal advice, and only one of those was rephrased. One was flagged and removed because it was ‘flamey’ and did not contribute to the investigation.

There was a lack of involvement by users across investigations. Users tended to stick to their own investigation and the idea of ‘helping another so they help you’ did not take root. Further research is needed to see if there was a power law distribution at work here – often seen on the internet – of a few people being involved in lots of investigations, most being involved in one, and a steep upward curve between.

In the next part I look at one particular investigation in an attempt to identify the qualities that made it successful.

If you want to get involved in the latest Help Me Investigate project, get in touch on paul@helpmeinvestigate.com

Getting Started With Twitter Analysis in R

Earlier today, I saw via the aggregating R-Bloggers service a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walkthrough of how to grab tweets into an R session using the twitteR library, and then do some text mining on it.

I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t help but have a quick play…

Starting from @RDataMiner’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, from the Packages tab search for the missing library and import it, then try to reload the library in the script. The # denotes a commented out line.)

require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)

#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)

So what can we do out of the can? One thing is to look at who was tweeting most in the sample we collected:

counts=table(df$screenName)
barplot(counts)

# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)

Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:

#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))

#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))

So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…

require(ggplot2)
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+opts(axis.text.x=theme_text(angle=-90,size=6))+xlab(NULL)

Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored in a comment, or in your own post with a link in my comments;-)

Crowdsourcing investigative journalism: a case study (part 2)

Continuing the serialisation of the research underpinning a new Help Me Investigate project, in this second part I describe the basis for the way that the original site was constructed – and the experiences of its first few months. (Part 1 is available here)

Building the site

By 2008 two members had joined the Help Me Investigate team: web developer Stef Lewandowski and community media specialist Nick Booth. The project won funding that year from Channel 4’s 4iP fund and regional development agency Screen West Midlands.

Two part time members of ‘staff’ were recruited to work one day per week for the site throughout the 12 week funded ‘proof of concept’ period: a support journalist and a community manager.

Site construction began in April 2009, starting by expanding the four target user profiles in the bid document into 12 profiles of users who might be attracted to the site, and identifying what they would want to do with the site and how the design might facilitate that – or prevent it (as in the case, for example, of users who might want to hijack or hoax the site).

This was followed by rapid site development, and testing for 6 weeks with a small private beta. The plan was to use ‘agile’ principles of web development – launching when the site was not ‘finished’ to gain an understanding of how users actually interacted with the technology, and saving the majority of the development budget for ‘iterations’ of the software in response to user demand.

The resulting site experience can be described as follows: a user coming across the site was presented with two choices: to join an existing investigation, or start their own. If they started an investigation they would be provided with suggestions for ways of breaking it down into smaller tasks and of building a community around the question being pursued. If they joined an existing investigation they would be presented with those tasks – called ‘challenges’ – that needed completing to take the investigation forward. They could then choose to accept a particular challenge and share the results of their progress underneath.

The concepts of Actor-Network Theory (Paterson and Domingo, 2008) were accounted for in development: this theory describes how the ‘inventors’ of a technology are not the only actors that shape its use; the technology itself (including its limitations, its relationship with other technologies, and institutional and funding factors) and those who use it would also be vital in shaping what happened from there.

Reserving the majority of the development budget to account for the influence of these ‘actors’ on the development of the technology was a key part of the planning of the site. This proved to be a wise strategy, as user behaviour differed in some respects from the team’s expectations, and development was able to adapt accordingly.

For legal reasons, casual visitors to the site (and search engines) could only see investigation titles (which were pre-moderated) and, later, the Reports and KnowledgeBase sections of the site (which were written by site staff). Challenges and updates (the results of challenges) – which were only post-moderated – could only be seen by registered users of the site.

A person could only become a user of the site if they were invited by another user. There was also a ‘request an invite’ section on the homepage. Non-UK requests were refused for legal reasons but most other requests were granted. At this stage the objective was not to build a huge user base but to develop a strong culture on the site that would then influence its healthy future development. This was a model based on the successful development of the constructive Seesmic video blogging community.

On July 1 HelpMeInvestigate.com went live with no promotion. The day after launch one tweet was published on Twitter, linking to the site. By the end of the week the site was investigating what would come to be one of the biggest stories of the summer in Birmingham – the overspend of £2.2m by the city council on a new website. It would go on to complete further investigations into parking tickets and the use of surveillance powers, as well as much smaller-scale questions such as how a complaint was handled, or why two bus companies were charging different prices on the same route.

In the next part I look at the strengths and limitations of the site’s model of working, and how people used the site in practice.

Crowdsourcing investigative journalism: a case study (part 1)

As I begin a new Help Me Investigate project, I thought it was a good time to share some research I conducted into the first year of the site, and the key factors in how that project tried to crowdsource investigative and watchdog journalism.

The findings of this research have been key to the development of this new project. They also form the basis of a chapter in the book Face The Future, and another due to be published in the Handbook of Online Journalism next year (not to be confused with my own Online Journalism Handbook). Here’s the report:

In both academic and mainstream literature about the world wide web, one theme consistently recurs: the lowering of the barrier allowing individuals to collaborate in pursuit of a common goal. Whether it is creating the world’s biggest encyclopedia (Lih, 2009), spreading news about a protest (Morozov, 2011) or tracking down a stolen phone (Shirky, 2008), the rise of the network has seen a decline in the role of the formal organisation, including news organisations.

Two examples of this phenomenon were identified while researching a book chapter on investigative journalism and blogs (De Burgh, 2008). The first was an experiment by The Florida News Press: when it started receiving calls from readers complaining about high water and sewage connection charges for newly constructed homes, the newspaper, short on in-house resources to investigate the leads, decided to ask its readers to help. The result is by now familiar as a textbook example of “crowdsourcing” – outsourcing a project to ‘the crowd’ or what Brogan & Smith (2009, p136) describe as “the ability to have access to many people at a time and to have them perform one small task each”:

“Readers spontaneously organized their own investigations: Retired engineers analyzed blueprints, accountants pored over balance sheets, and an inside whistle-blower leaked documents showing evidence of bid-rigging.” (Howe, 2006a)

The second example concerned contaminated pet food in the US, and did not involve a mainstream news organisation. In fact, it was frustration with poor mainstream ‘churnalism’ (see Davies, 2009) that motivated bloggers and internet users to start digging into the story. The resulting output from dozens of blogs ranged from useful information for pet owners and the latest news to the compilation of a database that suggested the official number of pet deaths recorded by the US Food and Drug Administration was short by several thousand. One site, Itchmo.com, became so popular that it was banned in China, the source of the pet food in question.

What was striking about both examples was not simply that people could organise to produce investigative journalism, but that this practice of ‘crowdsourcing’ had two key qualities that were particularly relevant to journalism’s role in a democracy. The first was engagement: in the case of the News-Press, for six weeks the story generated more traffic to its website than “ever before, excepting hurricanes” (Weise, 2007). Given that investigative journalism often concerns very ‘dry’ subject matter that has to be made appealing to a wider audience, these figures were surprising – and encouraging for publishers.

The second quality was subject: the contaminated pet food story was, in terms of mainstream news values, unfashionable and unjustifiable in terms of investment of resources. It appeared that the crowdsourcing model of investigation might provide a way to investigate stories which were in the public interest but which commercial and public service news organisations would not consider worth their time. More broadly, research on crowdsourcing suggested that it worked “best in areas that are not core to your product or central to your business model” (Tapscott and Williams, 2006, p82).

Investigative journalism: its history and discourses

DeBurgh (2008, p10) defines investigative journalism as “distinct from apparently similar work [of discovering truth and identifying lapses from it] done by police, lawyers and auditors and regulatory bodies in that it is not limited as to target, not legally founded and usually earns money for media publishers.” The term is notoriously problematic and contested: some argue that all journalism is investigative, or that the recent popularity of the term indicates the failure of ‘normal’ journalism to maintain investigative standards. This contestation is a symptom of the various factors underlying the growth of the genre, which range from journalists’ own sense of a democratic role, to professional ambition and publishers’ commercial and marketing objectives.

More recently investigative journalism has been used to defend traditional print journalism against online publishing, with publishers arguing that true investigative journalism cannot be maintained without the resources of a print operation. This position has become harder to defend as online-only operations and journalists have won increasing numbers of awards for their investigative work – Clare Sambrook in the UK, and VoiceOfSanDiego.com and Talking Points Memo in the US, are three examples – while new organisations have been established to pursue investigations without any associated print operation, including Canada’s OpenFile, the UK’s Bureau of Investigative Journalism, and a number of bodies in the US such as ProPublica, The Florida Center for Investigative Reporting, and the Huffington Post’s investigative unit.

In addition, computer technology has started to play an increasingly important role in print investigative journalism: Stephen Grey’s investigation into the CIA’s ‘extraordinary rendition’ programme (Grey, 2006) was facilitated by the use of software such as Analyst’s Notebook, which allowed him to analyse large amounts of flight data and identify leads. The Telegraph’s investigation into MPs’ expenses was made possible by digitisation of data and the ability to store large amounts on a small memory stick. And newspapers around the world collaborated with the Wikileaks website to analyse ‘warlogs’ from Iraq and Afghanistan, and hundreds of thousands of diplomatic cables. More broadly the success of Wikipedia inspired a raft of examples of ‘Wiki journalism’ where users were invited to contribute to editorial coverage of a particular issue or field, with varying degrees of success.

Meanwhile, investigative journalists such as The Guardian’s Paul Lewis have been exploring a more informal form of crowdsourcing, working with online communities to break stories including the role of police in the death of newspaper vendor Ian Tomlinson; the existence of undercover agents in the environmental protest movement; and the death of a man being deported to Angola (Belam, 2011b).

This is part of a broader move to networked journalism explored by Charlie Beckett (2008):

“In a world of ever-increasing media manipulation by government and business, it is even more important for investigative journalists to use technology and connectivity to reveal hidden truths. Networked journalists are open, interactive and share the process. Instead of gatekeepers they are facilitators: the public become co-producers. Networked journalists “are ‘medium agnostic’ and ‘story-centric’”. The process is faster and the information sticks around longer.” (2008, p147)

As one of its best-known practitioners Paul Lewis talks particularly of the role of technology in his investigations – specifically Twitter – but also the importance of the crowd itself and journalistic method:

“A crucial factor that makes crowd-sourcing a success [was that] there was a reason for people to help, in this case a perceived sense of injustice and that the official version of events did not tally with the truth. Six days after Tomlinson’s death, Paul had twenty reliable witnesses who could be placed on a map at the time of the incident – and only one of them had come from the traditional journalistic tool of a contact number in his notebook.” (Belam, 2011b)

A further key skill identified by Lewis is listening to the crowd – although he sounds a note of caution about its vulnerability to deliberately placed misinformation, and the need for verification:

“Crowd-sourcing doesn’t always work […] The most common thing is that you try, and you don’t find the information you want […] The pattern of movement of information on the internet is something journalists need to get their heads around. Individuals on the web in a crowd seem to behave like a flock of starlings – and you can’t control their direction.” (Belam, 2011b)

Conceptualising Help Me Investigate

The first plans for Help Me Investigate were made in 2008 and were further developed over the next 18 months. They built on research into crowdsourced investigative journalism, as well as other research into online journalism and community management. In particular the project sought to explore concepts of “P2P journalism” which enables “more engaged interaction between and amongst users” (Bruns, 2005, p120, emphasis in original) and of “produsage”, whose affordances included probabilistic problem solving, granular tasks, equipotentiality, and shared content (Bruns, 2008, p19).

A key feature in this was the ownership of the news agenda by users themselves (who could be either members of the public or journalists). This was partly for reasons identified above in research into the crowdsourced investigation into contaminated pet food. It would allow the site to identify questions that would not be considered viable for investigation within a traditional newsroom; but the feature was also implemented because ‘ownership’ was a key area of contestation identified within crowdsourcing research (Lih, 2009; Benkler, 2006; Surowiecki, 2005) – ‘outsourcing’ a project to a group of people raises obvious issues regarding claims of authorship, direction and benefits (Bruns, 2005).

These issues were considered carefully by the founders. The site adopted a user interface with three main modes of navigation for investigations: most-recent-top; most popular (those investigations with the most members); and two ‘featured’ investigations chosen by site staff: these were chosen on the basis that they were the most interesting editorially, or because they were attracting particular interest and activity from users at that moment. There was therefore an editorial role, but this was limited to only two of the 18 investigations listed on the ‘Investigations’ page, and was at least partly guided by user activity.

In addition there were further pages where users could explore investigations through different criteria such as those investigations that had been completed, or those investigations with particular tags (e.g. ‘environment’, ‘Bristol’, ‘FOI’, etc.).

A second feature of the site was that ‘journalism’ was intended to be a by-product: the investigation process itself was the primary objective, which would inform users. Research suggested that if users were to be attracted to the site, it must perform the function that they needed it to (Porter, 2008) – which was, as became apparent, one of project management. The ‘problem’ that the site was attempting to ‘solve’ needed to be user-centric rather than publisher-centric: ‘telling stories’ would clearly be lower down the priority list for users than it was for journalists and publishers. Of higher priority were the need to break down a question into manageable pieces; find others to investigate those with; and get answers. This was eventually summarised in the strapline to the site: “Connect, mobilise, uncover”.

Thirdly, there was a decision to use ‘game mechanics’ that would make the process of investigation inherently rewarding. As the site and its users grew, the interface was changed so that challenges started on the left hand side of the screen, coloured red, then moved to the middle when accepted (the colour changing to amber), and finally to the right column when complete (now with green border and tick icon). This made it easier to see at a glance what needed doing and what had been achieved, and also introduced a level of innate satisfaction in the task. Users, the idea went, might grow to like to feeling of moving those little blocks across the screen, and the positive feedback (see Graham, 2010 and Dondlinger, 2007) provided by the interface.

Similar techniques were coincidentally explored at the same time by The Guardian’s MPs’ expenses app (Bradshaw, 2009). This provided an interface for users to investigate MP expense claim forms that used many conventions of game design, including a ‘progress bar’, leaderboards, and button-based interfaces. A second iteration of the app – created when a second batch of claim forms were released – saw a redesigned interface based on a stronger emphasis on positive feedback. As developer Martin Belam explains (2011a):

“When a second batch of documents were released, the team working on the app broke them down into much smaller assignments. That meant it was easier for a small contribution to push the totals along, and we didn’t get bogged down with the inertia of visibly seeing that there was a lot of documents still to process.

“By breaking it down into those smaller tasks, and staggering their start time, you concentrated all of the people taking part on one goal at a time. They could therefore see the progress dial for that individual goal move much faster than if you only showed the progress across the whole set of documents.”

These game mechanics are not limited to games: many social networking sites have borrowed the conventions to provide similar positive feedback to users. Jon Hickman (2010, p2) describes how Help Me Investigate uses these genre codes and conventions:

“In the same way that Twitter records numbers of “followers”, “tweets”, “following” and “listed”, Help Me Investigate records the number of “things” which the user is currently involved in investigating, plus the number of “challenges”, “updates” and “completed investigations” they have to their credit. In both Twitter and Help Me Investigate these labels have a mechanistic function: they act as hyperlinks to more information related to the user’s profile. They can also be considered culturally as symbolic references to the user’s social value to the network – they give a number and weight to the level of activity the user has achieved, and so can be used in informal ranking of the user’s worth, importance and usefulness within the network.” (2010, p8)

This was indeed the aim of the site design, and was related to a further aim of the site: to allow users to build ‘social capital’ within and through the site: users could add links to web presences and Twitter accounts, as well as add biographies and ‘tag’ themselves. They were also ranked in a ‘Most active’ table; and each investigation had its own graph of user activity. This meant that users might use the site not simply for information-gathering reasons, but also for reputation building ones, a characteristic of open source communities identified by Bruns (2005) and Leadbeater (2008) among others.

There were plans to take these ideas much further which were shelved during the proof of concept phase as the team concentrated on core functionality. For example, it was clear that users needed to be able to give other users praise for positive contributions, and they used the ‘update feature’ to do so. A more intuitive function allowing users to give a ‘thumbs up’ to a contribution would have made this easier, and also provided a way to establish the reputation of individual users, and encourage further use.

Another feature of the site’s construction was a networked rather than centralised design. The bid document to 4iP proposed to aggregate users’ material:

“via RSS and providing support to get users onto use web-based services. While the technology will facilitate community creation around investigations, the core strategy will be community-driven, ‘recruiting’ and supporting alpha users who can drive the site and community forward.”

Again, this aggregation functionality was dropped as part of focusing the initial version of the site. However, the basic principle of working within a network was retained, with many investigations including a challenge to blog about progress on other sites, or use external social networks to find possible contributors. The site included guidance on using tools elsewhere on the web, and many investigations linked to users’ blog posts.

In the second part I discuss the building of the site and reflections on the site’s initial few months.

Announcing Help Me Investigate: Networks

Help Me Investigate

Today I’m announcing the launch of a new Help Me Investigate project.

Help Me Investigate: Networks aims to make it easier to investigate public interest questions by providing resources and support, links to investigations across the web, and most importantly: a community.

The project is launching with a focus on 3 areas: Education, Health and Welfare. We’ll be providing tips from practising journalists, updates on ongoing investigations, and useful documents and data.

The existing site blog will continue to provide general advice on investigative journalism.

A launchpad and gathering place

This is an attempt to build a scalable network of journalists, developers and active citizens who are passionate about public interest issues.

Although we’re starting with a focus on three of those, if anyone is willing to manage sites covering other areas, including geographical ones, we may be able to host those too (and some are already being planned).

Unlike the original Help Me Investigate, most investigations will not take place on the HMI:Network sites – instead taking place on other blogs, or through private correspondence – although the tips, documents and data gathered in those investigations will be shared on the site.

How people are contributing

Different people are contributing to the project in different ways.

  • Journalists and bloggers who need help with getting answers to a question (extra eyes on data, legwork), or finding the questions themselves, are using the network as a place to connect.
  • Journalism tutors are tapping into the network for class projects.
  • Journalism students and graduates who want to explore a public interest issue are using it as a place to find others, get help, and publish what they find.

If you want to join in or find out more, please email me on paul@helpmeinvestigate.com or message me on Twitter @paulbradshaw. Or just tell someone about the project. They might find it useful.

Meanwhile, in the following days I’ll be publishing a series of posts about what I learned from the first version of Help Me Investigate, and how that has informed this new project.

Data Referenced Journalism and the Media – Still a Long Way to Go Yet?

Reading our local weekly press this evening (the Isle of Wight County Press), I noticed a page 5 headline declaring “Alarm over death rates at St Mary’s”, St Mary’s being the local general hospital. It seems a Department of Health report on hospital mortality rates came out earlier this week, and the Island’s hospital has not performed so well…

Seeing the headline – and reading the report – I couldn’t help but think of Ben Goldacre’s Bad Science column in the Observer last week (DIY statistical analysis: experience the thrill of touching real data), which commented on the potential for misleading reporting around bowel cancer death rates; among other things, the column described a statistical graphic known as a funnel plot which could be used to support the interpretation of death rate statistics and communicate the extent to which a particular death rate, for a given head of population, was “significantly unlikely” in statistical terms given the distribution of death rates across different population sizes.

I also put together a couple of posts describing how the funnel plot could be generated from a data set using the statistical programming language R.

Given the interest there appears to be around data journalism at the moment (amongst the digerati at least), I thought there might be a reasonable chance of finding some data inspired commentary around the hospital mortality figures. So what sort of report did the Guardian (Call for inquiries at 36 NHS hospital trusts with high death rates) or the Telegraph (36 hospital trusts have higher than expected death rates), both of which have pioneering data journalists working for them, come up with? Little more than the official press release: New hospital mortality indicator to improve measurement of patient safety.

The reports were both formulaic, leading with the worst performing hospital (which admittedly was not mentioned in the press release) and including some bog standard quotes from the responsible Minister lifted straight out of the press release (and presumably written by someone working for the Ministry…). Neither the Guardian nor the Telegraph story contained a link to the original data, which was linked to from the press release as part of the Notes to editors rider.

If we do a general, recency filtered, search for hospital death rates on either Google web search:

UK hospital death rates reporting

or Google news search:

UK hospital death rate reporting

we see a wealth of stories from various local press outlets. This was a story with national reach and local colour, and local data set against a national backdrop to back it up. Rather than drawing on the Ministerial press release quotes, a quick scan of the local news reports suggests that at least the local journalists made some effort compared to the nationals’ churnalism, and got quotes from local NHS spokespeople to comment on the local figures. Most of the local reports I checked did not give a link to the original report, or dig too deeply into the data. However, This is Tamworth (which had a Tamworth Herald byline in the Google News results) did publish the URL to the full report in its article Shock report reveals hospital has highest death rate in country, although not actually as a link… Just by the by, I also noticed the headline was flagged with a “Trusted Source” badge:

Which is the trusted source?

Is that Tamworth Herald as the trusted source, or the Department of Health?!

Given that just a few days earlier, Ben Goldacre had provided an interesting way of looking at death rate data, it would have been nice to think that maybe it could have influenced someone out there to try something similar with the hospital mortality data. Indeed, if you check the original report, you can find a document describing How to interpret SHMI bandings and funnel plots (although, admittedly, not that clearly perhaps?). And along with the explanation, some example funnel plots.

However, the plots as provided are not that useful. They aren’t available as image files in a social or rich media press release format, nor are statistical analysis scripts provided that would allow the plots to be generated from the supplied data in a tool like R; that is to say, the executable working wasn’t shown…

So here’s what I’m thinking: firstly, we need data press officers as well as data journalists. Their job would be to put together the tools that support the data churnalist in taking the raw data and producing statistical charts and interpretation from it. Just like the ministerial quote can be reused by the journalist, so the data press pack can be used to help the journalist get some graphs out there to help them illustrate the story. (The finishing of the graph would be up to the journalist, but the mechanics of the generation of the base plot would be provided as part of the data press pack.)

Secondly, there may be an opportunity for an enterprising individual to take the data sets and produced localised statistical graphics from the source data. In the absence of a data press officer, the enterprising individual could even fulfil this role. (To a certain extent, that’s what the Guardian Datastore does.)

(Okay, I know: the local press will have allocated only a certain amount of space to the story, and the editor would likely see any mention of stats or funnel plots as scaring folk off, but we have to start changing attitudes, expectations, willingness and ability to engage with this sort of stuff somehow. Most people have very little education in reading any charts other than pie charts, bar charts, and line charts, and even then are easily misled. We have to start working on this, we have to start looking at ways of introducing more powerful plots and charts and helping people get a folk understanding of them. And funnel plots may be one of the things we should be starting to push?)

Now back to the hospital data. In How Might Data Journalists Show Their Working? Sweave, I posted a script that included the working for generating a funnel plot from an appropriate online CSV data source. Could this script be used to generate a funnel plot from the hospital data?

I had a quick play, and managed to get a scatterplot distribution that looks like the one on the funnel plot explanation guide by setting the number value to the SHMI Indicator data (csv) EXPECTED column and the p to the VALUE column. However, because the p value isn’t a probability in the range 0..1, the p.se calculation fails:
p.se <- sqrt((p*(1-p)) / (number))

Anyway, here’s the script for generating the straightforward scatter plot (I had to read the data in from a local file because there was some issue with the security certificate trying to read the data in from the online URL using the RCurl library and hospitaldata = data.frame( read.csv( textConnection( getURL( DATA_URL ) ) ) ):

require(ggplot2)

#Read the SHMI data in from a local copy of the CSV
hospitaldata = read.csv("~/Downloads/SHMI_10_10_2011.csv")
number = hospitaldata$EXPECTED
p = hospitaldata$VALUE
df = data.frame(p, number, Area=hospitaldata$PROVIDER.NAME)
#Simple scatterplot of indicator VALUE against EXPECTED deaths
ggplot(aes(x = number, y = p), data = df) + geom_point(shape = 1)

There’s presumably a simple fix to the original script that will take the range of the VALUE column into account and allow us to plot the funnel distribution lines appropriately? If anyone can suggest the fix, please let me know in a comment…;-)
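
One possible direction – a sketch rather than a tested fix, and not checked against the published SHMI funnel plots – is to treat VALUE as a standardised (observed/expected) ratio rather than a proportion, so that under a null value of 1 its standard error is roughly 1/sqrt(EXPECTED), and to draw the control limits from that instead of the binomial p.se calculation:

require(ggplot2)

#Sketch only: assumes VALUE behaves like an observed/expected ratio,
#with standard error approximately 1/sqrt(EXPECTED) around a null value of 1
hospitaldata = read.csv("~/Downloads/SHMI_10_10_2011.csv")
df = data.frame(number = hospitaldata$EXPECTED, p = hospitaldata$VALUE,
                Area = hospitaldata$PROVIDER.NAME)

#Control limit curves across the range of expected deaths
limits = data.frame(number = seq(min(df$number), max(df$number), length.out = 200))
limits$se = 1 / sqrt(limits$number)
limits$upper95 = 1 + 1.96 * limits$se
limits$lower95 = 1 - 1.96 * limits$se
limits$upper998 = 1 + 3.09 * limits$se
limits$lower998 = 1 - 3.09 * limits$se

ggplot(aes(x = number, y = p), data = df) + geom_point(shape = 1) +
  geom_line(aes(y = upper95), data = limits) + geom_line(aes(y = lower95), data = limits) +
  geom_line(aes(y = upper998), data = limits) + geom_line(aes(y = lower998), data = limits) +
  geom_hline(yintercept = 1)

Whether those limits correspond to the bandings used in the official report would still need checking against the SHMI methodology document, so treat the chart as indicative only.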

Sports Data Journalism and “Datatainment”

Over the last couple of years, you’ve probably noticed that data has become a Big Thing in commerce (Big Data for business advantage) as well as in the openness/transparency community, with governments and the media joining the party particularly in the context of the latter. But if you’re looking to develop data journalism skills, it’s probably also worth remembering the area of sports journalism, and the wealth of data produced around sporting events.

Part of the attraction of developing learning activities around sports data is that there’s a good chance that it’ll keep on delivering… If you develop a way of analysing or displaying sports data that pulls out interesting features or story elements from a set of sports data, you should be able to keep on using it… To set the scene, here’s an example: Driven By Data: Data Journalism in Sports. For a peek at my own fumblings, I’ve started exploring the automatic creation of F1DataJunkie Stats Graphics reports (still a lot to be done, but it’s a start…)

In the extreme case, you might be able to generate story outlines, or even canned prose… For example, in certain computer games in the sports genre, you might find you’re playing a game along to a “live commentary”, generated from the data being produced by the game. Automatic commentary generation is a form of sports journalism. And automated article generation is already here, as @RobbieAllen describes in How I automated my writing career, a brief overview of Automated Insights, a company that specialises in computer generated visualisations and prose.

See also: Automated Storytelling in Sports: A Rich Domain to Be Explored, Automated Event Recognition for Football Commentary Generation, Three RoboCup Simulation League Commentator Systems, and so on…

Getting hold of data is always an issue, of course, but I suspect that many larger newsrooms will take a subscription to the Press Association sports data feeds, for example…

Anyway, as an exercise, here’s some data to start with, from the Guardian datastore: Premier League’s top scorers: who is scoring the most goals? Is there a correlation with age, perhaps? (Where would you find the age data…?)

As well as sports reporting, I think we’re also likely to see an increase in what Head of Digital at Manchester City FC, Richard Ayers, refers to as datatainment: “where you use data as the primary source of entertainment. You might choose to make the visualisation of raw data entertaining or perhaps use data visualisation as part of the process of entertainment – but there’s definitely a strong editorial control which is focussed on entertaining the audience rather than exposing data.” (Data? Entertainment? You need Datatainment and Defining Data Visualisation, Data Journalism & Data Entertainment).

Devices such as FanVision already blend video and audio streams with data feeds, for example, more and more sports have “live stats apps” associated with them, and it’s not hard to imagine the data crunching that goes on under the hood in things like Optiplay making an appearance on sports analysis and review sites?

I also think that the “data as entertainment” line might work well as a second screen activity. Things like the F1 Live Timing app already demonstrate this:

On the other hand, there’s an opportunity for data focussed sites that go into deep analysis for the hardcore fan. Again looking at Formula One, the Intelligent F1 blog features a data-powered model developed by a rocket scientist that provides engagement around a particular race over an extended period, from predicting Sunday race behaviour based on Friday practice data and previous outings, through analysis of practice and qualifying data, to a detailed series of post-race analyses. (Complement this with the technical analyses applied to the cars on Scarbs F1, and you have the ultimate F1 geek’s paradise!;-)

PS This also caught my eye: Gametime [Assistant]: Girls’ Lacrosse Game Data, which steps through the design of a “datatainment” app…

PPS as the Lacrosse app suggests, the data collection thing can also improve engagement with a live event. For example, my own doodlings around a motorsport lapcharting app (Thoughts on a Couple of Possible Lap Charting Apps, initial code experiment)