How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)

Cablesearch logo

CableSearch is a neat project by the European Centre for Computer Assisted Research and VVOJ (the Dutch-Flemish association for investigative journalists) which aims to make it easier for journalists to interrogate the Wikileaks cables. Although it’s been around for some time, I’ve only just noticed the site’s API, so I thought I’d show how such an API can be useful as a way to draw on such data sources to complement data of your own.

Gathering data: a flow chart for data journalists


Gathering data - a flow chart

Above is a flow chart that I sketched out during a long car journey to the Balkan Investigative Reporters Network Summer School in Croatia (don’t worry: I wasn’t driving).

It aims to help those doing data journalism identify how best to get hold of and deal with data by asking a series of questions about the information you want to compile and making suggestions on ways both to get hold of it and tools to then get it into a state which makes it easier to ask questions.

It also illustrates at a glance how the process of ‘getting hold of the data’ can vary widely, and how different projects can often involve completely different tools and skillsets from previous ones.

I will have missed obvious things, so please help me improve this. And if you find it useful, let me know.

Click on the image for other sizes.

Using Google Spreadsheets as a Database Source for R

I couldn’t contain myself (other more pressing things to do, but…), so I just took a quick time out and a coffee to put together a quick and dirty R function that will let me run queries over Google spreadsheet data sources and essentially treat them as database tables (e.g. Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

Here’s the function:

library(RCurl)

# Run a Google Visualisation API query against a public Google spreadsheet
# and return the result as a data frame. key identifies the spreadsheet;
# gid selects the sheet (0 = first sheet).
gsqAPI = function(key, query, gid=0){
  return( read.csv( paste( sep="",
    'http://spreadsheets.google.com/tq?',
    'tqx=out:csv',
    '&tq=', curlEscape(query),
    '&key=', key,
    '&gid=', gid) ) )
}

It requires the spreadsheet key value and a query; you can optionally provide a sheet number (the gid) if the sheet you want to query is not the first one.

We can call the function as follows:

gsqAPI('tPfI0kerLllVLcQw7-P1FcQ','select * limit 3')

In that example, and by default, we run the query against the first sheet in the spreadsheet.

Alternatively, we can make a call like this, and run a query against sheet 3, for example:
tmpData = gsqAPI('0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc', 'select A,C where B <= 10', 3)
tmpData

My first R function

The real question is, of course, could it be useful.. (or even OUseful?!)?

Here’s another example: a way of querying the Guardian Datastore list of spreadsheets:

gsqAPI('0AonYZs4MzlZbdFdJWGRKYnhvWlB4S25OVmZhN0Y3WHc','select * where A contains "crime" and B contains "href" order by C desc limit 10')

What that call does is run a query against the Guardian Datastore spreadsheet that lists all the other Guardian Datastore spreadsheets, and pulls out references to spreadsheets relating to “crime”.

The returned data is a bit messy and requires parsing to be properly useful… but I haven’t started looking at string manipulation in R yet. (So my question is: given a dataframe with a column containing things like <a href="http://example.com/whatever">Some Page</a>, how would I extract columns containing http://example.com/whatever or Some Page fields?)
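For what it’s worth, one way of answering that question with base R regular expressions might look like this (a sketch with made-up example data, not from the original post):

```r
# Hypothetical data frame with a column of HTML anchors
links <- data.frame(
  raw = c('<a href="http://example.com/whatever">Some Page</a>',
          '<a href="http://example.com/other">Other Page</a>'),
  stringsAsFactors = FALSE
)

# Pull out the href value and the link text using sub() with backreferences:
# the bracketed part of each pattern is captured and substituted back in.
links$url  <- sub('.*href="([^"]*)".*', '\\1', links$raw)
links$text <- sub('.*>([^<]*)</a>.*', '\\1', links$raw)
```

This is vectorised, so it fills both new columns in one pass over the column.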

[UPDATE: as well as indexing a sheet by sheet number, you can index it by sheet name, but you’ll probably need to tweak the function so that the URL ends with '&gid=', curlEscape(gid), so that things like spaces in the sheet name get handled properly. I’m not sure about this now… calling a sheet by name works when accessing the “normal” Google spreadsheets application, but I’m not sure it does for the chart query language call?]
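For reference, the tweaked version of the function might look something like this (untested, and it remains an open question whether the query endpoint accepts sheet names at all):

```r
library(RCurl)

# Build the query URL separately so it can be inspected without making a
# network call. The only change from the original function is that gid is
# passed through curlEscape(), so a sheet *name* containing spaces would be
# encoded properly. (Whether the endpoint actually accepts sheet names
# rather than sheet numbers is unverified.)
gsqURL = function(key, query, gid="0"){
  paste( sep="",
    'http://spreadsheets.google.com/tq?',
    'tqx=out:csv',
    '&tq=', curlEscape(query),
    '&key=', key,
    '&gid=', curlEscape(gid) )
}

gsqAPI2 = function(key, query, gid="0"){ read.csv( gsqURL(key, query, gid) ) }
```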

[If you haven’t yet discovered R, it’s an environment that was developed for doing stats… I use the RStudio environment to play with it. The more I use it (and I’ve only just started exploring what it can do), the more I think it provides a very powerful environment for working with data in quite a tangible way, not least for reshaping it and visualising it, let alone doing stats with it. (In fact, don’t use the stats bit if you don’t want to; it provides more than enough data mechanic tools to be going on with;-)]

PS By the by, I’m syndicating my Rstats tagged posts through the R-Bloggers site. If you’re at all interested in seeing what’s possible with R, I recommend you subscribe to R-Bloggers, or at least have a quick skim through some of the posts on there…

PPS The RSpatialTips post Accessing Google Spreadsheets from R has a couple of really handy tips for tidying up data pulled in from Google Spreadsheets, assuming the spreadsheet data has been loaded into ssdata: a) tidy up column names using colnames(ssdata) <- c("my.Col.Name1","my.Col.Name2",...,"my.Col.NameN"); b) if a column returns numbers as non-numeric data (e.g. as a string "1,000") in cols 3 to 5, convert it to numeric using something like: for (i in 3:5) ssdata[,i] <- as.numeric(gsub(",","",ssdata[,i])). [The last column can be identified as ncol(ssdata). You can do a more aggressive conversion to numbers (assuming no decimal points) using gsub("[^0-9]","",ssdata[,i]).]

PPPS Via the Revolutions blog, how to read a CSV file over https into R (unchecked):

require(RCurl)
# Fetch the raw file contents over https, then parse the in-memory text
myCsv = getURL(httpsCSVurl)
read.csv(textConnection(myCsv))

Has investigative journalism found its feet online? (part 3)

Previously this serialised chapter for the forthcoming book Investigative Journalism: Dead or Alive? looked at new business models surrounding investigative journalism and online investigative journalism as a genre. This third and final part looks at how changing supplies of information change the context within which investigative journalism operates.

What next for investigative journalism in a world of information overload?

But this identity crisis does highlight a final, important, question to be asked: in a world where users have direct access to a wealth of information themselves, what is investigative journalism for? I would argue that it comes down to the concept of “uncovering the hidden”, and in exploring this it is useful to draw an analogy with the general journalistic idea of “reporting the new”.

Trainee journalists sometimes see “new” in limited terms – as simply what is happening today. But what is “new” is not limited to that. It can also be what is happening tomorrow, or what happened 30 years ago. It can be something that someone has said about an “old story” days later, or an emerging anger about something that was never seen as “newsworthy” to begin with. The talent of the journalist is to be able to spot that “newness”, and communicate it effectively.

Journalism typically becomes investigative when that newness involves uncovering the hidden – and that can be anything that our audience couldn’t see before – it could be a victim’s story, a buried report, 250,000 cables accessible to 2.5 million people, or even information that is publicly available but has not been connected before (“the hidden” – like “the new” – is, of course, a subjective quality, dependent on the talent of a particular journalist for finding something in it – or a way of seeing it – that is newsworthy).

Has investigative journalism found its feet online? (part 2)

The first part of this serialised chapter for the forthcoming book Investigative Journalism: Dead or Alive? looked at new business models surrounding investigative journalism. This second part looks at how new ways of gathering, producing and distributing investigative journalism are emerging online.

Online investigative journalism as a genre

Over many decades print and broadcast investigative journalism have developed their own languages: the spectacular scoop; the damning document; the reporter-goes-undercover; the doorstep confrontation, and so on. Does online investigative journalism have such a language? Not quite. Like online journalism as a whole, it is still finding its own voice. But this does not mean that it lacks its own voice.

For some the internet appears too fleeting for serious journalism. How can you do justice to a complex issue in 140 characters? How can you penetrate the fog of comment thread flame wars, or the “echo chambers” of users talking to themselves? For others, the internet offers something new: unlimited space for expansion beyond the 1,000 word article or 30-minute broadcast; a place where you might take some knowledge, at least, for granted, instead of having to start from a base of zero. A more cooperative and engaged medium where you can answer questions directly, where your former audience is now also your distributor, your sub-editor, your source.

The difference in perception is largely a result of people mistaking parts for the whole. The internet is not Twitter, or comment threads, or blogs. It is a collection of linked objects and people – in other words: all of the above, operating together, each used, ideally, to their strengths, and also, often, in relationship to offline media.

Has investigative journalism found its feet online? (part 1)

Earlier this year I was asked to write a chapter for a book on the future of investigative journalism – ‘Investigative Journalism: Dead Or Alive?’. I’m reproducing it here. The chapter was originally published on my Facebook page. An open event around the book’s launch, with a panel discussion, is being held at the Frontline Club next month.

We may finally be moving past the troubled youth of the internet as a medium for investigative journalism. For more than a decade observers looked at this ungainly form stumbling its way around journalism, and said: “It will never be able to do this properly.”

They had short memories, of course. Television was an equally awkward child: the first news broadcast was simply a radio bulletin on a black screen, and for decades print journalists sneered at the idea that this fleeting, image-obsessed medium could ever do justice to investigative journalism. But it did. And it did it superbly, finding a new way to engage people with the dry, with the political, and the complex.

When will we stop saying “Pictures from Twitter” and “Video from YouTube”?

Image from YouTube


Over the weekend the BBC had to deal with the embarrassing ignorance of someone in their complaints department who appeared to believe that images shared on Twitter were “public domain” and “therefore … not subject to the same copyright laws” as material outside social networks.

A blog post, from online communities adviser Andy Mabbett, gathered thousands of pageviews in a matter of hours before the BBC’s Social Media Editor Chris Hamilton quickly responded:

“We make every effort to contact people, as copyright holders, who’ve taken photos we want to use in our coverage.

“In exceptional situations, ie a major news story, where there is a strong public interest in making a photo available to a wide audience, we may seek clearance after we’ve first used it.”

(Chris also published a blog post yesterday expanding on some of the issues, the comments on which are also worth reading)

The copyright issue – and the existence of a member of BBC staff who hadn’t read the Corporation’s own guidelines on the matter – was a distraction. What really rumbled through the 170+ comments – and indeed Andy’s original complaint – was the issue of attribution.


A quick note to Louise Mensch: sunlight is the best disinfectant

Plenty of others have given their own opinion on MP Louise Mensch’s suggestion that authorities should be able to shut down social media during civil unrest, so I just want to add a couple of experiences:

Here’s the first: when rumours spread about children being kidnapped in supermarket toilets, they first spread by text message (not social media). When they spread via the semi-public Facebook, it was easier for others to raise questions or debunk them. On Twitter – a much more public medium – it seemed even harder for rumour to get a foothold.

I’ve written before about similar rumours and how journalists can and do play a role in debunking them.

I’ve also written about the potential for automated debunking. The less ‘social’ a medium, the harder it is to create these automated services, and the harder it is to distribute facts.

Finally, I’ve written about how journalists can use the qualities of social media itself to more easily separate rumour from fact.

Gossip and rumour don’t need social media to spread. Removing social media – in my experience (and that of the police, apparently) – just makes it harder to spot, and debunk.