Category Archives: databases

Peru data journalism project Convoca launches interactive tool on mining infractions

Screenshot of Convoca map

Peruvian news organisation Convoca has launched an interactive tool to enable citizens to access environmental information related to the behaviour of Peruvian mining companies.

The tool maps more than one thousand sanction resolutions issued by a Peruvian environmental supervisory body to penalise infractions committed by 132 companies.

How to: clean a converted PDF using OpenRefine

Our initial table

This spreadsheet sent in response to an FOI request appeared to have been converted from PDF format

In a guest post for OJB, Ion Mates explains how he used OpenRefine to clean up a spreadsheet which had been converted from PDF format. An earlier version of this post was published on his blog.

Journalists rarely get their hands on nice, tidy data: public bodies don’t have an interest in providing information in a structured form. So it is increasingly part of a journalist’s job to get that information into the right state before extracting patterns and stories.

A few months ago I sent a Freedom of Information request asking for the locations of all litter bins in Birmingham. But instead of sending a spreadsheet downloaded directly from their database, they sent one that appeared to have been converted from a multiple-page PDF.

This meant all sorts of problems, from rows containing page numbers and repeated header rows, to information split across multiple rows and even pages.

In this post I’ll be taking you through how I used the free data cleaning tool OpenRefine (formerly Google Refine) to tackle all these problems and create a clean version of the data.
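
The problems listed above (page-number rows, repeated header rows, and records split across rows) can also be illustrated in code. What follows is a minimal Python sketch, not the OpenRefine workflow the post describes. The column names, the sample rows, the `clean_rows` helper and the rule that a blank first column marks a continuation row are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a row whose only content is something like "Page 3".
PAGE_ROW = re.compile(r"page\s+\d+", re.IGNORECASE)


def clean_rows(rows, header):
    """Return data rows with PDF-conversion debris removed.

    Assumptions (illustrative only): repeated header rows match `header`
    exactly, page-number rows contain a single cell such as "Page 3", and
    a row with an empty first column continues the previous record.
    """
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        non_empty = [cell for cell in cells if cell]

        if cells == header or not non_empty:
            continue  # repeated header row or completely blank row
        if len(non_empty) == 1 and PAGE_ROW.fullmatch(non_empty[0]):
            continue  # page-number row left over from the PDF

        if not cells[0] and cleaned:
            # Continuation row: append each non-empty cell to the same
            # column of the previous record.
            previous = cleaned[-1]
            cleaned[-1] = [
                f"{old} {new}".strip() if new else old
                for old, new in zip(previous, cells)
            ]
            continue

        cleaned.append(cells)
    return cleaned


# Hypothetical sample data, standing in for the converted spreadsheet.
header = ["id", "location", "ward"]
raw = [
    ["id", "location", "ward"],
    ["1", "High Street, outside", "Ladywood"],
    ["", "library", ""],
    ["Page 1", "", ""],
    ["id", "location", "ward"],
    ["2", "Station Road", "Moseley"],
]

for record in clean_rows(raw, header):
    print(record)
```

Real converted spreadsheets rarely follow rules this tidy, so each rule would need checking against the actual file before relying on it.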

The 10 most-read posts (and one page) on the Online Journalism Blog in 2014

ojb post frequency 2014

The last 2 months of 2014 saw a return to regular blogging after some quiet periods earlier in the year

2014 was the 10th anniversary of the Online Journalism Blog, so I thought I’d better begin keeping track of what each year’s most-read posts were.

In 2014 the overriding themes for this blog were programming for journalists, web security, and social media optimisation. Here are the most-read posts of the year, plus one surprisingly popular new page with some background and updates.

FAQ: Do you need new ethics for computational journalism?

This latest post in the FAQ series answers questions posed by a student in Belgium regarding ethics and data journalism.

Q: Do ethical issues in the practice of computational journalism differ from those of “traditional” journalism?

No, I don’t think they do particularly – any more than ethics in journalism differ from ethics in life in general. However, as in journalism versus life, there are areas which attract more attention because they are the places we find the most conflict between different ethical demands.

For example, the tension between public interest and an individual’s right to privacy is a general ethical issue in journalism, but one with particular salience in data journalism, where you’re dealing with data which names individuals.

I wrote about this in a book chapter which I’ve published in parts on the blog.

That massive open online course on data journalism now has a start date

In case you haven’t seen the tweets and blog posts, that MOOC on data journalism I’m involved in has a start date: May 19.

The launch was delayed a little due to the number of people who signed up – which I think was a sensible decision.

You can watch the introduction video above, or ‘meet the instructors’ below. Looking forward to this…

Why open data matters – a (very bad) example from Universal Jobmatch

Open Data stickers image by Jonathan Gray

I come upon examples of bad practice in publishing government data on a regular basis, but the Universal Jobmatch tool is an example so bad I just had to write about it. In fact, it’s worse than the old-fashioned data service that preceded it.

That older service was the Office for National Statistics’ labour market service NOMIS, which published data on Jobcentre vacancies and claimants until late 2012, when Jobcentre Plus was given responsibility for publishing the data using their Universal Jobmatch tool.

Despite a number of concerns, more than a year on, Universal Jobmatch’s reports section has ignored at least half of the public data principles first drafted by the Government’s Public Sector Transparency Board in 2010, and published in 2012.

Saving the evidence in Ukraine: collaborate first – or you won’t be able to ask questions later

YanukovychLeaks screengrab

“The reporters then did something remarkable. They made a decision to cooperate among all the news organizations and to save first and report later.

“It wasn’t an easy decision. But it was clear that if they didn’t act, critical records of their own country’s history could be lost. The scene was already filling with other reporters eager to grab what stories they could and leave. In contrast, the group was joined by a handful of other like-minded journalists: Anna Babinets of Slidstvo/TV Hromadske; Oleksandr Akymenko, formerly of Forbes; Katya Gorchinska and Vlad Lavrov of the Kyiv Post. Radio Free Europe reporter Natalie Sedletska returned from Prague so she could help, and others came, too.

“… In the tense situation that characterizes Ukraine, conspiracies form quickly. To demonstrate their transparency, the organizers quickly moved to get documents up. By early Tuesday, nearly 400 documents, a fraction of the estimated 20,000 to 50,000 documents, had been posted. Dozens more are being added by the hour.”

Drew Sullivan writes about Yanukovych Leaks.