Monthly Archives: July 2010

An introduction to data scraping with Scraperwiki

Last week I spent a day playing with the screen scraping website Scraperwiki with a class of MA Online Journalism students and a local blogger or two, led by Scraperwiki’s own Anna Powell-Smith. I thought I might take the opportunity to try to explain what screen scraping is through the functionality of Scraperwiki, in journalistic terms.

It’s pretty good.

Don't stop us digging into public spending data

A disturbing discovery by Chris Taggart last week: a number of councils in the UK are handing over their ‘open’ data to a company which only allows it to be downloaded for “personal” use.

As Chris himself points out, this runs completely against the spirit of the push to release public data in a number of ways:

  • Data cannot be used for “commercial gain”. This includes publishers wanting to present the information in ways that make most sense to the reader, and startups wanting to find innovative ways to involve people in their local area. Oh, and that whole ‘Big Society’ stuff.
  • The way the sites are built means you couldn’t scrape this information with a computer anyway.
  • It’s only a part of the data. “Download the data from SpotlightOnSpend and it’s rather different from the published data [on the Windsor & Maidenhead site]. Different in that it is missing core data that is in W&M published data (e.g. categories), and that includes data that isn’t in the published data (e.g. data from 2008).”

It’s a worrying path. As Chris sums it up: “Councils hand over all their valuable financial data to a company which aggregates for its own purposes, and, er, doesn’t open up the data, shooting down all those goals of mashing up the data, using the community to analyse and undermining much of the good work that’s been done.”

The Transparency Board quickly issued a statement on the issue, saying that “urgent” measures are being taken to rectify the problem.

And Spikes Cavell, who make the software, responded in Information Age, pointing out that “it is first and foremost a spend analysis software and consultancy supplier, and that it publishes data through SpotlightOnSpend as a free, optional and supplementary service for its local government customers. The hope is that this might help the company to win business, he explains, but it is not a money-spinner in itself.”

They are now promising to make the data available for download in its “raw form”, although it’s not clear what that will be. Adrian Short’s comment to the piece is worth reading.

Nevertheless, this is an issue that anyone interested in holding power to account should keep a close eye on. To that end, Chris has started an investigation on Help Me Investigate to find out how and why councils are giving access to their spending data. Please join it and help here.

(Comment or email me on paul at helpmeinvestigate.com if you want an invitation.)

77,000 pageviews and multimedia archive journalism (MA Online Journalism multimedia projects pt4)

(Read part 1 here; part 2 here and part 3 here)

The ‘breadth portfolio’ was only worth 20% of the Multimedia Journalism module, and was largely intended to be exploratory, but Alex Gamela used it to produce work that most journalists would be proud of.

Firstly, he worked with maps and forms to cover the Madeira Island mudslides:

“When on the 20th of February a storm hit Madeira Island, causing mudslides and floods, the silence on most news websites, radios and TV stations was deafening. But on Twitter there were accounts from local people about what was going on, and, above all, they had videos. The event was being tagged as #tempmad, so it was easy to follow all the developments, but the information seemed to be too scattered to get a real picture of what was going on in the island, and since there was no one organizing the information available, I decided to create a map on Google, to place videos, pictures and other relevant information.

“It got 10,000 views in the first hours and reached 30,000 in just two days. One month later, it has the impressive number of 77 thousand visits.”

Not bad, then.

Secondly, Alex experimented with data visualisation to look at newspaper brand values and the online traffic of Portuguese news websites.

“My goal was to understand the relative and proportional position of each one, regarding visits, page views, and how those two values relate to each other. The data I got also has portals, specialized websites, and entertainment magazines so it has a broad range of themes (all charts are available live here – http://is.gd/aZLXs)”

And finally, he produced a beautiful Flash interactive on Moseley Road Baths (which he talks about here).

All of which was produced and submitted within the first six weeks of the Multimedia Journalism module.

The other 80%: multimedia archive journalism

Alex was particularly interested in archive journalism and using multimedia to bring archives to life. As a way of exploring this he produced the Paranoia Timeline, a website exploring “all the events that caused some type of social hysteria throughout the world in the last 20 years.

“Some of the situations presented here were real dangers, others not really. But all caused disturbances in our daily lives … Why does that happen? Why are we caught in these bursts of information, sometimes based on speculative data and other times borne out of the imagination of few and fed by the beliefs of many?”

The site – which is an ongoing project in its earliest stages – combines video, visualisation, a Dipity timeline, mapping and the results of some fascinating data and archive journalism. Alex explains:

“The swine flu data came from Wolfram-Alpha that generated a rather reliable (after cross checking with other official websites) amount of data, with the number of cases and deaths per country. I had to make an option about which would be highlighted, but discrepancies in the logical amount of cases between countries made me go just for the death numbers. The conclusion that I got from the map is that swine flu was either more serious or reported in the developed countries. Traditionally considered Third World countries do not have many reports, which reflect the lack of structures to deal with the problem or how overhyped it was in the Western world. But France on its own had almost 3 million cases reported against 57 thousand in the United States, which led me to verify closely other sources. It seems Wolfram Alpha had the number wrong, there were only about 5000 reports, which proves that outliers in data are either new stories or just input errors.

“For the credit crunch, I researched the FDIC – Federal Deposit Insurance Corporation database. They have a considerable amount of statistical data available for download. My idea was to chart the evolution of loans in the United States in the last years, and the main idea was that overall loans slowed down since 2009 but individual credits rose, meaning an increase in personal debt to cope with overall difficulties caused by the crunch. I selected the items that seemed more relevant and went for a simple line chart. My purpose was served.”

“Though the current result falls short of my initial goals,” says Alex, “it is a prototype for a more involving experience, and I consider it to be a work in construction. What I’ll be defending here is a concept with a few examples using interactive tools, but I realize this is just a small sample of what it can really be: an immersive, ongoing project, with more interactive features, providing a journalistic approach to issues highly debated and prone to partisanship, many of them used by religious and political groups to spin their own ideologies to the general audience. The purpose is to create context.”

Alex is currently back in Portugal as he completes the final MA by Production part of his Masters. You might want to hire him, or Caroline, Dan, Ruihua, Chiara, Natalie or Andy.

Using data to scrutinise local swimming facilities (MA Online Journalism multimedia projects pt3)

(Read part 1 here and part 2 here)

The third student to catch the data journalism bug was Andy Brightwell. Through his earlier reporting on swimming pool facilities in Birmingham, Andy had developed an interest in the issue, and wanted to use data journalism techniques to dig further.

The result was a standalone site – Where Can We Swim? – which documented exactly how he did that digging, and presented the results.

He also blogged about the results for social media firm Podnosh, where he has been working.

Announcing the Birmingham Hacks & Hackers day

If you are a journalist, blogger or developer interested in the possibilities of public data, I’d be very happy if you came to a Hack Day I’m involved in, here in Birmingham on Friday July 23.

The idea is very simple: we get a bunch of public data, and either find stories in it, or ways to help others find stories.

You don’t need technical expertise because that’s why the hackers are there; and you don’t need journalistic expertise because that’s why the hacks are there.

What I’m particularly excited about in Birmingham is that we’ve got a real mix of people coming – from press and broadcast, and local bloggers, and hopefully a mix of people with backgrounds in various programming languages and even gaming.

And apart from all that there should be free beer and pizza. Which is the important thing.

So come.

The day is being organised by Scraperwiki and we’ve already got a whole bunch of interesting people signed up.

You can register for the day here.

Local history as a game (MA Online Journalism multimedia projects pt2)

Following on from the previous post on serious music journalism using data, here’s some more detail on how MA Online Journalism students have been exploring multimedia journalism.

Using data to shed light on dangers for cyclists

Dan Davies explored video and mapping audio before catching the data bug – in this case, around cycling collisions. Like Caroline, he gathered data from a range of sources, including media reports, an RSS feed from FixMyStreet, another RSS feed from Google News, Freedom of Information requests – and getting out there and collecting it himself.

He’s visualised the data in a range of ways at Birmingham Cycle Data, using tools such as Yahoo! Pipes and ManyEyes, and collaborated with cycling communities too. The results provide a range of insights into transport issues for cyclists.

Music journalism and data (MA Online Journalism multimedia projects pt1)

I’ve just finished looking at the work from the Diploma stage of my MA in Online Journalism, and – if you’ll forgive the effusiveness – boy is it good.

The work includes data visualisation, Flash, video, mapping and game journalism – in short, everything you’d want from a group of people who are not merely learning how to do journalism but exploring what journalism can become in a networked age.

But before I get to the detail, a bit of background…