Monthly Archives: November 2012

The US election was a wake up call for data illiterate journalists

So Nate Silver won in all 50 states; big data was the winner; Nate Silver and data won the election. And somewhere along the line some guy called Obama won something, too.

Elections set the pace for much of journalism’s development. Predictable enough to allow for advance planning and big enough to justify the budgets to match, they are the stage on which news organisations do their growing up in public.

For most of the past decade, those elections have been about social media: the YouTube election; the Facebook election; the Twitter election. This time, it wasn’t about the campaigning (yet) so much as it was about the reporting. And how stupid some reporters ended up looking.

How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper

Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful technique in scraping – regex, short for ‘regular expressions’ – which lets you look for patterns such as specific words, prefixes or particular types of code. I hope you find it useful.

This tutorial will show you how to scrape a particularly badly formatted piece of data: in this case, the UK Labour Party’s publication of its meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming languages as well, but OutWit Hub is a good way to learn it without knowing any other programming.
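
OutWit Hub gives you a point-and-click interface for regex, but the patterns themselves work the same way in any tool or language. As a rough sketch of the idea – the HTML snippet and the names in it are invented for illustration, not taken from the Labour Party data – here is how the same kind of pattern looks in Python:

    import re

    # Invented sample of the kind of inconsistently formatted HTML the
    # chapter deals with: no tidy tags or classes for a scraper to anchor on.
    ugly_html = """
    <p>12 January 2012 &ndash; Dinner with John Smith (donor)</p>
    <p>3 Feb 2012 - Meeting  with  Jane Doe, general secretary</p>
    """

    # One pattern, three capture groups: a date, the type of event,
    # and the attendee's name.
    pattern = re.compile(
        r"(\d{1,2}\s+\w+\s+\d{4})"     # date, e.g. '12 January 2012'
        r".*?"                         # whatever junk separates the fields
        r"(Dinner|Meeting)\s+with\s+"  # the event type
        r"([A-Z]\w+\s+[A-Z]\w+)"       # a two-word proper name
    )

    for date, event, name in pattern.findall(ugly_html):
        print(date, event, name)

    # Prints:
    # 12 January 2012 Dinner John Smith
    # 3 Feb 2012 Meeting Jane Doe

The non-greedy .*? is what makes this approach tolerant of ugly markup: it skips over whatever inconsistent punctuation or stray characters sit between the fields, without you having to describe them in advance.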

Data alone isn’t enough – Tim Davies on “complexity and complementarity”

If people aren’t using data, it isn’t just a problem for web developers – it’s a problem for journalists too. If not enough people are looking at information on crime, politics, health, education or welfare, our work becomes harder.

On that subject, Tim Davies writes about the challenges of ‘getting data used’ and the inclination to focus on data-centric solutions. “Data quality, poor meta-data, inaccessible language, and the difficulty of finding wheat amongst the chaff of data were all diagnosed [at one hack day] as part of the problem,” he reports. “Yet these diagnoses and solutions are still based on linear thinking: when a dataset is truly accessible, then it will be used, and economic benefits will flow.”