Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more

Regular Expressions slogan t-shirt

Image by Lasse Havelund

In the second part of this extract from Chapter 10 of Scraping for Journalists I recap the basics before discussing techniques to use in looking for patterns in data, and how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’. You can find the first part here.
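
As a rough illustration of those three ideas outside OutWit Hub, here is a minimal sketch in Python. It is not taken from the book: the sample text and variable names are invented for the example, and OutWit Hub's own regex fields may behave slightly differently.

```python
import re

# Hypothetical sample text, invented purely for illustration.
sample = "Name:\tJane Doe\r\nRefs: A12, B7, 345"

# \s matches any whitespace character: spaces, tabs, carriage returns, newlines.
name = re.search(r"Name:\s+(\w+\s\w+)", sample).group(1)

# A backslash is a special character in regex, so a literal one is escaped as \\.
path_parts = re.split(r"\\", "C:\\data\\donors.csv")

# A negated character class, [^...], matches anything EXCEPT the listed characters.
# Here: runs of characters that are neither commas nor whitespace.
refs = re.findall(r"[^,\s]+", "A12, B7, 345")

print(name)
print(path_parts)
print(refs)
```

The negated class is only one way to express a 'negative match'; which approach suits a given page depends on how messy the surrounding HTML is.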

 


The US election was a wake up call for data illiterate journalists

So Nate Silver won in 50 states; big data was the winner; and Nate Silver and data won the election. And somewhere along the line some guy called Obama won something, too.

Elections set the pace for much of journalism’s development: predictable enough to allow for advance planning, and big enough to justify the budgets to match, they are the stage on which news organisations do their growing up in public.

For most of the past decade, those elections have been about social media: the YouTube election; the Facebook election; the Twitter election. This time, it wasn’t about the campaigning (yet) so much as it was about the reporting. And how stupid some reporters ended up looking.

How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper

Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex – which lets you describe patterns (‘regular expressions’) such as specific words, prefixes or particular types of code, and then search for anything that matches them. I hope you find it useful.

This tutorial will show you how to scrape a particularly badly formatted piece of data: in this case, the UK Labour Party’s publication of meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and in programming languages as well, but this tool is a good way to learn it without needing to know any programming.

Data alone isn’t enough – Tim Davies on “complexity and complementarity”

If people aren’t using data it isn’t just a problem for web developers – it’s a problem for journalists too. If not enough people are looking at information on crime, politics, health, education, or welfare then it makes our work harder.

On that subject, Tim Davies writes about the challenges of ‘getting data used’ and the inclination to focus on data-centric solutions. “Data quality, poor meta-data, inaccessible language, and the difficulty of finding wheat amongst the chaff of data were all diagnosed [at one hack day] as part of the problem,” he reports. “Yet these diagnosis and solutions are still based on linear thinking: when a dataset is truly accessible, then it will be used, and economic benefits will flow.

Hurricane Sandy: how does the media serve the public interest?

This tweet from Daniel Bentley deserves a post all on its own:

 

While some news organisations take down paywalls and others help sort hoax images from the genuine article, what role should ‘common carriers’ like Instagram play? Any at all?

Jon Bounds: Why I’m giving up Birmingham: It’s Not Shit

Some of the products available in the BiNS shop

Jon Bounds is one of Birmingham’s most established and best-known bloggers. In this guest post, cross-posted from his own blog, he explains why he’s auctioning off that site, the reasons he started it in the first place, and the problems with the ‘hyperlocal question’.

I started Birmingham: It’s Not Shit back in the May of 2002, before there were really such things as blogs in the mainstream and the term ‘hyperlocal’ was not even a glint in an irritating theorist’s eye.

Pretty much everything that’s ever been on it, and definitely everything technical, was written or created by me. I’ve had a couple of ‘columnists’ for short whiles and a couple of bits of ‘holiday cover’ but that’s all.

The site was flat, hand-coded HTML until I learned of PHP and wrote a simple news updating section. Later I discovered that there wasn’t only a name for such things but software out there to do it more prettily and better.

And now it, or sites like it, are either the future of the media or a disappointment to those that thought they should be.

But it didn’t start because the media was dying; it started because the media was crap at explaining why people connected emotionally with a place that, when looked at objectively, was a bit shit. Crap at self-awareness, crap at understanding real life. The media has changed a little, but mostly the contents have just shifted in transit.

Data visualisation training

If you’re interested in data visualisation I’m delivering a training course on November 7 with the excellent Caroline Beavon. Here’s what we’re covering:

  • Pick the right chart for your story – against a deadline
  • Mapping tricks and techniques: using Fusion Tables and other tools to map Olympic torchbearers
  • Picking the right data to visualise
  • Visualisation tips for free chart tools
  • Avoiding common visualisation mistakes
  • Create an infographic with Tableau and Illustrator
  • Making data interactive

More details here. Places can be booked here.

Circa news app shows liveblogging’s rise and rise

Circa news app – image from TechCrunch

Just how dominant can the liveblog format become? In Model for the 21st Century Newsroom Redux I noted how quickly the format has been adopted as a default mode of reporting (for example, how widely the format was being used to report on public sector strikes).

In March 2012 the relaunch of the ITV News website saw the format adopted as the default mode of presentation.

In August The Guardian’s horizontally-navigated Olympics liveblog caught my eye.

Now news app Circa is taking the concept back onto mobile (where most liveblogging starts), and adding a few twists, including push updates.