Tag Archives: regex

Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more

Regular Expressions slogan t-shirt

Image by Lasse Havelund

In the second part of this extract from Chapter 10 of Scraping for Journalists I recap the basics before discussing techniques to use in looking for patterns in data, and how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’. You can find the first part here.

 

Continue reading

How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper

Regular Expressions cartoon on xkcd

Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex – which is designed to look for ‘regular expressions’ such as specific words, prefixes or particular types of code. I hope you find it useful. 

This tutorial will show you how to scrape a particularly badly formatted piece of data. In this case, the UK Labour Party’s publication of meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming. Continue reading

Review: Yahoo! Pipes tutorial ebook

Pipes Tutorial ebook

I’ve been writing about Yahoo! Pipes for some time, and am consistently surprised that there aren’t more books on the tool. Pipes Tutorial – an ebook currently priced at $14.95 – is clearly aiming to address that gap.

The book has a simple structure: it is, in a nutshell, a tour around the various ‘modules’ that you combine to make a pipe.

Some of these will pull information from elsewhere – RSS feeds, CSV spreadsheets, Flickr, Google Base, Yahoo! Local and Yahoo! Search, or entire webpages.

Some allow the user to input something themselves – for example, a search phrase, or a number to limit the type of results given.

And others do things with all the above – combining them, splitting them, filtering, converting, translating, counting, truncating, and so on.

When combined, this makes for some powerful possibilities – unfortunately, its one-dimensional structure means that this book doesn’t show enough of them.

Modules in isolation

While the book offers a good introduction into the functionality of the various parts of Yahoo! Pipes, it rarely demonstrates how those can be combined. Typically, tutorial books will take you through a project that utilises the power of the tools covered, but Pipes Tutorial lacks this vital element. Sometimes modules will be combined in the book but this is mainly done because that is the only way to show how a single module works, rather than for any broader pedagogical objective.

At other times a module is explained in isolation and it is not explained how the results might actually be used. The Fetch Page module, for example – which is extremely useful for scraping content from a webpage – is explained without reference to how to publish the results, only a passing mention that the reader will have to use ‘other modules’ to assign data to types, and that Regex will be needed to clean it up.

Continue reading