Tag Archives: regex

What are regular expressions — and how to use them in Google Sheets to get data from text

In an extract from a new chapter in the ebook Finding Stories in Spreadsheets, I explain what regular expressions are — and how they can be used to extract information from spreadsheets. The ebook version of this tutorial includes a dataset and exercise to employ these techniques.

The story was an unusual one: the BBC Data Unit had been given access to a dataset on more than 200,000 works of art in galleries across the UK. What patterns could we find in the data that would allow us to tell a story about the nature of the nation’s paintings?

Some of the data was straightforward to work with: the ‘artist’ column was relatively clean, and allowed us to identify the most common male and female artist. It turned out that the latter – the Victorian botanist Marianne North – was relatively unknown. So, that was one story we could tell.

ukart

But other parts of the data were more problematic. The date column, for example, contained inconsistently formatted data: in the majority of cases a specific year had been entered, but in many others the data contained text such as “18th century” or “1900-1920” or “1800s”.

We also noticed that monarchs featured heavily in the art – but understandably there was no column that was specifically dedicated to classifying those. If we wanted to identify the most-painted monarchs we would have to create new data that somehow extracted those names from the paintings’ titles.

These problems – extracting data from existing data, particular text data – are what regular expressions are designed for. In this chapter I will explain what regular expressions are, and how to use them in spreadsheets.

Continue reading

Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more

Regular Expressions slogan t-shirt

Image by Lasse Havelund

In the second part of this extract from Chapter 10 of Scraping for Journalists I recap the basics before discussing techniques to use in looking for patterns in data, and how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’. You can find the first part here.

 

Continue reading

How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper

Regular Expressions cartoon on xkcd

Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex – which is designed to look for ‘regular expressions’ such as specific words, prefixes or particular types of code. I hope you find it useful. 

This tutorial will show you how to scrape a particularly badly formatted piece of data. In this case, the UK Labour Party’s publication of meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming. Continue reading

Review: Yahoo! Pipes tutorial ebook

Pipes Tutorial ebook

I’ve been writing about Yahoo! Pipes for some time, and am consistently surprised that there aren’t more books on the tool. Pipes Tutorial – an ebook currently priced at $14.95 – is clearly aiming to address that gap.

The book has a simple structure: it is, in a nutshell, a tour around the various ‘modules’ that you combine to make a pipe.

Some of these will pull information from elsewhere – RSS feeds, CSV spreadsheets, Flickr, Google Base, Yahoo! Local and Yahoo! Search, or entire webpages.

Some allow the user to input something themselves – for example, a search phrase, or a number to limit the type of results given.

And others do things with all the above – combining them, splitting them, filtering, converting, translating, counting, truncating, and so on.

When combined, this makes for some powerful possibilities – unfortunately, its one-dimensional structure means that this book doesn’t show enough of them.

Modules in isolation

While the book offers a good introduction into the functionality of the various parts of Yahoo! Pipes, it rarely demonstrates how those can be combined. Typically, tutorial books will take you through a project that utilises the power of the tools covered, but Pipes Tutorial lacks this vital element. Sometimes modules will be combined in the book but this is mainly done because that is the only way to show how a single module works, rather than for any broader pedagogical objective.

At other times a module is explained in isolation and it is not explained how the results might actually be used. The Fetch Page module, for example – which is extremely useful for scraping content from a webpage – is explained without reference to how to publish the results, only a passing mention that the reader will have to use ‘other modules’ to assign data to types, and that Regex will be needed to clean it up.

Continue reading