How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper

Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex – which lets you describe a pattern (a ‘regular expression’) to look for, such as specific words, prefixes or particular types of code. I hope you find it useful.

This tutorial will show you how to scrape a particularly badly formatted piece of data: in this case, the UK Labour Party’s published list of meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming.

Introducing Regex

One of the most powerful tools in a scraper’s arsenal is its ability to look for a particular pattern of words, digits or characters that recurs. In previous chapters we’ve been very specific about that pattern – but what if you can’t be that specific?

If that’s your problem then ‘regex’ – short for ‘regular expression’ – may be one solution.

Regex looks like a jumbled bunch of nonsense, but is in fact just a code.

Some of it makes sense, like this:

[a-z]

Which means ‘any lower case character from a to z’.

And this:

[0-9]

Which means ‘any digit from zero to nine’.
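If you want to try these two ranges outside OutWit Hub, here is a minimal sketch using Python's built-in re module. The sample text is made up for illustration and is not taken from any real page.

```python
import re

# Made-up sample text, purely for illustration.
sample = "Room 42b"

# [a-z] matches any single lower case letter from a to z.
lower_case = re.findall(r"[a-z]", sample)

# [0-9] matches any single digit from zero to nine.
digits = re.findall(r"[0-9]", sample)

print(lower_case)  # ['o', 'o', 'm', 'b']
print(digits)      # ['4', '2']
```

Note that the capital ‘R’ and the space are ignored by both patterns, because neither falls inside the ranges.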

And some of it makes no sense at all when you first see it. But hopefully that should change by the end of this chapter.

To show how it works, we’ll start with using it to specify a range of characters.

Using regex to specify a range of possible matches

Let’s look at the UK Labour Party’s publication of meetings and dinners with donors and trade union general secretaries – at http://www.labour.org.uk/list-of-meetings-with-donors-and-general-secretaries,2012-03-30

Let’s say you’re interested in any entries from June or July 2011.

Launch OutWit Hub, open up that URL, and go to the scrapers section.

Click New to create one. Give it a name.

In the first line of this new scraper, give it a description of ‘June July 2011’ and in the Marker before section type this regular expression:

/Ju[nl][ey]/

What does this mean? Firstly, the forward slashes at the start and end tell OutWit that this is regex. These are called delimiters, but don’t worry about the jargon.

Our regex proper then starts with two letters, and a further pair of letters inside some square brackets. Everything within square brackets represents character options. In other words, if there is a match for either of the lower case characters ‘n’ or ‘l’ (in the first square brackets) and that character is followed by either of the lower case characters ‘e’ or ‘y’ (in the second square brackets) then the scraper is happy.

Before those square brackets, however, we have non-optional characters which are not in square brackets: our scraper is looking for two of these: a capital ‘J’ followed by a lower case ‘u’.

Put another way, our regex is saying this:

look for ‘Ju’
followed by an n or l
followed by an e or y

In other words, any instance of ‘June’ in the webpage we’re scraping would generate a match here, as would ‘July’ – but not ‘june’ or ‘july’ because they start with lower case letters. To add these you could change your first letter J to another set of square brackets to provide two options like so: [Jj].
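The same pattern can be tested in Python as a rough check. This is only a sketch: the sample text is invented, and Python takes the pattern without the surrounding forward slashes that OutWit Hub expects.

```python
import re

# Invented sample text, purely for illustration.
text = "June, July, june, july, Jule, Juny"

# Capital J only: 'Ju', then n or l, then e or y.
strict = re.findall(r"Ju[nl][ey]", text)

# [Jj] accepts either an upper or a lower case first letter.
relaxed = re.findall(r"[Jj]u[nl][ey]", text)

print(strict)   # ['June', 'July', 'Jule', 'Juny']
print(relaxed)  # ['June', 'July', 'june', 'july', 'Jule', 'Juny']
```

The nonsense words ‘Jule’ and ‘Juny’ also match, because each set of square brackets is checked independently. That is unlikely to matter on a page of dates, but it is worth knowing.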

Because this regex has been entered in the Marker before column, this scraper will look for a match against any of those cases, and grab the data until whatever is specified in the Marker after column.

So, now enter the < character in Marker after (this will match the start of the first HTML tag that follows the text), and click Execute to run the scraper.

The results should look something like this:

2011 – Len McCluskey (UNITE)
2011 – John Hannett (USDAW)
2011 – George Guy (UCATT)

Now go back to your scraper and try some other variations – e.g. so that it only grabs ‘2011’ or ‘George’, or anything before ‘UNITE’.
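To see roughly what the scraper is doing with Marker before and Marker after, here is a hedged Python sketch. The HTML fragment and the scrape_between function are hypothetical, written only for this illustration: the real page's markup may differ, and this is not how OutWit Hub is implemented internally.

```python
import re

# Hypothetical fragment, invented for illustration only.
html = (
    "<p>June 2011 – Len McCluskey (UNITE)<br>"
    "July 2011 – John Hannett (USDAW)<br>"
    "May 2011 – Example Person (EXAMPLE)</p>"
)

def scrape_between(marker_before, marker_after, source):
    """Return the text after each regex match of marker_before,
    up to the next occurrence of the plain string marker_after."""
    results = []
    for match in re.finditer(marker_before, source):
        start = match.end()                      # text begins after the marker
        end = source.find(marker_after, start)   # and stops at the next marker
        if end != -1:
            results.append(source[start:end].strip())
    return results

rows = scrape_between(r"Ju[nl][ey]", "<", html)
for row in rows:
    print(row)
```

The ‘May’ entry is skipped because nothing in it matches the marker before, and the month itself is missing from each result because it was used as the marker.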

Catching the regular expression too

The problem, of course, is that we’re missing the actual words ‘June’ or ‘July’, because we used those as the marker before what we wanted to scrape.

To grab the expression that we’re looking for, rather than just using it as a marker before, use the Format column. This specifies the format of what we’re looking for rather than what comes before or after it – and it always uses regex.

Go back, then, to your scraper, and in that Format column type the same regex as before:

/Ju[nl][ey]/

…and delete the same regex from your Marker before column.

In the Marker after column enter:

2011

Now click Execute.

This time you will find the opposite problem: you now have a series of rows which go: June, June, June, July, July, and so on.

This is what’s happened: because you’ve asked for a match on anything beginning with Ju, followed by an n or l, and then followed by e or y, that’s all you’ve grabbed: the words June or July (specifically, when used with your marker after: words June or July that occur directly before ‘2011’ – other instances will not be grabbed).
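One rough way to mimic this combination in Python is to find every match for the pattern and keep only those followed by ‘2011’. This is a sketch under assumptions: the sample text is invented, and OutWit Hub may handle spacing between the match and the marker differently.

```python
import re

# Invented sample text, purely for illustration.
text = "June 2011, July 2011, June 2010, May 2011"

months = []
for match in re.finditer(r"Ju[nl][ey]", text):
    # Look at whatever follows the match, ignoring leading spaces.
    following = text[match.end():].lstrip()
    if following.startswith("2011"):
        months.append(match.group())

print(months)  # ['June', 'July']
```

Only the matched words are kept, which is why the results contain the months and nothing else.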

We need more regex to specify what else we want.

I want any character: the wildcard and quantifiers

One of the most useful characters in regex is the full stop. This is a wildcard, meaning that it can stand for any single character (just as a Joker is often used in cards) – except for a line break.

So the following:

f.ll

…can mean ‘fill’, ‘fall’, ‘full’ or ‘fell’ (or indeed ‘foll’ or any other combination). Note that this would grab the ‘foll’ bit of ‘follow’, for instance: it doesn’t care whether it grabs a complete word or not, unless you tell it to care.

You can use wildcards as much as you like, so you could also try:

f..l

…which would grab ‘feel’, ‘fowl’, and many other words, including the ones listed before.
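Both wildcard patterns can be checked against a made-up list of words with a short Python sketch:

```python
import re

# Made-up list of words, purely for illustration.
words = "fill fall full fell follow feel fowl"

# One wildcard: f, any single character, then 'll'.
one_wildcard = re.findall(r"f.ll", words)

# Two wildcards: f, any two characters, then 'l'.
two_wildcards = re.findall(r"f..l", words)

print(one_wildcard)   # ['fill', 'fall', 'full', 'fell', 'foll']
print(two_wildcards)  # ['fill', 'fall', 'full', 'fell', 'foll', 'feel', 'fowl']
```

In both cases ‘foll’ comes from the first four letters of ‘follow’, showing that the pattern is happy with part of a word.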

If we wanted to be less fussy, then, we could rewrite our regex for ‘June or July’ as follows:

Ju..

But remember that this would also grab parts of words, such as the first four characters of ‘Justice’. However, because our scraper is also limited to matches that come directly before the marker ‘2011’, that doesn’t cause a problem here.
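A quick Python sketch, using an invented sentence, shows the partial match on ‘Justice’ when no marker is there to filter it out:

```python
import re

# Invented sentence, purely for illustration.
text = "Justice was done in June and July"

# 'Ju' followed by any two characters.
matches = re.findall(r"Ju..", text)

print(matches)  # ['Just', 'June', 'July']
```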

Matching zero, one or more characters – quantifiers

This wildcard is even more powerful when combined with quantifiers. Quantifiers specify how many of a particular character you want to match: one, one or more, zero or more, or a particular number or range of numbers (such as between 2 and 4).

The plus sign, question mark and asterisk are key quantifiers if you want to be less fussy about the numbers of a particular character, e.g.:

.? – zero or one of any character
.* – zero or more of any character
.+ – one or more of any character

Note the subtle differences: the first two will work even if no character is there at all. The plus sign only finds a match if the character is there.
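The difference between the three quantifiers is easier to see side by side. The sketch below uses Python's re.fullmatch, which only succeeds if the pattern accounts for the whole string. The letter ‘a’ in front of each wildcard and the test strings are arbitrary choices for illustration.

```python
import re

candidates = ["a", "ab", "abc"]
patterns = [r"a.?", r"a.*", r"a.+"]

# For each pattern, list the candidates it matches in full.
results = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)

# a.? ['a', 'ab']          zero or one extra character
# a.* ['a', 'ab', 'abc']   zero or more extra characters
# a.+ ['ab', 'abc']        at least one extra character
```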

This can also be combined with letters and ranges like so:

e? – zero or one ‘e’
e* – zero or more ‘e’
e+ – one or more ‘e’
[0-9]+ – one or more digits
[a-z]* – zero or more lower case characters
[A-Z]? – zero or one upper case character
[a-z0-9A-Z]+ – one or more lower case characters, digits or upper case characters
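Here is a small Python sketch of several of these combinations at work. The sample sentence is invented for illustration.

```python
import re

# Invented sample sentence, purely for illustration.
text = "Meeting 42 with Len in 2011"

# One or more 'e' in a row.
runs_of_e = re.findall(r"e+", text)

# One or more digits in a row.
numbers = re.findall(r"[0-9]+", text)

# An optional capital followed by one or more lower case letters.
words = re.findall(r"[A-Z]?[a-z]+", text)

# One or more letters or digits in a row.
tokens = re.findall(r"[a-z0-9A-Z]+", text)

print(runs_of_e)  # ['ee', 'e']
print(numbers)    # ['42', '2011']
print(words)      # ['Meeting', 'with', 'Len', 'in']
print(tokens)     # ['Meeting', '42', 'with', 'Len', 'in', '2011']
```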

Curly brackets after a character can be used to specify how many characters together you want to look for:

e{2} – two ‘e’s, i.e. ‘ee’
e{2,} – at least two ‘e’s, which might be ‘ee’, ‘eee’, ‘eeeeeeee’ and so on.
e{2,4} – between two and four ‘e’s: ‘ee’, ‘eee’ or ‘eeee’ but not ‘eeeeee’
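The curly bracket quantifiers can be checked in the same way. This sketch again uses Python's re.fullmatch so that each test string has to match in full. Without that, a search for e{2,4} would still find four of the ‘e’s inside a longer run such as ‘eeeeee’.

```python
import re

candidates = ["e", "ee", "eee", "eeee", "eeeeee"]
patterns = [r"e{2}", r"e{2,}", r"e{2,4}"]

# For each pattern, list the candidates it matches in full.
results = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)

# e{2} ['ee']
# e{2,} ['ee', 'eee', 'eeee', 'eeeeee']
# e{2,4} ['ee', 'eee', 'eeee']
```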

If you want your scraper to grab more than just the month, then, you need to adapt the entry in that Format column to grab all characters after the month too. Change your regex in that column to this:

/Ju[nl][ey].*/

In the Marker after column enter:

<

This should now match ‘June’ or ‘July’, followed by zero or more of any character until it hits any <. Click Execute to see if it works.
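A rough Python equivalent is to cut the page into chunks at every < and then look for the pattern in each chunk. This is a sketch under assumptions: the HTML fragment is hypothetical and invented for illustration, and splitting on < only approximates the way OutWit Hub applies its Marker after.

```python
import re

# Hypothetical fragment, invented for illustration only.
html = (
    "<p>June 2011 – Len McCluskey (UNITE)<br>"
    "July 2011 – John Hannett (USDAW)<br>"
    "May 2011 – Example Person (EXAMPLE)</p>"
)

entries = []
# Splitting on '<' stands in for the Marker after column.
for chunk in html.split("<"):
    # 'June' or 'July', then zero or more of any character.
    match = re.search(r"Ju[nl][ey].*", chunk)
    if match:
        entries.append(match.group())

for entry in entries:
    print(entry)
```

This time each result starts with the month and carries on to the end of its chunk, so both the date and the name are captured.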

In the second part of this extract I look in more detail at techniques for finding patterns, and at how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’.