Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more

In the second part of this extract from Chapter 10 of Scraping for Journalists I recap the basics before discussing techniques to use in looking for patterns in data, and how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’. You can find the first part here.

 

 

3 questions: What characters, how many, where?

The basic structure now merits an overview. Regex consists of the following:

  • an identifier (the forward slashes that identify it as regex)
  • a pattern (the characters we’re looking for)
  • and a modifier (how many, for example – our quantifiers)

Amit Arora puts it this way: “To read (understand) or write a regular expression pattern, you should ask the three simple questions:

  • “What characters to match ?
  • “How many characters to match ?
  • “And where to match ?”

Using regex on an ugly page

Looking at the source HTML of our page on the Labour leader’s meetings with donors and trade union general secretaries, we can see it’s a pretty horrible piece of HTML which doesn’t use list tags or even paragraphs. Instead, each new line is created with the <br /> tag – a line break. Some use two, some use one, and because it’s a common tag, it’s used elsewhere on the page too. There are 82 of them in total.

In OutWit Hub, then, create a new scraper for this page with the marker before of:

>

…and the marker after of:

<br

And click Execute. This should give you four irrelevant lines followed by 42 containing the data you want. We could decide to stop here and clean this data up elsewhere – but I’m going to use this as a way to develop some knowledge of what regex can do.

What’s the pattern?

Our data may not have much structure in its HTML code – but it does have structure in the text between tags. Let’s take the longest of the lines:

<br />4 April 2011 &ndash;John Hannett (USDAW), Michael Leahy (Community), Len McCluskey (UNITE), Paul Kenny (GMB), Gerry Doherty (TSSA), Billy Hayes (CWU)&ndash; Dinner, House of Commons<br />

This is what we have:

  • A <br /> tag.
  • A number, which can be one or two digits.
  • A space – even invisible characters are important, as we’ll see.
  • A month – twelve possibilities, aside from typos.
  • Another space.
  • Four digits: a year.
  • A space.
  • HTML code for a dash: &ndash; followed by:
  • Names – normally in the format forename-space-surname. But we mustn’t rule out the possibility of titles, middle names and extra surnames elsewhere in the data.
  • Affiliations in brackets. These are not present for all entries.
  • Commas separating multiple guests.
  • HTML code for another dash: &ndash; followed by a venue.

Listing these yourself is a useful exercise as you can then write possible regex next to each item – or refer to them as you construct a regular expression.

In fact, based on what we know already we can start to construct the beginnings of some regex to extract the data. We already know that our marker before is this:

<br />

And we know that our data begins with either one number or two – the first part of a date.

We can express this in regex as follows:

/[0-9]+/

This, broken down, says:

  • / – the identifier that says ‘this is regex’. We don’t need these in the Format column because anything in this is assumed to be regex – but it’s a good habit to use them anyway.
  • [0-9] – any digit between 0 and 9. In other words, any number. This is our pattern.
  • + – the plus sign modifier specifies not just any number but ‘one or more’.
  • / – the identifier that indicates the end of the regular expression

Save, then click Execute and look at the results. You should now have a column full of numbers – the first parts of our dates. Remember that the Format column specifies exactly what you want, and all we’ve specified so far is ‘one or more digits’ after the marker <br />.
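If you want to try the same pattern outside OutWit Hub, here is a minimal sketch using Python's built-in re module. Python is an assumption on my part (OutWit needs none of this), the variable names are purely illustrative, and the sample line is the one quoted earlier. Note that Python does not wrap the pattern in forward slashes.

```python
import re

# Sample line copied from the page source discussed above.
line = ("<br />4 April 2011 &ndash;John Hannett (USDAW), Michael Leahy (Community), "
        "Len McCluskey (UNITE), Paul Kenny (GMB), Gerry Doherty (TSSA), "
        "Billy Hayes (CWU)&ndash; Dinner, House of Commons<br />")

# [0-9]+ means 'one or more digits'; search() returns the first match it finds.
first_number = re.search(r"[0-9]+", line).group()
print(first_number)
```

Because the pattern asks only for 'one or more digits', it stops as soon as it hits the space after the day of the month.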

We can now continue to build our regular expression to look for the rest of the date. But to do that, we need to be able to identify invisible characters like spaces.

Matching non-textual characters

As well as simple characters and digits such as those identified above, regex can also identify spaces, punctuation, and line breaks.

How do you identify a character that doesn’t exist? By combining certain characters with a backslash (note that this is different to the more common forward slash used in web addresses and as an identifier above). Here’s a list from OutWit’s own documentation on regex:

\r line break (carriage return)

\n Unix line break (line feed)

\t tab

\f page break (form feed)

\s any space character (space, tab, return, line feed, form feed)

\S any non-space character (any character not matched by \s)

\w any word character (a-z, A-Z, 0-9, _, and certain 8-bit characters)

\W any non-word character (all characters not included by \w, incl. returns)

\d any digit (0-9)

\D any non-digit character (including carriage return)

\b any word boundary (position between a \w character and a \W character)

\B any position that is not a word boundary

If some of this doesn’t make sense, look it up. For example, what makes a Unix line break different to a standard line break? There’s an answer on Stack Overflow – in short: it’s unlikely to be useful to you.
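To see a few of these backslash codes in action, here is a small sketch in Python's re module. This is an assumption about a convenient place to experiment rather than anything OutWit requires, and Python's definitions of these codes may differ slightly from OutWit's at the edges (for example with accented characters). The test string is made up for illustration.

```python
import re

# A made-up string containing a space, a tab and a line feed.
text = "4 April\t2011\nDinner"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters
spaces = re.findall(r"\s", text)       # each individual space character
non_digits = re.findall(r"\D+", text)  # runs of anything that is not a digit

print(digits)
print(words)
print(spaces)
print(non_digits)
```

The r before each pattern makes it a 'raw' string, so Python passes the backslash through to the regex engine untouched.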

The one we need is right there: a space. This is indicated by the following regex:

\s

We can now add that to our regex to create this:

/[0-9]+\s/

Already you can see why regex looks like gobbledegook to people when they first come across it: it’s easier written than read.

So breaking it down again, this says:

  • / – the identifier that says ‘this is regex’. We don’t need these in the Format column because anything in this is assumed to be regex – but it’s a good habit to get into.
  • [0-9] – any digit between 0 and 9. In other words, any number. This is our pattern.
  • + – the plus sign modifier specifies not just any number but ‘one or more’.
  • \s – followed by any space character
  • / – the identifier that indicates the end of the regular expression

Now we should be able to pick up speed a bit. We have the numbers followed by a space. After the space in our data we expect the name of a month – a word.

Rather than be too clever about this, we can start by simply saying ‘any series of letters’ and if we get dodgy results we can always tweak the regex.

You can say ‘any lower case letter’ with the regex [a-z] and ‘any upper case letter’ with [A-Z]. You can combine the two with [a-zA-Z].

And of course we can use a + modifier in the same way as we did with the numbers. Adding this to the end of our current regex creates the following:

/[0-9]+\s[a-zA-Z]+/

In your Format column add to your regex in the same way, and click Execute. Your scraper should now be grabbing both the number and the month, and thankfully every entry is as we hoped. So no tweaking is needed.

Cracking on, we now need to add another space before we specify an expression to represent a year. The space is, once again:

\s

And the year – well, we could say four digits, or one or more digits, or even the digits ‘20’ followed by two more digits. As always, let’s start simple and see what happens, and re-use the regex for ‘one or more digits’:

[0-9]+

Adding those two elements makes our regex look like this:

/[0-9]+\s[a-zA-Z]+\s[0-9]+/

Yep. Looks horrible. But once you’ve adapted your regex to look like that, and clicked Execute once more, you’ll see it’s worked: we have the years; no need to tweak.

In fact, we now have the whole date, which feels like a good point to take a break and look at some other parts of regex.
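The part-by-part approach can be replayed in a few lines of Python. This is only a sketch under the assumption that you have Python available. The list name stages is my own, and the sample line is a shortened copy of the one quoted earlier.

```python
import re

# Shortened copy of the sample line from the page source.
line = "<br />4 April 2011 &ndash;John Hannett (USDAW)<br />"

# Each stage adds one more piece to the previous pattern.
stages = [
    r"[0-9]+",                      # day of the month
    r"[0-9]+\s",                    # ...followed by a space
    r"[0-9]+\s[a-zA-Z]+",           # ...followed by the month
    r"[0-9]+\s[a-zA-Z]+\s[0-9]+",   # ...followed by a space and the year
]

matches = [re.search(pattern, line).group() for pattern in stages]
for pattern, match in zip(stages, matches):
    print(repr(pattern), "->", repr(match))
```

Printing each stage makes it easy to see which addition broke things if a later pattern suddenly stops matching.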

What if my data contains full stops, forward slashes or other special characters?

If you’re the inquisitive sort, you might have wondered what happens if you want to use regex to grab data that includes characters that regex uses for other purposes. For example, so far we’ve used a full stop to indicate ‘any character’, a question mark to indicate ‘zero or one’, forward slashes and backslashes, asterisks and plus signs, square brackets and curly ones.

If you want to specify any of these as characters rather than as the special instructions regex normally uses them for, they need to be ‘escaped‘ – with a backslash, just as the invisible characters were above.

‘Escaping’ is basically telling the computer ‘Don’t treat this as you normally would‘. To confuse things, this works both ways. The character ‘s’, on its own, normally means ‘the letter s’ in regex. Simple. But put a backslash before it and regex knows you mean ‘any space character’.

The character ‘.’ on its own, however, normally means ‘any character’. But put a backslash before it and regex knows you mean ‘a full stop’.

The best way to remember the difference is this:

  • If the backslash comes before a letter, assume it doesn’t mean that letter
  • If the backslash comes before a symbol, assume it means that symbol, literally

Escaping is common to many other tools and languages, so it’s worth getting your head around.

This is how you escape a full stop:

\.

If this wasn’t escaped, then it would be treated as a ‘special character’. Specifically, if not escaped the full stop would be treated as meaning ‘any character’. But when escaped, the full stop means ‘look for a full stop here’.

There are plenty of other special characters to escape too, including, of course, the backslash itself:

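\\

And the forward slash:

\/

The asterisk, plus sign and question mark need to be escaped, as does the hyphen when it sits inside square brackets:

\*

\+

\?

\-

As do normal, curly and square brackets:

\(

\)

\{

\}

\[

\]

And the dollar sign and what’s called the caret symbol:

\$

\^

A quicker way is to list the characters that need a backslash in front of them, like so: . $ * + ? - ^ \ ( ) { } [ ] | /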
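Here is a short sketch of escaping at work, written in Python's re module (an assumption: OutWit's Format column takes the same kind of pattern, but I have only run this in Python). The names are taken from the sample line, and re.escape() is a standard Python helper that adds the backslashes for you.

```python
import re

sample = "John Hannett (USDAW), Michael Leahy (Community)"

# The escaped brackets match a literal ( and ); the unescaped pair between
# them is a group that captures the letters inside.
affiliations = re.findall(r"\(([A-Za-z]+)\)", sample)

dotted = "a.b"
any_char = re.findall(r".", dotted)      # unescaped: every character matches
literal_dot = re.findall(r"\.", dotted)  # escaped: only the full stop matches

# re.escape() builds a pattern that matches the given text literally.
found = re.search(re.escape("(USDAW)"), sample).group()

print(affiliations, any_char, literal_dot, found)
```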

‘Anything but that!’ – negative matches

You may have noticed that some of the expressions above – such as \S and \D – are negative matches, such as not a space, or not a digit. You can specify other negative matches using the caret character – ^ – within square brackets (it has to be placed first), like so:

  • [^a-z] – any character not lowercase a to z.
  • [^A] – any character other than a capital A.
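A quick sketch in Python's re module shows what these negative matches pick out, and why the caret has to come first. Python is my assumption here, and the second and third test strings are invented for illustration.

```python
import re

text = "Len McCluskey (UNITE)"

# Every character that is NOT a lowercase letter, joined back into a string.
not_lowercase = "".join(re.findall(r"[^a-z]", text))

# Every character other than a capital A.
not_capital_a = re.findall(r"[^A]", "ABBA")

# If the caret is not first, it is just an ordinary character in the set.
caret_as_literal = re.findall(r"[a^]", "a^b")

print(not_lowercase)
print(not_capital_a)
print(caret_as_literal)
```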

This or that – looking for more than one regular expression at the same time

The pipe symbol – | – can be used to separate two expressions that you want to match. For example, taking our regular expression to find the date:

/[0-9]+\s[a-zA-Z]+\s[0-9]+/

We can add a pipe symbol at the end of that (just before the last forward slash):

/[0-9]+\s[a-zA-Z]+\s[0-9]+|/

And then write a second regular expression in case we were looking for dates written a different way, e.g. ‘2nd July’ instead of ‘2 July’:

/[0-9]+\s[a-zA-Z]+\s[0-9]+|[0-9]+[nsrt][tdh]\s[a-zA-Z]+\s[0-9]+/

(In the second regex we still start with one or more numbers, but these would be followed by ‘st’ (e.g. ‘1st’, ’21st’), ‘nd’, ‘rd’ or ‘th’. The two square brackets specify the possible first and second letters of those.)

Try that regex in your scraper – you’ll see it still picks up the dates with the first part, even though none match the second part.
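The either/or behaviour can be checked with a small Python sketch (Python being my assumption, not part of OutWit). The suffix is written here as [nsrt][tdh], that is, the possible first letters and second letters of ‘st’, ‘nd’, ‘rd’ and ‘th’. The ‘2nd July’ and ‘21st March’ strings are hypothetical, since the page itself contains no dates in that style.

```python
import re

pattern = re.compile(
    r"[0-9]+\s[a-zA-Z]+\s[0-9]+"             # e.g. 4 April 2011
    r"|[0-9]+[nsrt][tdh]\s[a-zA-Z]+\s[0-9]+"  # e.g. 2nd July 2011
)

plain = pattern.search("<br />4 April 2011 &ndash;John Hannett").group()
ordinal = pattern.search("Meeting on 2nd July 2011").group()      # hypothetical
ordinal_st = pattern.search("Dinner on 21st March 2012").group()  # hypothetical

print(plain)
print(ordinal)
print(ordinal_st)
```

The two-letter sets are looser than they need to be (they would also accept ‘2sd’), which is in keeping with starting simple and tightening only if you get dodgy results.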

Only here – specifying location

You can also use the caret symbol outside of square brackets to specify that you are looking for something at the start of a string.

To specify that you’re looking at the end of a string, you can use the $ sign.

Here are some examples of both in action:

^[aeiou].* – match any string beginning (^) with a vowel [aeiou], and all text after (.*).

.*[0-9]$ – match any string ending ($) with a number [0-9], and all text before (.*).
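As a rough sketch, the same two patterns can be tried in Python's re module (again an assumption on my part). The list of strings is invented purely to show which ones pass each test.

```python
import re

# Invented strings for illustration.
strings = ["apple pie", "banana", "route 66", "echo 7"]

# ^ pins the vowel to the very start of the string.
starts_with_vowel = [s for s in strings if re.search(r"^[aeiou].*", s)]

# $ pins the digit to the very end of the string.
ends_with_digit = [s for s in strings if re.search(r".*[0-9]$", s)]

print(starts_with_vowel)
print(ends_with_digit)
```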

With all that covered, we can return to our scraper.

Back to the scraper: grabbing the rest of the data

Where were we? Oh yes, we’d written some regex in the first line of our scraper to grab the date, based on a marker before of:

>

And a Format of:

/[0-9]+\s[a-zA-Z]+\s[0-9]+/

Broken down, that regex means:

/ – regex follows…

[0-9] – we want to match: a digit from 0 to 9

+ – (one or more)

\s – followed by a space

[a-zA-Z]+ – followed by a lowercase or uppercase letter (one or more)

\s – followed by a space

[0-9]+ – followed by a digit from 0 to 9 (one or more)

/ – regex ends.

Now we can create a second line in our scraper to grab the names.

So, in that second line:

  1. Give it a Description of ‘Names’.
  2. We can use regex in the marker before and marker after columns too – so the simple option here is to copy the regex we created in the line before and use it as our Marker before: /[0-9]+\s[a-zA-Z]+\s[0-9]+/
  3. In Marker after put: <
  4. Click Execute.

Because we’re not specifying the format in this line, we don’t need to be too specific about what we’re grabbing. All we’re specifying is what comes before and after. OutWit allows a number of combinations of these:

  • marker before and a marker after
  • marker before and a format
  • marker after and a format
  • marker before, a marker after, and a format
  • marker before
  • format

Having clicked Execute you should see two columns: Date, and Names. The names have dashes before them, which is probably best cleaned in a spreadsheet tool. But if you did want to remove them, you could change your regex to the following:

/[0-9]+\s[a-zA-Z]+\s[0-9]+\s(&ndash;|-)/

After an extra space – \s – the extra regex here uses what’s called a sub-pattern. A sub-pattern is a piece of regex within your regex within brackets:

(&ndash;|-)

This says match either &ndash; or - with the either part specified by the ‘pipe’ symbol between the two possibilities: |

We’ve used this because although most lines use the code &ndash; a couple just use a dash (-) so we need to be able to match on either.
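To get a feel for what this marker before does, here is a rough Python approximation. It is an assumption about how to imitate OutWit's ‘marker before’ and ‘marker after’ with a single pattern, not a description of how OutWit works internally: the date-and-dash regex plays the marker before, and a second group grabs everything up to the next < to stand in for the marker after. The function name extract_names is hypothetical, the first test line is a shortened copy of the sample line, and the second is an invented variant with a plain hyphen.

```python
import re

# Date, space, the (&ndash;|-) sub-pattern, then a second group capturing
# everything up to (but not including) the next opening chevron.
NAMES = re.compile(r"[0-9]+\s[a-zA-Z]+\s[0-9]+\s(&ndash;|-)([^<]*)")

def extract_names(line):
    """Return the text after the date and dash, or None if nothing matches."""
    match = NAMES.search(line)
    return match.group(2) if match else None

with_code = "<br />4 April 2011 &ndash;John Hannett (USDAW), Michael Leahy (Community)<br />"
with_hyphen = "<br />4 April 2011 -John Hannett (USDAW)<br />"  # hypothetical variant

print(extract_names(with_code))
print(extract_names(with_hyphen))
```

Group 1 holds whichever dash was found and group 2 holds the names, which is why the function returns group(2).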

Which dash? Negative matches in practice.

There’s a final bit of data we need to grab: in addition to the date of each meeting and the names of the people there, some entries have a venue: ‘Dinner at the House of Commons’.

Looking closer, we see that the venue is always preceded by a dash. So we need an extra line to grab that.

For Description put ‘Venue’.

As for the marker before: look in the HTML for one of the venues such as ‘Dinner at Ed Miliband’s House’ and you’ll see &ndash; – the code for an ‘en dash’. Put this as your marker before:

&ndash;

And put an opening HTML chevron as your marker after:

<

Click Execute and you’ll see mixed results: it’s grabbing a bunch of names as well as the venue, because &ndash; is also used before names.

We need to work out what’s different about the two dashes. One difference is that the first dash always comes after a date: specifically, the year. So we can specify: a dash that does not come after a number. In marker before then, type the following regex:

/\D\s&ndash;/

This breaks down like so:

/ – this is regex

\D – any character other than a digit

\s – followed by a space

&ndash; – followed by &ndash; in the HTML code

/ – regex ends

For marker after put:

<

And click Execute

This has now worked for all but one of the entries that mention venues (April 4 2011: Dinner at the House of Commons). Why?

Try reading out the regex (or writing it down) literally as you go through the HTML code that’s not being matched:

  • Not a digit? Yes, that’s right.
  • A space? No.

That’s why the regex doesn’t create a match: there is no space before &ndash; in the code before ‘Dinner at the House of Commons’. Instead there is a closing bracket: )

We need to adapt our regex to be a little less fussy. So, instead of \s for ‘any space character’, use . for ‘any character’, which should make your regular expression look like this:

/\D.&ndash;/

Click Execute and the scraper should now work.
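The difference between the fussy and the less fussy version can be checked against the sample line in Python. As before, this is a sketch that imitates the marker before and marker after with one pattern and a capturing group, which is my assumption rather than OutWit's own mechanism. The variable names are illustrative.

```python
import re

# Sample line copied from the page source discussed above.
line = ("<br />4 April 2011 &ndash;John Hannett (USDAW), Michael Leahy (Community), "
        "Len McCluskey (UNITE), Paul Kenny (GMB), Gerry Doherty (TSSA), "
        "Billy Hayes (CWU)&ndash; Dinner, House of Commons<br />")

# Fussy version: a non-digit, then a space, then &ndash;
strict = re.search(r"\D\s&ndash;([^<]*)", line)

# Less fussy version: a non-digit, then any character, then &ndash;
relaxed = re.search(r"\D.&ndash;([^<]*)", line)

venue = relaxed.group(1).strip()
print(strict)
print(venue)
```

The first &ndash; is skipped by both patterns because the character two places before it is the final digit of the year. Only the relaxed pattern accepts the closing bracket in front of the second one.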

For more on regex look at Amit Arora’s introduction to regular expressions – one of the clearest you will find online. And The Bastards Book of Ruby has a good guide to Regex too – although bear in mind that there may be slight differences in the way it is used with other languages. More broadly, Regular-Expressions.info is one of the most comprehensive resources out there.

Recap

Regex is particularly intimidating to approach, and this chapter has represented a steep learning curve in trying to take it all in. As with so much code, you shouldn’t expect to learn it all by heart, but rather know what is broadly possible, where to go to remind yourself how to do it, and what techniques to use in thinking creatively around phrasing your regular expression. So here are the key points:

  • Regular expressions allow you to describe a pattern of characters – including invisible ones – in order to find a match.
  • You can be more or less strict in what you ask for: you can specify a particular character, or a range. You can specify one, or at least one, or a particular number, or a minimum and maximum, or from zero to many. You can specify anything but a particular range or type of character.
  • On paper, break up the code you’re trying to scrape to identify those patterns. Then build your regex up part by part to match each part: a date, a space, a month, and so on.
  • If you get more than one match on part of the pattern, then use negative matches (e.g. not a digit or not a space, etc.) to rule out the other match.
  • Regex consists of an identifier; a pattern; and a modifier. The first identifies it as regex, the second identifies what you’re looking for, and the last says how fussy you are about it.
  • You can use brackets to specify sub-patterns – regex within regex, such as a range of possible matches separated by the pipe symbol.
  • Finally, remember that trial and error is especially useful when using regex. Your skill is in solving the problems that many attempts will inevitably generate – not in getting it right first time.

13 thoughts on “Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more”

  1. Pingback: How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper | Online Journalism Blog

    1. Paul Bradshaw Post author

      Thanks – not sure what happened there. Anyway, I’ve now fixed it and the missing 1200 words are now included…

  2. Tony Timmins

    Hi Paul, two great articles they have certainly helped me understand regex as related to Outwit Hub. I have been using it for 12 months mainly to extract genealogy data for my one-name study. One problem that that I have is how to extract a table followed by other essential information that might be on the web page, without doing it in two hits i.e. using tables then scraper. Does your book cover this or have you also come across this problem?
    Tony

    1. Paul Bradshaw Post author

      Thanks Tony – yes, that problem is tackled in chapter 9 with OutWit, and later on with Python in Scraperwiki too.

  3. Doug S

    Great post – everyone ready for the pop quiz now? Seriously, regular expressions and the flavours thereof are big topics, but extremely useful for scraping and extracting bits you need. For focused scraping/crawling I (bragging) once built a glorified regular expression engine (in C as a server) – using regular expressions to both categorise articles – and link text – to decide whether or not to follow. But to use such systems one needs to know regular expressions. They can be beasts with a picky syntax (see above) which even programmers are sometimes reluctant to use. But using them is so helpful for scraper construction. (Not all do use them, interesting enough, depends on one’s stack and tastes; some do, some don’t, for example http://rosettacode.org/wiki/Web_scraping ) I always do.

  4. Cesar

    Hi Paul

    Thank you so much for this article I am having a great time learning a bit of regex to apply to an Outwit hub scraper.

    Here is a couple of example of the source code for two prices.

    1-$10.715410-$8.965650-$8.3826100-$7.9437500-$7.50151,000+-$7.3031

    3,000-$1.105115,000+-$1.0560

    I need to pick the high and low price out for each item. Without the quantity.
    High Price: 10.7154 Low Price: 7.3031
    High Price: 1.1051 Low Price: 1.0560
    As you can see it is an interesting task since quantities vary quite a lot, so do the number of price brackets.

    At this point I am thinking the best way will be to do after exporting the full price list.

    1-$10.7154; 10-$8.9656; 50-$8.3826; 100-$7.9437; 500-$7.5015; 1,000+-$7.3031

    It will be great to see if you have any tips!

    Thanks.

    Cesar

  6. Christiaan

    Hello Paul,
    Is it possible to read only a part of the page source?
    Lets say that OutWit Hub only uses the page source between the head tags?

  7. llamadojorge

    Hi Paul, this article has been around for quite a while and I’m almost a bit embarrassed to ask since nobody else seems to have noticed, but shouldn’t the escaped characters have a backslash in front? Or did I misunderstand that explanation completely? All I can see is a dot when there should be an escaped full stop, a s where there should be an escaped space character. That might be something WordPress automatically filters out for some reason, don’t know. Anyway, great to see an actual use case for them.

    1. Paul Bradshaw Post author

      Yes you’re right. Thanks for pointing that out – I’m guessing those backslashes have been removed in the transition, or possibly when I exported my site.
