When you get data in sentences: how to use a spreadsheet to extract numbers from phrases

Unduly lenient sentences review scheme inadequate

This BBC story involved converting phrases into numbers that could be used in calculations

Earlier this month the BBC Data Unit published a story on unduly lenient sentences which involved working with data that was trapped in phrases.

We needed to be able to take a collection of words such as “11 years and 5 months’ imprisonment” and convert that into something that could be used in spreadsheet calculations (specifically, comparing the lengths of time represented by two different phrases).

It’s a problem you come across every so often as a journalist — especially with FOI requests — so in this post — taken from the book Finding Stories in Spreadsheets — I’ll explain how to do that.

First, here’s what the data looks like:

Original Sentence	Revised Sentence
9 years’ imprisonment	11 years and 5 months’ imprisonment
2 years’ imprisonment suspended for 2 years	Sentence Unchanged
12 years’ imprisonment	14 years’ imprisonment
22 months’ imprisonment	Sentence Unchanged

(You can find the data and other background in a GitHub repo here.)

Break down the steps

The first thing to do in any situation like this is break down the task into its constituent problems/challenges. Well, we need to:

Identify where the number of years is stated
Extract that number of years
Identify where the number of months is stated
Extract that number of months
Identify other context like ‘suspended’, ‘minimum’ etc.
Convert years and months to a number
Convert years and months to a common measure (total months)

Identify where the years/months are detailed: using `SEARCH`

The function SEARCH will tell you where the first mention of a word appears. It’s case-insensitive:

=SEARCH("year",G14)

Translation: Search for where year appears in I2

This (written in cell I2) returns a position, e.g. 4. If it doesn’t find it, it returns #VALUE

We assume that the number of years appears 3 positions before that (one space, plus two digits). So we subtract 3 to get the position of the number of years.

=search("year",G14)-3

Extract the number of years/months (and correct for problems)

If it’s a single figure, we should get just the space before it, which we will deal with below. But if it’s at the start of a line, we won’t, and will get an error instead.

So we need a new column (L) to correct for that:

=IF(I14=0,1,I14)

Now nest that within a MID function to extract the numbers at that position:

=MID(H14, IF(I14=0,1,I14),2)

This grabs characters from that position, and continues grabbing for 2 characters.

However, this might also grab spaces if only one character is a digit.

Handling an unnecessary space

To correct that unnecessary space, and make sure the result is formatted as a number, nest this again in an INT function (which turns a value into an integer):

=INT(MID(I2, IF(K2=0,1,K2),2))

We could use the TRIM function, which removes trailing and leading white space, but this would not change the type of data that the space surrounds.

Note that this process involves a lot of trial and error: our initial formula works for most cells, but we then focus on error handling with the cells where it does not.

You can adapt and repeat the above two processes for months, weeks and days – and also for the revised years, months and so on – to get the numbers you need.

Converting to a common measure

Once you have columns extracting the numbers of years, months, weeks and days you need to be able to convert them to a common measure in order to perform any calculations. How much longer, for example, is 1 year and 2 months, than 0 years and 5 months?

The simple way to do this is to convert all measures to the lowest common denominator.

In this case that would be days, so:

Multiply the years figures by 365 to get days (we could use 365.25 to account for leap years, but it’s not relevant)
Multiply the months by 31 (again, we don’t need to use a decimal to account for the fact that month lengths vary – we only need a consistent figure that allows us to calculate differences)
Multiply the weeks by 7

Once this is done, we can add up all those figures for each sentence to get a total number of days for that sentence.

Repeat for the revised sentence and you now have two figures that you can compare to get a new figure: the difference between the two sentences.

That difference can now be averaged (for example, the average change in sentence by offence), and converted accordingly (years, months and weeks) if needed.

Manual cleaning: identifying unusual words

It’s worth adding a column which just measures the length of the description cell, so you can sort it and pick out outliers:

=LEN(I2)

The longest cell on that basis is this one:

“4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months”

These unusual entries may contain more than one mention of the word being searched for (“years”) or extra caveats which are important to factor in.

You decide to add a column checking for those key words:

=COUNTIF(I2,"*licence*")

This will count 1 if ‘licence’ is in that cell, or 0 if it is not. The asterisks are wildcards which mean ‘any or no characters’.

Adding a ‘checking’ formulae

You can use the same function to check whether a sentence mentions years or months:

=COUNTIF(I2,"*year*")

Any 0 value should also match any #VALUE error in your earlier SEARCH formula that looked for “year”.

You can use filters to check that they all do, and investigate any that don’t.

This may lead you to extra cleaning steps, or in most cases you might decide it is quicker to manually correct or clarify those entries.

Sometimes hard work ends up left out of the story

Despite all this work to extract the numbers from the data, in the end it was decided to leave this dimension out of the eventual story, as it became clear that the story was going to focus on the proportion of requests which were not eligible for review.

It’s important to mention this because often in journalism — not just data journalism — you have to be prepared to leave out material because the focus of your story has changed, regardless of the work that went into it. Ultimately it doesn’t matter how much work something involved: if it’s not central to the story, then it shouldn’t be there.

UPDATE: Tony Hirst used this post as the basis for an exploration of the same techniques in Python using the Python language quantulum3 (see also the Comments)

@paulbradshaw wrt When you get data in sentences: how to use a spreadsheet to extract numbers from phrases, it intrigued me so I had a quick play…https://t.co/VBk1Gi2Obh #ddj

— Tony Hirst (@psychemedia) August 6, 2019

4 thoughts on “When you get data in sentences: how to use a spreadsheet to extract numbers from phrases”

Online Journalism Blog

Comment, analysis and links covering online journalism and online news, citizen journalism, blogging, vlogging, photoblogging, podcasts, vodcasts, interactive storytelling, publishing, Computer Assisted Reporting, User Generated Content, searching and all things internet.

When you get data in sentences: how to use a spreadsheet to extract numbers from phrases

Break down the steps

Identify where the years/months are detailed: using `SEARCH`

Extract the number of years/months (and correct for problems)

Handling an unnecessary space

Converting to a common measure

Manual cleaning: identifying unusual words

Adding a ‘checking’ formulae

Sometimes hard work ends up left out of the story

4 thoughts on “When you get data in sentences: how to use a spreadsheet to extract numbers from phrases”

Leave a comment Cancel reply

Break down the steps

Identify where the years/months are detailed: using SEARCH

Extract the number of years/months (and correct for problems)

Handling an unnecessary space

Converting to a common measure

Manual cleaning: identifying unusual words

Adding a ‘checking’ formulae

Sometimes hard work ends up left out of the story

Share this:

Related

4 thoughts on “When you get data in sentences: how to use a spreadsheet to extract numbers from phrases”

Leave a comment Cancel reply

Identify where the years/months are detailed: using `SEARCH`