In an extract from a new chapter in the ebook Finding Stories in Spreadsheets, I explain what regular expressions are — and how they can be used to extract information from spreadsheets.The ebook version of this tutorial includes a dataset and exercise to employ these techniques.
The story was an unusual one: the BBC Data Unit had been given access to a dataset on more than 200,000 works of art in galleries across the UK. What patterns could we find in the data that would allow us to tell a story about the nature of the nation’s paintings?
Some of the data was straightforward to work with: the ‘artist’ column was relatively clean, and allowed us to identify the most common male and female artist. It turned out that the latter – the Victorian botanist Marianne North – was relatively unknown. So, that was one story we could tell.
But other parts of the data were more problematic. The date column, for example, contained inconsistently formatted data: in the majority of cases a specific year had been entered, but in many others the data contained text such as “18th century” or “1900-1920” or “1800s”.
We also noticed that monarchs featured heavily in the art – but understandably there was no column that was specifically dedicated to classifying those. If we wanted to identify the most-painted monarchs we would have to create new data that somehow extracted those names from the paintings’ titles.
These problems – extracting data from existing data, particular text data – are what regular expressions are designed for. In this chapter I will explain what regular expressions are, and how to use them in spreadsheets.
It’s quite common when working with Google Sheets to have data set to US format (Month-Day-Year) without realising it. This is because Google will format your dates based on what ‘locale’ or language you have set – and the default is US English.
Instructions on how to change that are here – but what if it’s too late? What if you’ve already inputted or imported data which, when updated to a different format, will make it the wrong date?Continue reading →
Right at the start of my book on Excel for journalists I talk about sorting data to find out which values come top or bottom. However, there is a family of functions which will give you a lot more control in finding out not just who is top or bottom, but the rank of any value in any series of values.
This is particularly useful if you want to compare ranks.
Many stories are based on finding out where your own country or region ranks in the latest data
Ranking isn’t just about statistics – it can be used in consumer stories too
For example, say you had a table showing school performance across the last two years.
Each table shows the percentage of pupils achieving the top grades in that year. You can use RANK to find out what rank each percentage would have placed the school in for each year. Continue reading →
“How do I calculate an age in Excel?”Marion Urban, a French journalist and student on the MA in Online Journalism in Birmingham, was preparing data for the forthcoming UK General Election.
In order to do this Marion had downloaded details on the candidates who had stood successfully in the previous election.
“It was a very young intake. But it wasn’t easy to calculate their ages.”
Indeed. You would think that calculating ages in Excel would be easy. But there is no off-the-shelf function to help you do so. Or at least, no easy-to-find function.
Instead there are a range of different approaches: some of them particularly, and unnecessarily complicated.
In this extract from Finding Stories in Spreadsheets I will outline one approach to calculating ages, which also illustrates a useful technique in using spreadsheets in stories: the ability to break down a problem into different parts. Continue reading →
My latest ebook – Finding Stories in Spreadsheets – is now live on Leanpub.
As with Scraping for Journalists, I’m publishing the book week-by-week so the book can be updated based on reader feedback, user suggestions and topical developments.
Each week you can download a new chapter covering a different technique for finding stories, from calculating proportions and changes, to combining data, cleaning it up, testing it, and extracting specific details.
There’s also a downloadable spreadsheet at the end of each chapter with a series of exercises to practise that chapter’s technique and find particular stories.
Along the way I tackle some other considerations in telling the story, such as context and background, and the importance of being specific in the language that you use.
The book has been written in response to requests from journalists who need a book on Excel aimed at storytellers, not accountants.
Finding Stories In Spreadsheets will outline a range of techniques, including ways to find the ‘needle in the haystack’ in text data, number calculations to make stories clearer, and methods of cleaning and combining data to tell new stories, including getting data ready for maps and charts.