Tag Archives: excel

How to combine two datasets to put a story into context (book extract)

One of the most common challenges in a data-driven story is combining two sets of data — such as events and populations — to put a story into context. In an extract from the ebook Finding Stories in Spreadsheets, I explain how to use lookup functions to combine two tables. The longer ebook version of this tutorial includes a dataset and exercise to employ these techniques.

Combining data is often a great way of telling new stories about spreadsheets. For example: you may have one table showing pass rates for each school in an area, and another table showing their addresses. Combining these would allow you to identify geographical patterns, or to place them on a map.

You could also combine the addresses with poverty rates for different locations, or unemployment to see if there’s a possible relationship (remembering that correlation does not equal causation), or to identify the schools performing particularly well despite local conditions. In the video below, for example, I walk through an example of combining data on different sports teams’ attendances with data on their rankings, allowing you to see who’s attracting large crowds despite their poor performance.

The VLOOKUP function is one of the most widely-used tools in combining data in this way. It stands for Vertical lookup, and means that the spreadsheet will look up and down a column (i.e. vertically) for whatever you ask it. In more recent versions of Excel the XLOOKUP function has been introduced to make the process easier — but the process is similar for both.

Continue reading

What is dirty data and how do I clean it? A great big guide for data journalists

Image: George Hodan

If you’re working with data as a journalist it won’t be long before you come across the phrases “dirty data” or “cleaning data“. The phrases cover a wide range of problems, and a variety of techniques for tackling them, so in this post I’m going to break down exactly what it is that makes data “dirty”, and the different cleaning strategies that a journalist might adopt in tackling them.

Four categories of dirty data problem

Look around for definitions of dirty data and the same three words will crop up: inaccurate, incomplete, or inconsistent.

Dirty data problems:
Inaccurate: Data stored as wrong type; Misentered data; Duplicate data; abbreviation and symbols.
Incomplete: Uncategorised; missing data.
Inconsistent: Inconsistency in naming of entities; mixed data
Incompatible data:  Wrong shape;
‘Dirty’ characters (e.g. unescaped HTML)

Inaccurate data includes duplicate or misentered information, or data which is stored as the wrong data type.

Incomplete data might only cover particular periods of time, specific areas, or categories — or be lacking categorisation entirely.

Inconsistent data might name the same entities in different ways or mix different types of data together.

To those three common terms I would also add a fourth: data that is simply incompatible with the questions or visualisation that we want to perform with it. One of the most common cleaning tasks in data journalism, for example, is ‘reshaping‘ data from long to wide, or vice versa, so that we can aggregate or filter along particular dimensions. (More on this later).

Continue reading

What are regular expressions — and how to use them in Google Sheets to get data from text

In an extract from a new chapter in the ebook Finding Stories in Spreadsheets, I explain what regular expressions are — and how they can be used to extract information from spreadsheets. The ebook version of this tutorial includes a dataset and exercise to employ these techniques.

The story was an unusual one: the BBC Data Unit had been given access to a dataset on more than 200,000 works of art in galleries across the UK. What patterns could we find in the data that would allow us to tell a story about the nature of the nation’s paintings?

Some of the data was straightforward to work with: the ‘artist’ column was relatively clean, and allowed us to identify the most common male and female artist. It turned out that the latter – the Victorian botanist Marianne North – was relatively unknown. So, that was one story we could tell.

ukart

But other parts of the data were more problematic. The date column, for example, contained inconsistently formatted data: in the majority of cases a specific year had been entered, but in many others the data contained text such as “18th century” or “1900-1920” or “1800s”.

We also noticed that monarchs featured heavily in the art – but understandably there was no column that was specifically dedicated to classifying those. If we wanted to identify the most-painted monarchs we would have to create new data that somehow extracted those names from the paintings’ titles.

These problems – extracting data from existing data, particular text data – are what regular expressions are designed for. In this chapter I will explain what regular expressions are, and how to use them in spreadsheets.

Continue reading

How to: uncover Excel data only revealed by a drop-down menu

Sometimes an organisation will publish a spreadsheet where only a part of the full data is shown when you select from a drop-down menu. In order to get all the data, you’d have to manually select each option, and then copy the results into a new spreadsheet.

It’s not great.

In this post, I’ll explain some tricks for finding out exactly where the full data is hidden, and  how to extract it without getting Repetitive Strain Injury. Here goes…

The example

fire data dropdown

To get the data from this spreadsheet you have to select 51 different options from a dropdown menu

The spreadsheet I’m using here is pretty straightforward: it’s a list of the populations for each fire and rescue authority in the UK (XLS). These figures are essential for putting any story about fires into context (giving us a per capita figure rather than just whole numbers) — and yet the authority behind the spreadsheet has made it very difficult to extract those numbers. Continue reading

How to: analyse your Twitter or Facebook analytics for the best days or times to post

Twitter’s analytics service is a useful tool for journalists to understand which tweets are having the biggest impact. The dashboard at analytics.twitter.com provides a general overview under tabs like ‘tweets’ and ‘audiences’, and you can download raw data for any period then sort it in a spreadsheet to see which tweets performed best against a range of metrics.

However, if you want to perform any deeper analysis, such as finding out which days are best for tweeting or which times perform best — you’ll need to get stuck in. Here’s how to do it. Continue reading

How to: fix spreadsheet dates that are in both US and UK formats

640px-Date_format_by_country.svg

This map by Artem Karimov shows which countries use which data formats

It’s quite common when working with Google Sheets to have data set to US format (Month-Day-Year) without realising it. This is because Google will format your dates based on what ‘locale’ or language you have set – and the default is US English.

Instructions on how to change that are here – but what if it’s too late? What if you’ve already inputted or imported data which, when updated to a different format, will make it the wrong date? Continue reading

Mapping tip: how to convert and filter KML into a list with Open Refine

Original image by GotCredit/Flickr
Original image by GotCredit/Flickr

If you are working with map data that uses the shapes of regions or countries, chances are you’ll need to work with KML. In this guest post (first published on her blog) Carla Pedret explains how you can use the data cleaning tool Open Refine to ‘read’ KML files in order to convert them into other formats (for example to grab the names of places contained in the file).

KML (Keyhole Markup Language) is the default format used by Google’s mapping tool Fusion Tables (Google bought the company which created it in 2004), but it is also used by other mapping tools like CartoDB.

The open source data cleaning tool Open Refine can help you to open, process and convert KML files into other formats in order to, for example, match two datasets (VLOOKUP) or create a new map with the information of the KML file.

What is the difference between XML and KML?

In this post, you will learn how to convert a KML file into XML and download it as aCSV file.

XML – Extensible Markup Language –  is a language designed to describe data and it is used in RSS systems.

XML uses tags like HTML, but there is a big difference between both languages. XML defines the structure of the information, whereas HTML focuses on other elements too, including their meaning and arrangement (even when it is not supposed to focus on appearance), and the importing of other code and media.

KML – Keyhole Markup Language – documents are XML files specific for geographical annotations. KML files contain the parameters to add shapes to maps or three-dimensional Earth browsers like Google Earth.

The big advantage of KML files is the users can customize the maps according to their data and without knowing how to code.

Image of a KML map in Google Fusion Tables
Image of a KML map in Google Fusion Tables

Convert a KML file to XML

You can find KML files using Google Tables search (make sure you have ‘Fusion Tables’ secleted on the left).

Type what you are searching for and add the word geometry or KML.

Captura de pantalla 2016-02-28 a las 19.59.01

Open the fusion table and check that it has shapes by looking for a ‘map’ view (normally this has its own tab).

You should be able to download the KML when looking at that map view by selecting File > Download.

Once downloaded, to convert the file, upload your KML in Open Refine (download Open Refine here) and click Next.KML 1

In the blue box under your data, select XML files.

KML2

Now in the preview you can see the XML file with the structure of the information.

If you want to create a map with your own data and the shapes in the KML file, you need to match the KML with your data.

The example I have used contains the shapes of local authorities in the UK. I want to match the shapes in one dataset (the KML file) with information in another dataset on which party runs each council.

The element both datasets have in common (and therefore the element which will be used to combine them) is the name of the councils. But you need to check that those elements are the same: in other words, are the councils named in exactly the same way in both datasets, including the use of ampersands and other characters?

Have a look at the XML preview and try to find the tags that contain the information you need: in this case, authority names. In the example the tags containing the authority name are <name></name>.

Hover over that element so that you get a dotted box like the one shown below. Click on that rectangle and wait until the process has finished.
Captura de pantalla 2016-02-28 a las 20.29.06

You should then see a column or columns as the picture shows.

Captura de pantalla 2016-02-28 a las 20.33.07

On the right hand side of the page, change the name of your file and click on Create a new project.

Once created, you now only need to export it. Click on Export and select the format you prefer.

KML 5

What originally was a KML file is now a filtered list with data ready to check and match against your other dataset.

Do you use Open Refine? Leave a comment with your tips and techniques or send it to me at @Carlapedret..

Continue reading

How to: calculate or find rankings in spreadsheets using RANK, LARGE and SMALL

The ebook version of this tutorial includes a dataset and exercise to employ these techniques.

Right at the start of my book on Excel for journalists I talk about sorting data to find out which values come top or bottom. However, there is a family of functions which will give you a lot more control in finding out not just who is top or bottom, but the rank of any value in any series of values.

This is particularly useful if you want to compare ranks.

Pakistan ranking story

Many stories are based on finding out where your own country or region ranks in the latest data

Consumer ranking story

Ranking isn’t just about statistics – it can be used in consumer stories too

For example, say you had a table showing school performance across the last two years.

Each table shows the percentage of pupils achieving the top grades in that year. You can use RANK to find out what rank each percentage would have placed the school in for each year. Continue reading

HOW TO: Find out the ages of people using Excel

excel for journalists ebook

This post is taken from the ebook Finding Stories With Spreadsheets

“How do I calculate an age in Excel?” Marion Urban, a French journalist and student on the MA in Online Journalism in Birmingham, was preparing data for the forthcoming UK General Election.

In order to do this Marion had downloaded details on the candidates who had stood successfully in the previous election.

“It was a very young intake. But it wasn’t easy to calculate their ages.”

Indeed. You would think that calculating ages in Excel would be easy. But there is no off-the-shelf function to help you do so. Or at least, no easy-to-find function.

Instead there are a range of different approaches: some of them particularly, and unnecessarily complicated.

In this extract from Finding Stories in Spreadsheets I will outline one approach to calculating ages, which also illustrates a useful technique in using spreadsheets in stories: the ability to break down a problem into different parts. Continue reading

Finding Stories in Spreadsheets – ebook now live!

Finding stories in spreadsheets book cover

Cover design by Matt Buck/Drawnalism

My latest ebook – Finding Stories in Spreadsheets – is now live on Leanpub.

As with Scraping for Journalists, I’m publishing the book week-by-week so the book can be updated based on reader feedback, user suggestions and topical developments.

Each week you can download a new chapter covering a different technique for finding stories, from calculating proportions and changes, to combining data, cleaning it up, testing it, and extracting specific details.

There’s also a downloadable spreadsheet at the end of each chapter with a series of exercises to practise that chapter’s technique and find particular stories.

Along the way I tackle some other considerations in telling the story, such as context and background, and the importance of being specific in the language that you use.

If there’s anything you’d like covered in the book let me know. You can also buy the book in a ‘bundle’ with its sister title Data Journalism Heist, which covers quick-turnaround techniques for finding stories in spreadsheets using pivot tables and advanced filters.