Double counting: how to spot it and how to avoid it

Double counting — counting something more than once in data — can present particular risks for journalists, leading to an incorrect total or proportion. Here’s how to spot it — and what to do about it.

Look at the following chart showing the gender of teachers in UK schools, based on data on teacher headcounts. Notice anything wrong? (There are at least two problems)

Pie chart: Sex of teachers in UK schools
There are three visible slices: male, female and 'total', which takes up more than half of the pie.

The most obvious problem is that the chart appears to be ‘comparing apples with oranges’ (things that aren’t comparable). Specifically: “male”, “female”, and “unknown” are similar categories which can fairly be compared with each other, but “total” is a wider category that contains the other three.

I’ve used a pie chart here to make it easier to spot: we expect a pie chart to show parts of a whole, not the whole as well as its parts.

But the same problem should be obvious from the same data in a table before visualising it:

Table showing headcount of teachers in the categories: female, male, unknown, total, plus a grand total of all four at the bottom

The table shows us that we have both a “Total” and a “Grand Total”. This is a red flag. There can only be one total, so if there’s more than one that’s a strong sign of double counting.

Why is this happening? We need to take a look at the data.

Spotting ‘apples and oranges’ problems in data

The data comes from the School workforce in England dataset — specifically a file called ‘workforce_teacher_characteristics_2010_2024_nat_reg_la.csv’ (there are multiple files in the dataset).

It measures a number of different things about the school workforce, including sex, ethnicity, age group and grade — and it provides classifications by different types of geography and time period. There are over 420,000 rows in that file.

A common approach when presented with such detailed data is to create a pivot table. This allows you to simplify the data by aggregating it into a table showing totals by a particular category. That’s how the table shown above was created.

In a pivot table, whatever measure you want to group by (in this case the sex of teachers) goes in the Rows area of the pivot table.

Whatever you want to total for each of those categories goes in the Values area of the pivot table (in this case there is a column called ‘headcount’ that we want to ‘sum’ for each sex)

And that’s what I did to get the data for the chart.

But because the ‘sex’ column in this data looks at different levels of categorisation, we need to make sure that we filter the data to remove non-comparable categories.

In a pivot table you can do that by using the dropdown menu at the top of your rows to remove any categories/rows not relevant to the analysis. In this case, removing the ‘Total’ category. If we update our chart with that data, it now looks like this:

Table showing headcount of teachers in the categories: female, male, unknown, plus a grand total of all three at the bottom — Note the dropdown next to Row Labels, which has been used to filter the rows to remove ‘Total’

When the chart is updated with the new filtered data, it looks like this:

Pie chart: Sex of teachers in UK schools
There are two visible slices: male (just under a quarter) and female (three quarters). The number of female teachers is 349,218,274

Counting across multiple measures

It doesn’t look so obviously problematic — but look at the figures and ask yourself if they seem credible to you.

According to the chart, there are almost 350 million female teachers in UK schools.

To put that into context, consider (or find out) what the entire UK population is.

Spoiler alert: it’s less than 350 million people. So it’s impossible for there to be that many teachers.

What’s happening here? It’s another red flag that there is probably more double counting that we need to fix.

At this point we need to look at the original data and try to understand it better. (This is a good lesson in the importance of exploring and understanding the data properly before doing any analysis, as this will prevent many mistakes from happening in the first place).

Specifically, we want to identify where any double counting might exist in the data. Here is a list of the columns:

time_period
time_identifier
geographic_level
country_code
country_name
region_code
region_name
old_la_code
new_la_code
la_name
school_type
grade
sex
age_group
working_pattern
qts_status
on_route
ethnicity_major
ethnicity_minor
full_time_equivalent
headcount
fte_percent_within_ethnicity
headcount_percent_within_ethnicity

A useful tip when exploring data is to turn on the Filter option (in the Data menu). This will create a dropdown menu at the top of each column. When you click on any dropdown you will see a list of the unique values in that column. So, for example, clicking on the dropdown for the “sex” column would have shown you that there was a “Total” category alongside the other three.

When we do that for each column, we can see most have opportunities for double counting:

time_period: the data covers a number of years, so any analysis has the potential to add up all the teachers in all of the years
geographic_level: the data covers local, regional and national levels. That means teachers will be counted three times, at each of those levels.
school_type: there are a number of levels here, such as nursery and primary schools being counted separately and jointly. You need to pick just one category (‘Total state-funded schools’ is the top level) to get all schools without including sub-categories that introduce double counting.
grade: there are four different grades, from head teacher to assistant, and once again a ‘Total’ category which counts those four grades all together. To avoid double counting we need to filter out the total, or just pick that alone.
age_group: the same problem exists with this column and the next five:
working_pattern
qts_status
on_route
ethnicity_major
ethnicity_minor

Measures with potential for double counting

In total there are 10 columns with opportunities for double counting. We can classify these as follows:

Any column that uses a ‘Total’ category (eight of our columns, including school_type which uses ‘Total state-funded schools’)
Any column that uses categories and sub-categories that fall inside those (e.g. in geographic_level a nation contains regions that contain local authorities).
Any column that uses different time periods.

To properly avoid double-counting we need to filter on all the columns where double counting can occur.

There are two main ways to do this: by using the pivot table’s filtering options; or by filtering the original dataset first*, and making a copy of that subset (pivoting further if needed).

Here is what the pivot table filtering option looks like when putting those 10 columns into the Filter box, and then setting each of those filters to remove double counting:

Table showing headcount of teachers in the categories: female, male, unknown, plus a grand total of all three at the bottom. Above that table are 10 rows of filters.

Those figures look more believable, totalling around half a million teachers.

Here’s the final chart using those figures that avoid double counting:

Pie chart: Sex of teachers in UK schools
There are two visible slices: male (just under a quarter) and female (three quarters)

The moral of the story

The key thing to remember here isn’t how to deal with school workforce data in particular — it’s to incorporate a number of processes into your reporting when dealing with data:

Check whether your data is ‘comparing apples with oranges’, i.e. putting categories in the same column that are not directly comparable, such as regions with nations.
Check for columns that use ‘Total’ or ‘All’ as a category alongside sub-categories like specific age ranges, pay grades, etc.
Use filters to ensure you are only calculating with categories at the same level. The simplest way to do this is to select the top-level category (‘Total’, ‘All’, ‘National’ etc) and deselect everything else.
Sense-check the draft figures or charts: does that figure sound right when compared to what you would expect, e.g. when compared to the population as a whole? Do the categories make sense next to each other?
Look out for figures which add up to a neat 50% of the whole, or half of the pie: that’s a big clue that they represent the total of everything else.

*Note: I mentioned that there were two ways to filter out double counting. The second way is to use the same filters in the main data itself, instead of in the pivot table — in fact, if the data uses ‘Total’ as a category it often means there’s no reason to use a pivot table at all, because they’ve already aggregated the figures (which is what a pivot table does) as one of the measures. See how to do this in the video below, which also walks through the steps outlined in the post above.

Online Journalism Blog

Comment, analysis and links covering online journalism and online news, citizen journalism, blogging, vlogging, photoblogging, podcasts, vodcasts, interactive storytelling, publishing, Computer Assisted Reporting, User Generated Content, searching and all things internet.

Double counting: how to spot it and how to avoid it

Double counting — counting something more than once in data — can present particular risks for journalists, leading to an incorrect total or proportion. Here’s how to spot it — and what to do about it.

Spotting ‘apples and oranges’ problems in data

Counting across multiple measures

Measures with potential for double counting

The moral of the story

Leave a comment Cancel reply

Double counting — counting something more than once in data — can present particular risks for journalists, leading to an incorrect total or proportion. Here’s how to spot it — and what to do about it.

Spotting ‘apples and oranges’ problems in data

Counting across multiple measures

Measures with potential for double counting

The moral of the story

Share this:

Related

Leave a comment Cancel reply