How Wayback Machine and a sitemap file were used to factcheck Dominic Cummings

Before and after images of Cummings' blog post text

Among the many claims made by UK Government adviser Dominic Cummings in his press conference on Monday was one that could be easily checked.

As evidence that he took the threat from coronavirus seriously he said that he’d written about the danger of coronaviruses last year.

“For years I’ve warned of the dangers of pandemics. Last year I wrote about the possible threat of coronaviruses and the urgent need for planning.”

Before the press conference was over, that claim had already been proven to be false, thanks to some underused journalistic tools of verification: the Wayback Machine and sitemap.

Here’s how it was done — and how journalists can use the same tools in their work, whether it’s to verify a claim made about the past, a claim about what was not said in the past, or to uncover details that may have been unwittingly revealed in earlier versions of webpages.

1. Using search operators to find the page(s) you need to check

site:dominiccummings.com coronavirus

First, it was important to establish where any mentions of coronavirus might have been made on Dominic Cummings’s blog.

The site: search operator is the most obvious way to do this: put the domain (in this case dominiccummings.com) after site: (with no space in between) followed by the keyword(s) that you’re looking for, e.g.:

site:dominiccummings.com coronavirus

That particular search throws up just three results, and only one of those is a blog post (the category page and homepage merely contain the same post).

Sure enough, it does mention coronavirus. But blog posts and webpages can be edited or backdated, so how do we see whether that’s happened here?
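If you need to run the same kind of search repeatedly, the query from this step can be assembled in code. This is only a minimal sketch: the helper name site_search_url is made up for illustration, and it simply builds the address of an ordinary Google results page that you then open in a browser.

```python
from urllib.parse import urlencode


def site_search_url(domain: str, keywords: str) -> str:
    """Build a Google search URL limited to one domain via the site: operator.

    The function name is hypothetical; it only formats a normal search URL.
    """
    # No space between "site:" and the domain, then the keyword(s).
    query = f"site:{domain} {keywords}"
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("dominiccummings.com", "coronavirus"))
```

Pasting the printed address into a browser should give the same results as typing the query into the search box.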

2. Using Wayback Machine to see whether the page was stored in the past — and what it looked like

Wayback Machine view for Cummings website

Years are shown across the top; a calendar underneath will have days when a snapshot was taken highlighted in blue

You can put that URL into the Wayback Machine’s search box to see if it has any snapshots stored, and how far back those go.

If it has taken any snapshots, it will provide a timeline which allows you to navigate between different years (across the top). Underneath is a calendar for that year with any days when snapshots were taken highlighted in blue.
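The same list of snapshots can also be requested as data from the Wayback Machine's CDX server rather than read off the calendar. The sketch below is one possible approach, not the method used in this investigation: the function names are my own, the page URL and sample rows are invented placeholders, and the network call is defined but not run.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(page_url, from_year=None, to_year=None):
    """Build a CDX query asking for every capture of page_url as JSON."""
    params = {"url": page_url, "output": "json"}
    if from_year:
        params["from"] = str(from_year)
    if to_year:
        params["to"] = str(to_year)
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON response (a header row, then data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def list_snapshots(page_url, **kwargs):
    """Fetch and parse the capture list (needs network access)."""
    with urlopen(build_cdx_url(page_url, **kwargs), timeout=30) as response:
        return parse_cdx_rows(json.load(response))


# Invented sample rows in the shape the CDX server returns, for illustration.
SAMPLE_ROWS = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/hypothetical-post", "20190331000000",
     "https://example.com/hypothetical-post/", "text/html", "200", "AAAA", "1000"],
    ["com,example)/hypothetical-post", "20200503000000",
     "https://example.com/hypothetical-post/", "text/html", "200", "BBBB", "1100"],
]

for snap in parse_cdx_rows(SAMPLE_ROWS):
    # Timestamps are YYYYMMDDhhmmss; the digest changes when the content changes.
    print(snap["timestamp"], snap["digest"])
```

Calling list_snapshots() with the real post URL would return one entry per capture. A change in the digest value between two captures is a hint that the stored page content differs.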

For Cummings’s post we can see that Wayback Machine’s earliest snapshot is March 31, 2019 — about 3 weeks after the datestamp, appearing to confirm that it was written then. However, we also need to click on that earliest snapshot to check what it looked like.

We can also see that 27 snapshots have been taken of the post between that one and May 27, 2020. To check the last snapshot of this post before the pandemic, we can navigate to 2019 and click on one of the later dates there — October 28, for example, when two snapshots were taken.

Using Edit>Find in the browser (or CTRL+F) we can search for the word coronavirus. There are no results: it’s not in the page.

We can then repeat this process for later entries to see when the reference to coronavirus was added.
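Repeating that check by hand across dozens of snapshots is tedious, so it can be scripted. The following is a rough sketch under several assumptions: the function names are hypothetical, the timestamps reuse dates mentioned in this post with made-up times, and the page texts are invented stand-ins so the example runs without touching the network.

```python
from typing import Callable, Iterable, Optional
from urllib.request import urlopen


def snapshot_url(timestamp: str, page_url: str) -> str:
    """Address of one archived capture (timestamp is YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def fetch_snapshot(timestamp: str, page_url: str) -> str:
    """Download the archived HTML (needs network access)."""
    with urlopen(snapshot_url(timestamp, page_url), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def first_snapshot_mentioning(
    timestamps: Iterable[str],
    page_url: str,
    keyword: str,
    fetch: Callable[[str, str], str] = fetch_snapshot,
) -> Optional[str]:
    """Return the earliest timestamp whose page text contains keyword."""
    for ts in sorted(timestamps):
        if keyword.lower() in fetch(ts, page_url).lower():
            return ts
    return None


# Invented page texts standing in for real captures.
FAKE_PAGES = {
    "20191028000000": "A post about laboratory safety.",
    "20200409000000": "A post about laboratory safety.",
    "20200503000000": "A post about laboratory safety and coronavirus.",
}
PAGE = "https://example.com/hypothetical-post/"


def fake_fetch(timestamp: str, page_url: str) -> str:
    return FAKE_PAGES[timestamp]


print(first_snapshot_mentioning(FAKE_PAGES, PAGE, "coronavirus", fetch=fake_fetch))
```

Leaving out the fetch argument makes the function download each real capture in turn. As with the manual check, the result only tells you the first stored snapshot containing the word, not the day the edit was made.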

3. Using Changes to compare differences

Even better, we can use Wayback Machine’s Changes feature to show us all the differences between two snapshots.

If you navigate back to the calendar view in Wayback Machine you can see the Changes button across the top.

The Changes view displays snapshots in a calendar too, but this time covering multiple years and with colour indicating the degree of ‘variation’ between each snapshot, so it is a more useful way of navigating snapshots.

The Changes view of Wayback Machine: a calendar with a heatmap showing how many changes were made

The Changes view presents snapshots in a grid with colour indicating the degree of changes made between snapshots

When viewed using Changes, we can see that some changes were made on May 3 2020, as well as Dec 17 and July 1 2019.

Once you click on one of those snapshots the legend at the bottom changes from ‘variation’ to ‘Distance’ and the colours of all the other snapshots change to show how different each is from that snapshot.

If you then click on a second snapshot you can now see the dates of both snapshots in the two upper corners of the screen — blue on the left and orange on the right.

The ‘Compare’ button between those, above the calendar, now becomes active.

Selecting snapshots in the Changes view, with a Compare button active

Clicking the Compare button will load a new page with the two snapshots side by side. Any additions between the two versions will be highlighted in yellow (on the left, below) and deletions will be shown in blue (on the right).

Some of these differences will be normal or unremarkable — the ‘recent posts’ section has changed, for example — but scrolling down you will find passages where text has been changed.

New text added in yellow

The snapshot on the left is more recent than the one on the right, and text has been added.

This, then, is where we can see the reference to coronavirus being added.

It’s important to emphasise that this does not mean that the addition was made on May 3rd — all we can say at this point is that the change was made at some point between October 28th and May 3rd.

We can narrow that date range further, however, by using the dropdown menus above each snapshot, allowing you to change year, month, and date: note that the date dropdown will only show options in the currently selected month.

If we move the right-hand snapshot forward to 2020, then the month dropdown to April, and then use the date dropdown to select the latest available snapshot from that month (April 9th), we can compare that to the May 3rd update on the left.

Make sure to click Show differences once more to run that comparison.

We can now see that the change is still being shown — meaning that it was made between those two dates.
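If you have saved the text of two snapshots, a comparison along the same lines can be approximated locally with Python's difflib module. This is a substitute technique rather than how the Wayback Machine computes its comparison, and the two texts below are invented placeholders, not the wording of the actual post.

```python
import difflib


def added_and_removed(older_text: str, newer_text: str):
    """Return (added_lines, removed_lines) between two versions of a text."""
    older_lines = older_text.splitlines()
    newer_lines = newer_text.splitlines()
    added, removed = [], []
    for line in difflib.ndiff(older_lines, newer_lines):
        if line.startswith("+ "):      # present only in the newer version
            added.append(line[2:])
        elif line.startswith("- "):    # present only in the older version
            removed.append(line[2:])
    return added, removed


# Placeholder texts standing in for an earlier and a later snapshot.
OLDER = "First paragraph.\nSecond paragraph."
NEWER = "First paragraph.\nSecond paragraph.\nNew paragraph mentioning coronavirus."

added, removed = added_and_removed(OLDER, NEWER)
print("Added:", added)
print("Removed:", removed)
```

Comparing line by line keeps the output readable for long posts, though a paragraph that was only lightly edited will show up as one removal plus one addition.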

4. Using sitemap to pin down a last modified date

The site’s sitemap file can give us more information. On WordPress sites you can view the sitemap file by adding /sitemap.xml to the end of the homepage URL, i.e.

https://dominiccummings.com/sitemap.xml

Those familiar with WordPress will recognise Cummings’s site as using that platform, but many other websites run on WordPress less obviously, so it’s always worth trying.

It’s important to view that sitemap.xml URL in a browser that’s able to handle XML files — Firefox and Chrome, for example, rather than Safari or Internet Explorer.

The XML file won’t be pretty — but it will be searchable, and that’s what really matters here.

You should now be able to search the sitemap.xml file for the URL of the content you’re interested in.

The XML searched for the URL, with one match highlighted

There is a match — in fact, it’s right at the top of the file.

What does this tell us? Well the sitemap.xml file contains data about the pages on the site. The XML format means that the data is structured as a series of branches — one for each URL. Each branch starts with the <url> tag and ends with the closing </url> tag.

Indented and contained within each <url> tag are different pieces of data about that URL. In this case, there are four:

Clearly the <lastmod> tag is what we are interested in here: it says that the page at that URL was last modified on the 14th April.

Specifically, it was modified on 2020-04-14T20:55:20+00:00: that is, the year 2020, in the fourth month (April), 14th day, at 20:55 and 20 seconds. The +00:00 part is the offset from Coordinated Universal Time (UTC), which is effectively Greenwich Mean Time. An offset of zero means the time is given in UTC itself; because the UK was on British Summer Time (UTC+1) in April, that corresponds to 21:55 UK local time.
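Searching the file by eye works, but the lookup can also be scripted. The sketch below uses Python's standard XML parser on a small sample sitemap: the page URLs are placeholders, only the <loc> and <lastmod> tags are shown, and the one real value is the timestamp quoted above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Sitemap files declare this namespace, so tag lookups must include it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample data: placeholder URLs, with the lastmod value quoted in this post.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/hypothetical-post/</loc>
    <lastmod>2020-04-14T20:55:20+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/another-post/</loc>
    <lastmod>2019-03-04T10:00:00+00:00</lastmod>
  </url>
</urlset>"""


def last_modified(sitemap_xml: str, page_url: str):
    """Return the <lastmod> of page_url as a datetime, or None if absent."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if loc == page_url:
            lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
            # fromisoformat understands both the date and the +00:00 offset.
            return datetime.fromisoformat(lastmod.strip()) if lastmod else None
    return None


modified = last_modified(SAMPLE_SITEMAP, "https://example.com/hypothetical-post/")
print(modified.date(), modified.time(), "UTC offset:", modified.utcoffset())
```

For a live site you would download the sitemap.xml text first and pass it to last_modified() along with the exact URL of the post.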


5. Google tools: searching for other pre-pandemic mentions of coronavirus by Cummings

So far we’ve focused our investigation on Cummings’s own blog — but we might want to widen our scope to see if he made any reference to coronaviruses anywhere online before the current pandemic.

We can do this — at least as far as Google’s database goes — by using Google’s Tools functionality to specify the timeframe that you want to search.

Start by doing your search as normal on Google — “Cummings coronavirus” for example. As expected, your results will be dominated by recent news stories and material related to the current pandemic.

It would be very difficult to sift through these to find anything pre-dating the pandemic.

Click on the Tools button under the search bar and to the right.

Google search bar with Tools clicked and Any time selected

A second level of options now opens up, which includes “Any time”.

Click on that to open up a dropdown where you can select a timeframe. The last option is Custom range…

This will open up a calendar where you can specify a start and end date for your custom range: in this case we might choose a range from January 1 2010 to December 1 2019.

Note that this date range refers to the dates on which Google indexed the page, not necessarily the date on which it was published or, crucially, last updated.

The results from this search demonstrate that weakness very well: profiles of Cummings written in previous years, for example, but since updated to include his actions during the pandemic. Or topic pages created previously and updated in a similar way with new articles. Or old articles which have links to newer ones in the side navigation.

However, you can sometimes wade through that, or adapt your search with search operators to filter it further, in order to find relevant material.
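One way to adapt the search is to put the date range into the query itself, using Google's before: and after: operators instead of the Tools menu. The sketch below swaps in that technique and only builds the search URL; the helper name dated_search_url is hypothetical, and the same caveats about how Google dates pages still apply.

```python
from datetime import date
from urllib.parse import urlencode


def dated_search_url(keywords: str, after: date, before: date) -> str:
    """Build a Google search URL limited by the after: and before: operators.

    The function name is made up; dates are written as YYYY-MM-DD.
    """
    query = f"{keywords} after:{after.isoformat()} before:{before.isoformat()}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# The custom range suggested above: January 1 2010 to December 1 2019.
print(dated_search_url("Cummings coronavirus", date(2010, 1, 1), date(2019, 12, 1)))
```

The operators can be combined with site: in the same query to narrow the results to a single domain.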

6. A stronger position to question from

All of this provides you with a basis for questioning, further exploration (what happened on April 14?) and reporting (here, for example, is the BBC article on the false claim the day after, where No. 10 was approached for a response). It may not answer every question, but it certainly puts you in possession of more facts than would otherwise be the case.
