Journalists need their own archives. Here’s how to start one

Last week I wrote about the problem with trusting Twitter to keep a public record of all tweets. But it’s not just social networks; we can’t trust any website to keep information on our behalf.

Three recent articles highlight the problem particularly well.

Google loses interest and links rot

First up, Andy Baio, who wrote of Google’s abandonment of its archiving ambitions in ‘Never trust a corporation to do a library’s job’:

“After a series of redesigns, Google Groups is effectively dead for research purposes. The archives, while still online, have no means of searching by date.

Google News Archives are dead, killed off in 2011, now directing searchers to just use Google.

Google Books is still online, but curtailed their scanning efforts in recent years, likely discouraged by a decade of legal wrangling still in appeal. The official blog stopped updating in 2012 and the Twitter account’s been dormant since February 2013.”

Towards the end of last year Mario Tedeschini-Lalli wrote about how his work on the CNNitalia website was now completely gone from the internet:

“Its full four years of coverage – including 9/11 – are nowhere to be found, just like almost all of the journalism produced by myself or by the digital newsrooms I managed in the last 18 years.”

Then Alexios Mantzarlis wrote about a plugin which aimed to help prevent ‘linkrot’ on WordPress websites.

“[…], which launched in 2004 now has almost 6,000 dead links. Roughly one third of all the links on Pagella Politica, the Italian fact-checking website I edited before joining Poynter, are currently broken. At the same time, trying to manually keep tabs on the state of a site’s links is too time-consuming to be feasible.”

How to create your own archive

Installing a plugin is just one approach – but you need to be able to install a plugin on your site, and it only applies to pages that you link to – not, for example, the pages you are writing.
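If you cannot install a plugin, the basic check that such tools perform can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the plugin's actual method: the names `extract_links` and `link_status` are hypothetical, and the sketch assumes a simple HEAD request is enough to tell whether a page is still there (some servers reject HEAD requests, so treat the result as a hint rather than a verdict).

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep web links only; skip mailto:, javascript: and in-page anchors.
        if href and href.startswith(("http://", "https://", "/")):
            # Resolve site-relative links against the page's own address.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the web links found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def link_status(url, timeout=10):
    """Return the HTTP status code, or None if the server cannot be reached."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # for example 404 for a page that has gone
    except OSError:
        return None  # DNS failure, refused connection, timeout and so on
```

Running `link_status` over the output of `extract_links` for each of your own posts gives a rough list of links worth archiving or replacing before they disappear.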

Here, then, are some other techniques for reducing the chances that your work and research disappears from the web.

Automated archiving

In my post about Twitter I mentioned the tool IFTTT for automatically storing a copy of someone’s tweets. That same tool can be used to automatically archive documents and webpages.

For example, IFTTT has a number of recipes which will automatically save email attachments to your Dropbox, Google Drive, or other cloud storage services.

[Image: IFTTT emails to Dropbox]

But it also has recipes which will save files at a particular URL to the same services. Here’s one which backs up Reddit posts.

[Image: save Reddit posts with IFTTT]

You can use these recipes to back up documents or webpages at any link that you share on social media. The example below uses Google Drive but you could also use OneDrive, Evernote, etc.

[Image: IFTTT Twitter to Drive]

Unfortunately the saved files are named only with the user’s name and a timestamp, so you’ll need to use other techniques if you need to search the webpages’ contents (here’s one).

[Image: IFTTT backup results]
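If you are comfortable running a script yourself, one way around the naming problem is to save pages under a filename built from the date, the site and the page title. The following is a rough sketch rather than anything IFTTT provides: the function names and the `archive` folder are assumptions made for illustration, and it stores only the raw HTML, not images or other embedded content.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def page_title(html):
    """Return the text of the <title> element, or an empty string."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def slugify(text, max_length=60):
    """Reduce text to lowercase words joined by hyphens, safe for filenames."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_length].rstrip("-")


def archive_filename(url, html, fetched_at):
    """Build a name of the form YYYYMMDD-site-title.html."""
    host = slugify(urlparse(url).netloc)
    title = slugify(page_title(html)) or "untitled"
    return f"{fetched_at:%Y%m%d}-{host}-{title}.html"


def save_page(url, folder="archive"):
    """Download url and write it into folder under a descriptive name."""
    with urlopen(url, timeout=30) as response:
        raw = response.read()
    # Decode leniently: the text is only needed to find the title.
    html = raw.decode("utf-8", errors="replace")
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    path = target / archive_filename(url, html, datetime.now(timezone.utc))
    path.write_bytes(raw)  # keep the bytes exactly as the server sent them
    return path
```

Pointing `folder` at a directory that Dropbox or Google Drive already syncs gives you the same cloud copy, but with filenames you can scan by eye.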

Social bookmarking services with caching features

Another option is to use a social bookmarking service like Delicious or Pinboard. I am a big fan of these services because they make a significant difference in the time needed to re-find documents and reports that you have previously seen (and indeed in some cases forgotten about).

Some of these services offer ‘archiving’ or ‘caching’ functionality, where they will store a copy of any bookmarked webpages, typically for an annual fee.

I use Pinboard, which works out at $25 per year. Historious says that caching is included in all its plans, including the free one, while Diigo includes ‘unlimited caches’ in its $5 per month ‘Standard’ package.

Pinboard claims in a now five-year-old post that those plans are not directly comparable, however, with some not working with PDFs or embedded content, so check what you need and what’s provided. They still claim that “Pinboard is the only site that stores and indexes full page content, not just HTML.”

Diigo’s plan, meanwhile, offers the facility to upload a page “even if it is dynamic or hidden behind the password protection” while “you can also capture multiple versions of the same URL at different times.”

Combine the two

Of course Pinboard, Delicious and Diigo are also channels on IFTTT, and you can use that to back up your bookmarks too. Just follow the same instructions as above for uploading URLs to Google Drive or other services.

Indeed, it may be possible to use IFTTT to create a free alternative to the paid-for caching services – but remember this will be harder to search, and will not include embedded content such as images.
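The search problem is less daunting if the saved pages end up in a folder on your own computer (for example one synced from Dropbox or Google Drive). As a hedged sketch, assuming the copies are stored as `.html` files in a single folder, a few lines of Python can strip the markup and look for a term in what remains. The names `visible_text` and `search_archive` are made up for this example, and the tag stripping is deliberately crude.

```python
import re
from html import unescape
from pathlib import Path


def visible_text(html):
    """Roughly strip scripts, styles and tags, leaving the readable text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def search_archive(folder, term, context=40):
    """Yield (filename, snippet) for each saved page whose text contains term."""
    needle = term.lower()
    for path in sorted(Path(folder).glob("*.html")):
        text = visible_text(path.read_text(encoding="utf-8", errors="replace"))
        position = text.lower().find(needle)  # case-insensitive match
        if position != -1:
            start = max(0, position - context)
            end = position + len(term) + context
            yield path.name, text[start:end]
```

For instance, `for name, snippet in search_archive("archive", "budget"): print(name, snippet)` lists every saved page that mentions the word, with a little surrounding text. It is no substitute for a service that indexes full page content, but it may be enough for a personal archive.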

Still third party

Of course, any of these recipes still rely on a third party: wherever you are storing these files, whether that’s Google Drive, Dropbox, Pinboard or Diigo.

But it does bring all your material into one place (or more than one place for extra redundancy!), and makes it much more likely that you will have an opportunity to move that material if you need to: if a service closes down it’s likely they will notify users so that they can export material.

Oh, and if IFTTT closes down, check out the alternative, Zapier.

UPDATE: How to save backup copies of webpages and social media

Paul Myers has written two posts on saving online webpage evidence and archiving evidence from social media platforms. Both are well worth a read for a more basic introduction to this area.



5 thoughts on “Journalists need their own archives. Here’s how to start one”

  1. Mario Tedeschini-Lal (@tedeschini)

    Thanks for quoting my (sad) story, Paul. Of course, organizing one’s own archive is important, but I think we also need some sort of “institutional” tool/place to go to if we need to retrieve something that we didn’t think we would need when it was first published.

    1. Paul Bradshaw Post author

      Absolutely. The broader issue of archiving more complex news projects which involve code hosted externally (for example old Flash projects) is certainly something that needs addressing.

  2. Zibaldot

    I know it is not an actual proper archive but what about the Wayback Machine from the Internet Archive? There are screenshots of pages from CNNItalia too, some with functioning links (checked one from 2003). I assume you might have already checked it, so sorry if I am repeating in that case.

