3 concepts from archive studies that every data journalist should know

Some useful frameworks for judging data from archival field outlined by @JamesLowryRAI at #datajustice18 in relation to Kenyan open data – including provenance (in that case opaque), custody (undocumented) and curation (no processes noted)

— Paul Bradshaw (@paulbradshaw) May 22, 2018

Until last month I hadn’t heard of diplomatic studies. It’s the discipline of studying historical documents, and comes from the word ‘diploma’, as in ‘verifying that someone hasn’t faked their records’ (I’m paraphrasing here). But this discipline of verification has some useful lessons for journalists — particularly data journalists — because it provides a very handy framework for picking apart what makes a record (data) credible, and what we should be looking out for when establishing that.

Particularly useful are three terms used to distinguish different aspects of a record’s credibility: authenticity, reliability, and accuracy.

Luciana Duranti’s paper on electronic records (PDF) defines each of the three concepts in depth and, although she notes that the terms are given different meanings in different sectors, her definitions are worth exploring in detail…

Has it been faked? Authenticity

"Documents seldom lie, but because it’s written, it doesn’t mean it’s true. You’ll need to verify your #opensources like you would check human sources." @EmmanuelFreuden's advice & tips in finding open source secrets https://t.co/DOZCZh9Cej pic.twitter.com/iFoAkP6GB7

— Global Investigative Journalism Network (@gijn) May 30, 2018

Authenticity is perhaps the easiest concept to start with: this refers to whether a record is what it claims to be. In other words, whether it has been faked, or tampered with.

This is a separate quality from whether the record is factually accurate or true. For example, a person might have a fake passport (not authentic) but the facts on the passport are true (name, gender, age, and so on); or a regulator might publish a report on a company which is authentic (it was indeed produced by that organisation, and not tampered with) but is full of lies or omissions.

Duranti specifies that for a record to be authentic:

“It must be possible to ascertain at all times what a record is, when it was created, by whom, what action or matter it participated in, and what its juridical/administrative, cultural, and documentary contexts were. It must also be possible to ascertain the wholeness and soundness of the record: whether it is intact or, if not, what is missing.”
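One narrow part of that test, whether a digital file is intact, can be checked mechanically when the publisher lists a checksum alongside the download. The sketch below is only an illustration under that assumption (many publishers do not provide one); the function names and the sample data are hypothetical.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """True if the data hashes to the digest the publisher listed."""
    return sha256_of_bytes(data) == published_hex.strip().lower()


# Hypothetical example: the bytes of a released dataset, and the digest
# the publisher would have listed for it at the time of release.
original = b"region,count\nNorth,12\nSouth,7\n"
published = sha256_of_bytes(original)

# The same file with a single figure changed afterwards.
altered = b"region,count\nNorth,12\nSouth,5\n"

print(matches_published_digest(original, published))  # True
print(matches_published_digest(altered, published))   # False
```

A match only shows that the file has not changed since the digest was published. It says nothing about who created the record, in what context, or whether its contents are true.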

Is it true? Reliability and accuracy

The factual nature of a record is split into two separate qualities – reliability and accuracy – and the distinction is instructive.

Reliability, says Duranti, refers to the trustworthiness of a record “as a statement of fact.” It exists, she writes,

“When a record can stand for the fact it is about, and is established by examining the completeness of the record’s form and the amount of control exercised on the process of its creation.”

These criteria of completeness and control take us into the territory of ‘How to lie with statistics’.

So, for example, a record can be true but incomplete, because the organisation collecting that information did not ask particular questions or gather particular pieces of evidence (intentionally or not).

As Peter Hirtle explains in his paper on archival authenticity:

“Only records that are complete can ensure accountability and protect personal rights. As soon as records become incomplete, their authority is called into question. For example, when information is missing in a record, we do not know if it is because the information was never created or because it has been discarded. Individual records must be complete; they must contain all the information they had when they were created. They must also maintain their original structure and context.”

An organisation can similarly exercise varying degrees of control over the collection of information, choosing more or less statistically rigorous methods of collection, or phrasing questions to encourage a particular response, or treating non-responses in a particular way, or classifying concepts in a way which may not be the same as others classify those concepts. Recent examples include the classification of deaths in police custody, or child poverty.
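A first, rough check on completeness is simply to count how often each field in a dataset has been left blank. The sketch below assumes the data arrives as a CSV; the column names and rows are invented for illustration and do not come from any real dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical extract: blank cells stand for information that was never
# recorded, or was later discarded (the file alone cannot tell us which).
SAMPLE = """case_id,date_of_death,cause,local_authority
1,2018-01-03,,Area A
2,,overdose,
3,2018-02-11,hypothermia,Area B
"""


def missing_by_field(csv_text: str) -> Counter:
    """Count blank values for each named column in a CSV string."""
    missing = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field, value in row.items():
            if field is None:
                continue  # surplus cells beyond the header row
            if value is None or not value.strip():
                missing[field] += 1
    return missing


for field, count in missing_by_field(SAMPLE).items():
    print(f"{field}: {count} blank")
```

A count like this shows where gaps are, not why they exist. Questions about control (how questions were phrased, how categories were defined, how non-responses were treated) still have to be put to the organisation that collected the data.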

Hirtle also articulates some of the issues around evaluating original documents versus copies:

“Creating a copy always introduces the possibility for variation or change from the original … [but] there are times when a copy may be more reliable than an original. For example, a contract for the sale of the house that is copied into the deed books of a village government may be more reliable than the original, because a third, impartial, authority can attest to the agreement of the parties represented in the contract.”

Accuracy, in contrast, refers to the truthfulness of the content of the record. This, Duranti says, “can only be established through content analysis.”

Put another way, records can be authentic (unmanipulated) and reliable (gathered rigorously and conscientiously with no political agenda), but still not be accurate.

Perhaps more usefully for journalists, records can be accurate in relation to the things they relate to but not reliable (because they are missing crucial pieces of information, or are misleading in the questions that they asked). It is at that point that the data journalist might decide to embark on their own data collection exercise to counter the biases or omissions of the organisations who are making claims of truth that could be described as “unreliable”.

Digital challenges

The crisis facing trust in records in the digital era is articulated well by Duranti. It used to be the case, for example, that accuracy was “usually inferred on the basis of the degree of the records’ reliability and … only verified when such degree [was] very low.” But Duranti points out that:

“The volatility of the digital medium, the ease of change, editing, and the difficulty of version control, all make it harder to presume accuracy on the traditional bases.”

On reliability, similarly, she notes that:

“Records generated using new information technologies make increasingly difficult to determine when a record is complete and whether the controls established on its creation are either sufficient or effective for anyone to be able to assume its reliability.”

And on authenticity she argues:

“The pervasiveness of increasingly complex, fast-changing computer technology is making the authenticity of electronic records very hard to demonstrate and to preserve [in] the face of incompatibility and obsolescence.”

All the more need, then, for a framework which allows us to more systematically judge the qualities of the records that we are dealing with (Peter Hirtle’s paper does outline some attempts at improving digital authentication within the archiving community).

Archives are not just facts, but records of action

Aside from the reliability of records it’s worth drawing one more lesson from the literature of archive studies, summed up by Peter Hirtle:

“A document may contain lies, errors, falsehoods, or oversights – but still be evidence of action by an agency. Nor does a record have to be particularly interesting or important, or even something that anyone would ever want to consult again. Pure archival interest in records depends not on their informational content, but on the evidence they provide of government or business activity.”

Hirtle’s paper includes another useful lesson regarding the importance of collections of records:

“By itself, a case file can tell the user a great deal, but it does not reveal whether the individual in question was treated differently from other people in the same situation. To understand a single record in context, one needs the whole series. There may be references from the case file to other records in the same series. Whenever possible, therefore, archivists seek to preserve entire series.”

Records in journalism

1 – Thread on the Home Office destroying the #Windrush landing cards. Here's how the system should work. Under the Public Records Act, govmt departments must transfer archive records to @UkNatArchives within 30 (now 20) years.

— Dr. Bendor Grosvenor 🇺🇦 (@arthistorynews) April 17, 2018

These lessons are played out in a number of recent stories. In the Windrush scandal, records existed for people who arrived in this country, and the story was not about their reliability or authenticity. In Reveal’s investigation into the undercounting of injuries in Tesla factories, classification was changed (controlled) in a way which reflected the priorities of the business.

It is also the lesson of the Bureau Local’s recent investigation into homeless deaths: information that should have been collected when a homeless person dies was not. Government inaction, official carelessness and callousness, extends to the decisions that are made about what to record, how to record it, and how to secure it for the future. Diplomatic studies gives us a readymade way to think about those stories.

Thanks to Bill Hodgson in the comments for pointing out that it is not diplomatics plural, but diplomatic (studies) in the singular. Updated July 19 2018.

7 thoughts on “3 concepts from archive studies that every data journalist should know”

CARL R D'Agostino June 5, 2018 at 8:58 pm

Great article esp in reference to all the fake news posts by Hillary/Obama hater and Trump haters. People rarely seem to check the accuracy of really bizarre and absurd posts. Some really don’t care as long as it legitimizes what they want to believe or value its negative attack quality more than they value the truth. I try to be very careful. It was a little embarrassing to have a quote attributed to Hillary Clinton on my facebook page proven false (I am a Hillary-Billary hater). It stimulated a lot of discourse and some demanded I take down the post. I refused because I felt it would be cowardly and my integrity demanded that I allow the comments disproving the quote remain public. So I took the hit. As the article points out false news is often posted by otherwise credible sources and one may be duped into sharing it because posted by a source that seems credible.

Bill Hodgson July 6, 2018 at 8:31 am

It’s actually ‘diplomatic’ in the singular.

Paul Bradshaw (post author) July 19, 2018 at 9:23 am
  
  Thanks for that Bill – I’ve corrected the article.
  