Ethics in data journalism: privacy, user data, collaboration and the clash of codes

This is the second in a series of extracts from a draft book chapter on ethics in data journalism. The first looked at how the ethics of accuracy play out in data journalism projects. This is a work in progress, so if you have examples of ethical dilemmas, best practice, or guidance, I’d be happy to include them with an acknowledgement.

Gun permit holders map – image from Sherrie Questioning All

Hacks/Hackers: collaboration and the clash of codes

Journalism’s increasingly collaborative and global nature in a networked environment has raised a number of ethical issues as contributors from different countries and from professions outside of journalism – with different codes of ethics – come together.

This collaborative spirit is most visible in the ‘Hacks/Hackers’ movement, where journalists meet with web developers to exchange tips and ideas, and work on joint projects. Data journalists also often take part in – and organise – ‘hack days’ or ‘hackathons’ aimed at opening up and linking data and creating apps, or work with external agencies to analyse data gathered by either party.

This can lead to culture clashes around differing ethical principles. In his seminal volume Hackers: Heroes of the Computer Revolution (1984), for example, Steven Levy outlines normative values that hackers should adhere to, most relevantly: sharing; openness; decentralisation; and improving the world. Journalists not adhering to these values – for example, those who do not share their data or who are not open about the processes involved in acquiring or analysing it – could find themselves accused of unethical behaviour by their collaborators.

In these cases it is important to be aware of different parties’ ethical assumptions and clear about the journalist’s role. For example, in the collaborations between Wikileaks and various news organisations, The Guardian, along with The New York Times and Der Spiegel, wanted to retain a gatekeeping role and “wanted to make sure that we didn’t reveal the names of informants or unnecessarily endanger Nato troops” (Rogers, in Gray, Bounegru and Chambers, 2012). However, the very structured nature of the data still allowed them to make most of it available by simply removing the fields that could identify individuals:

“We wanted to make it simpler to access key information, out there in the real world — as clear and open as we could make it … we [allowed] our users to download a spreadsheet containing the records of every incident where somebody died, nearly 60,000 in all [but] removed the summary field so it was just the basic data: the military heading, numbers of deaths and the geographic breakdown … we couldn’t be sure the summary field didn’t contain confidential details of informants and so on.” (Rogers, in Gray, Bounegru and Chambers, 2012)
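
As an illustration of that kind of redaction step – not The Guardian’s actual code, and with hypothetical file and column names – dropping a sensitive free-text field from structured data before publication might look like this in Python:

```python
import pandas as pd

# Hypothetical file and column names - the point is the redaction step,
# not The Guardian's actual workflow.
incidents = pd.read_csv("war_logs.csv")

# Keep only incidents in which somebody died, and drop the free-text
# summary field, which might contain details identifying informants.
deaths = incidents[incidents["total_deaths"] > 0].drop(columns=["summary"])

deaths.to_csv("war_logs_deaths_public.csv", index=False)
```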

In the end, however, a journalist did play a part in leaking the complete data – an episode that serves as a lesson in protecting sources. In his book on Wikileaks (Leigh and Harding, 2011), Guardian journalist David Leigh included a password of Assange’s which turned out to be “the full passphrase to WikiLeaks’ copy of the encrypted, unredacted cables”:

“To Leigh, the PGP password might have seemed like a harmless historical detail … He later said that Assange had told him the password was defunct [but to Assange] and any other hacker, revealing a password represented a glaring security breach. Those familiar with PGP know that when a file is encrypted to a particular key, the key will always open a copy of that encrypted file and thus can never be revealed. Secret keys remain secret for life.” (Greenberg, 2012)
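
This is not PGP, but a minimal sketch using Python’s cryptography library illustrates the general point Greenberg makes: once a file has been encrypted with a particular key, any copy of that ciphertext can be opened with that key indefinitely, so declaring a passphrase ‘defunct’ does nothing for copies already in circulation.

```python
from cryptography.fernet import Fernet

# Not PGP, but the same underlying principle: a key that encrypted a
# file will decrypt any copy of that ciphertext, indefinitely.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"unredacted cables")

# Years later the same key still opens any surviving copy of the file;
# declaring it "defunct" changes nothing for copies already circulating.
print(Fernet(key).decrypt(ciphertext))  # b'unredacted cables'
```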

Privacy

Privacy is another area where ethical codes differ between occupations and countries.

The FarmSubsidy investigation into recipients of EU farm subsidies is a particularly good example of this. The investigation was a collaboration between journalists in different countries with differing concepts of ‘personal data’. They used those countries’ access to information laws to identify individuals receiving multimillion-euro subsidies, before a judgement of the Court of Justice of the European Union limited publication on the basis of protecting personal privacy (Zijlstra, 2011). The judgement is illustrative of the ethical tensions involved:

“The Council and the Commission had to look for methods of publishing information that might also obtain the objective of transparency, but not have the same impact on the privacy of the beneficiaries involved. This could include limiting the publication by name of the beneficiaries according to the periods for which they received aid rather than maintaining it for 2 years. As the Council and the Commission did not consider such measures, the provisions … were considered invalid insofar as they imposed an obligation to publish personal data relating to each beneficiary without drawing a distinction based on relevant criteria such as the periods during which those persons have received such aid, the frequency of such aid or the nature and amount thereof.”

In the UK there already exists guidance on when a similarly ‘private’ piece of data – salaries – should be disclosed (Information Commissioner’s Office, 2009 – PDF). The guidance identifies “a legitimate public interest” in general pay bands of “more senior staff who are responsible for major policy and financial initiatives” but not junior staff with little power or influence. It acknowledges: “There could be factors that weigh in favour of greater disclosure, such as legitimate concerns about corruption or mismanagement, or situations in which senior staff set their own or others’ pay.”

That seems a sensible guideline to adopt in publishing (more recent judgements have also recommended disclosing specific remuneration).

In the US personal data is more accessible, but that does not mean its publication is free of ethical issues. Following a mass shooting in Connecticut, the Journal News in New York decided to publish an interactive map of pistol permit holders in the area, based on Freedom of Information requests. The map led to a backlash from readers and calls for a boycott of the paper.

Kathleen Culver notes that the publication could have considered the ethical issues more carefully. The ethical principle of ‘minimising harm’, for example, was not served by making it easy for criminals to identify which homes did not have a pistol (or at least a licence for one) and were therefore less well defended – or conversely, which ones had a gun they might steal. The principle of accuracy may have been ignored by publishing data which was out of date, and the principle of context was overlooked in not comparing the licence ownership to anything else (such as gun use, or patterns of ownership). Culver suggests the map should not have identified individuals but rather aggregated totals, allowing the map to still serve the public interest while avoiding the attendant risks (Culver, 2013).
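
A minimal sketch of that aggregation step, assuming a hypothetical permits file with a ZIP code column, might look like this:

```python
import pandas as pd

# Hypothetical file and column names. Instead of plotting each permit
# holder's address, publish counts per ZIP code (or another area).
permits = pd.read_csv("pistol_permits.csv")

by_zip = permits.groupby("zip_code").size().reset_index(name="permit_count")

by_zip.to_csv("permits_by_zip.csv", index=False)
```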

A similar example is Tampa Bay Mug Shots, which pulls a feed from police websites of people charged with a crime. There is no context, and no corrections for those found innocent. Nora Paul, the director of the Institute for New Media Studies at the University of Minnesota, felt the site’s lack of context meant it “borders on journalistic malpractice” (Milian, 2009), although some precautions were taken “so that the data do not haunt the alleged criminals forever”:

“Every listing is hidden after 60 days from booking, and the developers have taken technical precautions to ensure that Google’s search engine won’t crawl and index the pages.”
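
The site’s actual implementation is not described beyond that sentence, but a sketch of those two precautions – expiring listings 60 days after booking, and asking search engines not to index the pages (via the standard robots meta tag) – could look something like this, with hypothetical field names:

```python
from datetime import date, timedelta

# Hypothetical record structure; the site's real code is not public.
listing = {"name": "John Doe", "booking_date": date(2025, 1, 10)}

# Precaution 1: hide every listing 60 days after booking.
visible = date.today() - listing["booking_date"] <= timedelta(days=60)

# Precaution 2: ask search engines not to index the page, so listings
# never surface in a search for the person's name.
robots_tag = '<meta name="robots" content="noindex, nofollow">'

if visible:
    print(robots_tag)
    print(f"<h1>{listing['name']}</h1>")
```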

Even anonymised data can reveal private information. When the Kansas City Star analysed a federal database of healthcare practitioners and combined it with other records, they found 21 doctors with multiple malpractice payments who had never been disciplined.

“[The reporter] had performed broad research of courts, state agencies and hospital actions, “allowing them to connect the dots” to individual doctors. But he said the federal database itself did not reveal identities.” (Wilson, 2011)
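
The general ‘connecting the dots’ technique – joining an anonymised dataset to other public records on shared fields – can be sketched as follows, with hypothetical files and column names rather than the Star’s actual data:

```python
import pandas as pd

# Hypothetical files and columns: an 'anonymised' payments dataset with
# no names, and a separate public record that does carry names.
payments = pd.read_csv("malpractice_payments.csv")    # state, specialty, payment_year, amount
court_records = pd.read_csv("court_actions.csv")      # doctor_name, state, specialty, payment_year

# Joining on the shared quasi-identifiers can narrow 'anonymous' records
# down to named individuals.
linked = payments.merge(court_records, on=["state", "specialty", "payment_year"])
print(linked[["doctor_name", "amount"]].drop_duplicates())
```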

Once again, the judgement to be made is between public interest and individual privacy. In this case, given the absence of any disciplinary action, there was a clear case for disclosure.

Sophie Hood describes the change brought about by the digitisation of public records as

“The difference between the practical obscurity of a paper record versus the ubiquity of an electronic record … The information is the same, but it is radically altered by the novel ways in which it flows. When documents are published online, there is more access but there is also more susceptibility for error … what is at stake is different — dramatically so.” (Kinstler, 2013)

Ultimately, in a paper with Helen Nissenbaum, she returns to the importance of context:

“Instead of characterizing privacy as control over personal information, or as the limitation of access to information, [contextual integrity] characterizes privacy as conformance with appropriate flows of information, in turn modeled by the theoretical construct of context-relative (or context-specific) informational norms. When information is captured or disseminated in ways that violate informational norms, privacy as contextual integrity is violated.” (Kinstler, 2013)

Identification does not always take place in the published data itself. A story in the UK’s Birmingham Evening Mail based on spending data, headlined ‘City council spends nearly £1m on bed and breakfast rooms for Birmingham’s homeless’, not only named one recipient but also led with prominent photographs of two hotels.

Given the vulnerability of the groups staying in these hotels, it might be asked whether the impact of publishing such details was justified. In this case, although the identities of recipients added concrete detail, they were not essential to the story, which was about the body spending the money. Having an internal list of groups considered to be vulnerable might assist in making quick decisions in the newsroom in similar situations.

A similar example comes from an investigation into Olympic torchbearers. The author received an anonymous tip-off that two daughters of one sponsor’s chief executive had been given Olympic torchbearer places. This did indeed turn out to be the case, but as one of the daughters was under 16 the decision was taken by one national newspaper not to name the executive because it would identify the minor.

User data

Journalists sometimes look for stories in data on the users of their own site or service. In one particularly high-profile example, a journalist at the financial news service Bloomberg noticed that one of the service’s users had not logged in for some time. He picked up the phone and asked the user’s employer – Goldman Sachs – whether the person had left the company. The question led to a complaint from Goldman Sachs and a growing scandal about journalists’ use of data on 300,000 users of Bloomberg services.

“Bloomberg staffers could determine not only which of its employees had logged into Bloomberg’s proprietary terminals but how many times they had used particular functions,” reported the New York Post. Spokesman Ty Trippet pointed out that the data did not include “security-level data, position data, trading data or messages”, but it did include the behaviour of fellow workers and bosses (Seward, 2013). The controversy was significant enough to lead the company to announce that journalists would no longer have access to client log-in activity, and to review its internal standards.

In the next part I look at mass data gathering. If you have examples of ethical dilemmas, best practice, or guidance, I’d be happy to include them with an acknowledgement.