This is the third in a series of extracts from a draft book chapter on ethics in data journalism. The first looked at how ethics of accuracy play out in data journalism projects, and the second at culture clashes, privacy, user data and collaboration. This is a work in progress, so if you have examples of ethical dilemmas, best practice, or guidance, I’d be happy to include it with an acknowledgement.
Mass data gathering – scraping, FOI, deception and harm
The data journalism practice of ‘scraping’ – getting a computer to capture information from online sources – raises some ethical issues around deception and minimisation of harm. Some scrapers, for example, ‘pretend’ to be a particular web browser, or pace their scraping activity more slowly to avoid detection. But the deception is practised on another computer, not a human – so is it deception at all? And if the ‘victim’ is a computer, is there harm?
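The two small deceptions mentioned above can be sketched in a few lines. This is a minimal illustration, not from the chapter: the URL, header string and delay are invented for the example, using only Python's standard library.

```python
import time
import urllib.request

# A browser-like User-Agent string: the server sees this header and
# treats the scraper as an ordinary desktop browser (the 'pretending').
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def make_request(url):
    # Attach the browser-like header to the outgoing request
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})

def scrape(urls, delay_seconds=2.0):
    # Pace requests rather than hammering the site (the second 'deception':
    # behaving slowly enough to avoid detection)
    for url in urls:
        req = make_request(url)
        with urllib.request.urlopen(req) as resp:
            yield url, resp.read()
        time.sleep(delay_seconds)
```

Note that the deception here is aimed entirely at another machine: no human ever reads the header.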
The tension here is between the ethics of virtue (“I do not deceive”) and teleological ethics (good or bad impact of actions). A scraper might include a small element of deception, but the act of scraping (as distinct from publishing the resulting information) harms no human. Most journalists can live with that.
The exception is where a scraper makes such excessive demands on a site that it impairs that site’s performance, by requesting so many pages in a short space of time. This not only degrades the experience of the site’s users, but consequently harms the site’s publishers too (in many cases sites will block sources of heavy demand, breaking the scraper anyway).
Although the harm may be justified against a wider ‘public good’, it is unnecessary: a well-designed scraper should not make such excessive demands, nor should it draw attention to itself by doing so. The person writing such a scraper should ensure that it does not run more often than is necessary, and that it paces its requests to spread the demand on the site being scraped. Notably in this regard, ProPublica’s scraping project Upton “helps you be a good citizen [by avoiding] hitting the site you’re scraping with requests that are unnecessary because you’ve already downloaded a certain page” (Merrill, 2013).
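A ‘good citizen’ scraper in the spirit Upton describes might combine throttling with a simple disk cache, so that re-running the scraper never re-requests a page it has already downloaded. This is a sketch under assumed details (the cache directory name and delay are illustrative, and this is not Upton’s own code):

```python
import hashlib
import pathlib
import time
import urllib.request

# Directory where downloaded pages are kept between runs (illustrative name)
CACHE_DIR = pathlib.Path("scrape-cache")

def cache_path(url):
    # One file per URL, named by a hash of the URL
    return CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()

def polite_get(url, delay_seconds=5.0):
    CACHE_DIR.mkdir(exist_ok=True)
    cached = cache_path(url)
    if cached.exists():
        # Already downloaded on a previous run: no new request to the site
        return cached.read_bytes()
    time.sleep(delay_seconds)  # throttle: spread the load on the target site
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    cached.write_bytes(body)   # save so future runs skip this request
    return body
```

The design choice matters ethically as well as technically: the cache removes unnecessary demand on the site, and the delay keeps the remaining demand spread out.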
Attempts to minimise that load can themselves generate ethical concerns. The creator of the seminal data journalism projects chicagocrime.org and Everyblock, Adrian Holovaty, addresses some of these in his series on ‘Sane data updates’ and urges being upfront about
“which parts of the data might be out of date, how often it’s updated, which bits of the data are updated … and any other peculiarities about your process … Any application that repurposes data from another source has an obligation to explain how it gets the data … The more transparent you are about it, the better.” (Holovaty, 2013)
Publishing scraped data in full does raise legal issues around the copyright and database rights surrounding that information. The journalist should decide whether the story can be told accurately without publishing the full data.
Issues raised by scraping also apply to analogous methods using simple email technology, such as the mass generation of Freedom of Information requests. Sending the same FOI request to dozens or hundreds of authorities imposes significant pressure and cost on those public authorities, so the public interest of the question must justify that burden, rather than its value as a story alone. Journalists should also check that the information is not already accessible through other means before embarking on a mass email.
UPDATE [April 12 2016]: Sophie Chou rounds up discussions on the ethical and legal considerations in scraping from a series of presentations at NICAR, including a useful flowchart.
In the next part I look at protection of sources. If you have examples of ethical dilemmas, best practice, or guidance, I’d be happy to include it with an acknowledgement.
These same issues apply to spiders/crawlers for search engines &c.
As a web admin, if a crawler behaves on my site I’ll leave it be – or most likely not notice it.
If it doesn’t behave and causes problems, I will throttle its access or block it outright.
Admittedly, we do research the company behind the crawler and are likely to be kinder to a crawler that may benefit us (e.g. search engine).
I have also been on the other side of this arrangement.
We wanted to index content (ours) from a 3rd party product, which blocked the crawler for our search engine. I have no issue with circumventing this to get our content out.