Mass data gathering – scraping, FOI, deception and harm
The data journalism practice of ‘scraping’ – getting a computer to capture information from online sources – raises some ethical issues around deception and minimisation of harm. Some scrapers, for example, ‘pretend’ to be a particular web browser, or pace their scraping activity more slowly to avoid detection. But the deception is practised on another computer, not a human – so is it deception at all? And if the ‘victim’ is a computer, is there harm?
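Both tactics – identifying as a browser and pacing requests – are simple to sketch in Python using only the standard library. The User-Agent string and two-second delay below are illustrative assumptions, not recommendations:

```python
# Sketch: a scraper that identifies as a browser and paces its requests.
import time
import urllib.request

# An illustrative browser-like User-Agent string (assumption, not a standard)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
DELAY_SECONDS = 2  # pause between requests to avoid hammering the server


def build_request(url):
    """Attach a browser-like User-Agent header, so the server sees a 'browser'."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls):
    """Fetch each URL in turn, sleeping between requests to pace the scrape."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(DELAY_SECONDS)  # slow down: politeness / detection-avoidance
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

Whether you see that User-Agent header as a lie told to a machine, or to the people who run it, is exactly the ethical question above.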
Last night I published the final chapter of my first ebook: Scraping for Journalists. Since I started publishing it in July, over 40 ‘versions’ of the book have been uploaded to Leanpub, a platform that allows users to receive updates as a book develops – but more importantly, to feed into its development.
I’ve been amazed at the consistent interest in the book – last week it passed 500 readers, 400 more than I ever expected to download it. Their comments have directly shaped, and in some cases been reproduced in, the book – something I expect to continue as I keep updating it.
As a result I’ve become a huge fan of this form of ebook publishing, and plan to do a lot more with it (some hints here and here). The format combines the best qualities of traditional book publishing with those of blogging and social media (there’s a Facebook page too).
Meanwhile, there’s still more to do with Scraping for Journalists: publishing to other platforms and in other languages for starters… If you’re interested in translating the book into another language, please get in touch.
Sid Ryan wanted to see if planning applications near planning committee members were more or less likely to be accepted. In two guest posts on Help Me Investigate he shows how to research people online (in this case the councillors), and how to map planning applications to identify potential relationships.
The posts take in a range of techniques including:
Scraping using Scraperwiki and the Google Drive spreadsheet function importXML
Mapping in Google Fusion Tables
Registers of interests
Using advanced search techniques
Using Land Registry enquiries
Using Companies House and Duedil
Other ways to find information on individuals, such as Hansard, LinkedIn, 192.com, Lexis Nexis, whois and FriendsReunited
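The XPath lookup that the spreadsheet function importXML performs – something like `=IMPORTXML(url, "//application/ref")` – can be sketched in Python with the standard library. The planning-application XML below is invented purely for illustration:

```python
# Sketch: the kind of XPath extraction importXML does, using Python's stdlib.
import xml.etree.ElementTree as ET

# Invented example data: two planning applications
xml = """
<applications>
  <application><ref>2012/0001</ref><status>Approved</status></application>
  <application><ref>2012/0002</ref><status>Refused</status></application>
</applications>
"""

root = ET.fromstring(xml)
# Pull every application reference, as =IMPORTXML(url, "//application/ref") would
refs = [el.text for el in root.findall(".//application/ref")]
print(refs)  # ['2012/0001', '2012/0002']
```

The spreadsheet version has the advantage of refreshing automatically; the scripted version gives you more control over what happens to the data next.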
If you find it useful, please let me know – and if you can add anything… please do.
The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex, short for ‘regular expressions’ – which is designed to match patterns such as specific words, prefixes or particular types of code. I hope you find it useful.
To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming.
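To give a flavour of what regex does outside OutWit Hub, here is a minimal Python sketch – the HTML snippet and the (deliberately simplified) email pattern are invented for illustration:

```python
# Sketch: using a regular expression to pull a specific pattern from scraped text.
import re

html = '<p>Contact: press@example.com or call the newsdesk</p>'

# A simplified pattern for anything that looks like an email address
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
print(emails)  # ['press@example.com']
```

The same idea – describe the *shape* of what you want, rather than the exact text – is what makes regex so useful when scraping pages whose content changes but whose patterns don’t.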
Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.
Although I’ve written about scraping before on the blog, this book is designed to take the reader step by step through a series of tasks (a chapter each) which build a gradual understanding of the principles and techniques for tackling scraping problems. Everything has a direct application for journalism, and each principle is related to its use in scraping for newsgathering.
For example: the first scraper requires no programming knowledge, and can be working within five minutes of reading.
I’m using Leanpub for this ebook because it allows you to publish in instalments and update the book for users – which suits a book like this perfectly, as I’ll be publishing chapters week by week, Codecademy-style.