Mass data gathering – scraping, FOI, deception and harm
The data journalism practice of ‘scraping’ – getting a computer to capture information from online sources – raises some ethical issues around deception and minimisation of harm. Some scrapers, for example, ‘pretend’ to be a particular web browser, or pace their scraping activity more slowly to avoid detection. But the deception is practised on another computer, not a human – so is it deception at all? And if the ‘victim’ is a computer, is there harm?
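Both tactics – identifying as a browser and pacing requests – are simple to sketch in Python using only the standard library. The User-Agent string and two-second delay below are illustrative assumptions, not recommendations:

```python
# Sketch: a scraper that identifies as a browser and paces its requests.
import time
import urllib.request

# An illustrative browser-like User-Agent string (assumption, not a standard)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
DELAY_SECONDS = 2  # pause between requests to avoid hammering the server


def build_request(url):
    """Attach a browser-like User-Agent header, so the server sees a 'browser'."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls):
    """Fetch each URL in turn, sleeping between requests to pace the scrape."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(DELAY_SECONDS)  # slow down: politeness / detection-avoidance
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

Whether you see that User-Agent header as a lie told to a machine, or to the people who run it, is exactly the ethical question above.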
Last night I published the final chapter of my first ebook: Scraping for Journalists. Since I started publishing it in July, over 40 ‘versions’ of the book have been uploaded to Leanpub, a platform that allows users to receive updates as a book develops – but more importantly, to feed into its development.
I’ve been amazed at the consistent interest in the book – last week it passed 500 readers, 400 more than I ever expected to download it. Their comments have directly shaped, and in some cases been reproduced in, the book – something I expect to continue as I keep updating it.
As a result I’ve become a huge fan of this form of ebook publishing, and plan to do a lot more with it (some hints here and here). The format combines the best qualities of traditional book publishing with those of blogging and social media (there’s a Facebook page too).
Meanwhile, there’s still more to do with Scraping for Journalists: publishing to other platforms and in other languages for starters… If you’re interested in translating the book into another language, please get in touch.
Sid Ryan wanted to see if planning applications near planning committee members were more or less likely to be accepted. In two guest posts on Help Me Investigate he shows how to research people online (in this case the councillors), and how to map planning applications to identify potential relationships.
The posts take in a range of techniques including:
Scraping using Scraperwiki and the Google Drive spreadsheet function importXML
Mapping in Google Fusion Tables
Registers of interests
Using advanced search techniques
Using Land Registry enquiries
Using Companies House and Duedil
Other ways to find information on individuals, such as Hansard, LinkedIn, 192.com, Lexis Nexis, whois and FriendsReunited
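The XPath lookup that the spreadsheet function importXML performs – something like `=IMPORTXML(url, "//application/ref")` – can be sketched in Python with the standard library. The planning-application XML below is invented purely for illustration:

```python
# Sketch: the kind of XPath extraction importXML does, using Python's stdlib.
import xml.etree.ElementTree as ET

# Invented example data: two planning applications
xml = """
<applications>
  <application><ref>2012/0001</ref><status>Approved</status></application>
  <application><ref>2012/0002</ref><status>Refused</status></application>
</applications>
"""

root = ET.fromstring(xml)
# Pull every application reference, as =IMPORTXML(url, "//application/ref") would
refs = [el.text for el in root.findall(".//application/ref")]
print(refs)  # ['2012/0001', '2012/0002']
```

The spreadsheet version has the advantage of refreshing automatically; the scripted version gives you more control over what happens to the data next.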
If you find it useful, please let me know – and if you can add anything… please do.
The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex, short for ‘regular expressions’ – which is designed to match patterns such as specific words, prefixes or particular types of code. I hope you find it useful.
To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming.
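To give a flavour of what regex does outside OutWit Hub, here is a minimal Python sketch – the HTML snippet and the (deliberately simplified) email pattern are invented for illustration:

```python
# Sketch: using a regular expression to pull a specific pattern from scraped text.
import re

html = '<p>Contact: press@example.com or call the newsdesk</p>'

# A simplified pattern for anything that looks like an email address
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
print(emails)  # ['press@example.com']
```

The same idea – describe the *shape* of what you want, rather than the exact text – is what makes regex so useful when scraping pages whose content changes but whose patterns don’t.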
Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.
Although I’ve written about scraping before on the blog, this book is designed to take the reader step by step through a series of tasks (a chapter each) which build a gradual understanding of the principles and techniques for tackling scraping problems. Everything has a direct application for journalism, and each principle is related to its use in scraping for newsgathering.
For example: the first scraper requires no programming knowledge, and can be working within five minutes of reading.
I’m using Leanpub for this ebook because it allows you to publish in instalments and update the book for users – which suits a book like this perfectly, as I’ll be publishing chapters week by week, Codecademy-style.