Tag Archives: Scraping for Journalists

Cross-post: Why I started self-publishing

The following was written for three:d, the newsletter of MeCCSA, the Media, Communication and Cultural Studies Association (PDF, page 9).

Something has happened to self-publishing over the past few years. No longer the last resort for local historians and wannabe poets, it is now a sign of entrepreneurial spirit, an alternative to the limitations of attention-starved journalism, and a way of kicking against the pricks of mainstream publishing. Self-published books have almost tripled in number over the last five years, with a number of authors making the bestseller lists. More than one in ten ebooks bought by UK readers is now self-published.

This year I finally joined that group, as I made a long-planned move away from writing for traditional publishers towards publishing my own ebooks. In fact, I published three. So what’s the appeal?

Scraping using regular expressions in OutWit Hub – part 2: special characters, negative matches and more


Image by Lasse Havelund

In the second part of this extract from Chapter 10 of Scraping for Journalists I recap the basics before discussing techniques to use in looking for patterns in data, and how regex can deal with non-textual characters such as spaces and carriage returns, special characters such as backslashes, and ‘negative matches’. You can find the first part here.
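As a rough illustration of the three ideas mentioned above (whitespace characters, escaped special characters and negative matches), here is a minimal Python sketch. It is not taken from the book, whose extract works in OutWit Hub; the sample text and variable names are invented for illustration only.

```python
import re

# Invented sample text: a tab, a Windows-style line ending and a file path
sample = "Date:\t12 March\r\nFile: C:\\data\\donors.csv\nNote: <b>dinner</b>"

# \s matches any whitespace character (space, tab, carriage return, newline),
# so \s+ can collapse a run of them into a single space
flattened = re.sub(r"\s+", " ", sample)

# A backslash is a special character in regex, so a literal one is written \\
backslashes = re.findall(r"\\", sample)

# [^<]+ is a negative match: one or more characters that are NOT "<"
bold_text = re.findall(r"<b>([^<]+)</b>", sample)

print(flattened)
print(len(backslashes))
print(bold_text)
```

The negated class `[^<]+` is often a safer choice than `.+` when scraping, because it stops at the next tag instead of running on to the last one in the line.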


How-to: Scraping ugly HTML using ‘regular expressions’ in an OutWit Hub scraper


Regular Expressions cartoon from xkcd

The following is the first part of an extract from Chapter 10 of Scraping for Journalists. It introduces a particularly useful tool in scraping – regex – which uses ‘regular expressions’ to look for patterns such as specific words, prefixes or particular types of code. I hope you find it useful.

This tutorial will show you how to scrape a particularly badly formatted piece of data: in this case, the UK Labour Party’s publication of meetings and dinners with donors and trade union general secretaries.

To do this, you’ll need to install the free scraping tool OutWit Hub. Regex can be used in other tools and programming as well, but this tool is a good way to learn it without knowing any other programming.
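For readers curious about how the same idea looks in a programming language, here is a minimal Python sketch. It assumes a made-up fragment of untidy HTML (not the Labour Party page itself), and the names `html`, `pattern` and `rows` are purely illustrative. It shows a regular expression pulling a date and a description out of markup that has no helpful structure.

```python
import re

# Hypothetical, invented markup standing in for a badly formatted page
html = (
    "<p><b>12 March</b><br>Dinner with A. Donor</p>"
    "<p><b>3 April</b><br>Meeting with B. Secretary</p>"
)

# Each pair of brackets captures one field; .*? grabs as little text
# as possible before the markup that follows it
pattern = re.compile(r"<b>(.*?)</b><br>(.*?)</p>")

# findall returns one (date, description) tuple per match
rows = pattern.findall(html)

for date, description in rows:
    print(date, "|", description)
```

Regex suits this kind of one-off clean-up; for pages with more regular structure, a dedicated HTML parser is usually the more robust choice.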

Scraping for Journalists – ebook out now


My ebook Scraping for Journalists: How to grab data from hundreds of sources, put it in a form you can interrogate – and still hit deadlines is now live.

You can buy it from Leanpub here. Leanpub allows authors to publish in installments, so you get an alert every time new content is added and can update your version. This means I can adapt and improve the book based on feedback from the people who use it. In other words, it’s agile publishing, which makes for a better book. (Also, I can publish at a Codecademy-like weekly pace, which suits learning particularly well.)

There’s also a Facebook page and a support blog for the book, where you can comment.

Meanwhile, here’s a presentation I did at News:Rewired last week which covers some of the ground from the book:

Online Journalism Blog

Comment, analysis and links covering online journalism and online news, citizen journalism, blogging, vlogging, photoblogging, podcasts, vodcasts, interactive storytelling, publishing, Computer Assisted Reporting, User Generated Content, searching and all things internet.