Cross-posted from Data Driven Journalism.
Earlier this year I set out to tackle a problem that was bothering me: journalists who had started to learn programming were giving up.
They were hitting a wall. In trying to learn the more advanced programming techniques – particularly those involved in scraping – they seemed to fall into one of two camps:
- People who learned programming, but were taking far too long to apply it, and so losing momentum – the generalists
- People who learned how to write one scraper, but could not extend it to others, and so became frustrated – the specialists
In setting out to figure out what was going wrong, I set myself a task which I have found helpful in taking a fresh perspective on an issue: I started writing a book chapter.
The nice thing about writing books is that they force you to put together a coherent and complete narrative about an entire process. You identify gaps that you weren’t otherwise aware of, and you have to put yourself in the place of someone with no knowledge at all. You take nothing for granted.
So my starting point was this: what is a good way to learn how to write scrapers?
That’s a different question to ‘How do I write a scraper?’ and also to ‘How do I learn programming?’ And that’s important. Because most of the resources available fell into one of those two camps.
The people trying to learn programming were hitting a common problem in learning: lack of feedback. They might be able to change a variable in Ruby, but how would that help in journalism? It was like learning the structure of the entire French language just so they could go to the corner shop and ask for a loaf of bread.
The people learning how to write one scraper were hitting another common problem: learning how to do one task well, rather than the underlying principles. This was like someone learning how to ask for a loaf of bread in French, but not being able to extend that knowledge into asking for directions home.
I tackled both by beginning the chapter with probably the simplest scraper you can write: a spreadsheet formula in Google Docs. This provided the instant feedback that the generalists lacked, but the formula was also used to introduce some key concepts in programming: functions, strings, indexes, and parameters. These would provide key principles that the specialists lacked, and which future chapters could build on.
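For readers who haven't seen one, a spreadsheet scraper of this kind typically uses Google's `ImportHTML` function. The URL and table index below are illustrative placeholders, not the book's actual example:

```
=ImportHTML("http://example.com/data-page.html", "table", 1)
```

One line touches all four concepts: `ImportHTML` is the function, the quoted URL and `"table"` are strings passed as parameters, and `1` is the index telling the function to pull in the first table on the page.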
I also looked at how journalists tried to learn programming, and how programmers developed, and realised something else: journalists and programmers learned differently.
I’m generalising wildly, of course, but journalists – particularly student journalists – often try to learn programming from books. That may sound like common sense, but it isn’t the way you learn an art or a science – and programming is both.
Programmers – if I’m to generalise wildly again – typically combine books (which they don’t read cover to cover) with documentation, adapting other code, trial and error, and each other. When they teach journalists, they often don’t realise that journalists don’t always share that culture.
And journalists – coming traditionally from a background in the humanities – are used to learning from books: static knowledge. Teaching programming to journalists then, I realised, would also mean teaching how programmers learn.
So my chapter introducing that first scraper introduced some other key concepts as well. It would direct readers to the documentation on the function being used, and invite them to engage in some trial and error to work out a solution to a problem. As more scraper tutorials were added, they introduced more key concepts in programming – importantly, without having to learn an entirely new language, and with documentation and trial and error running throughout, along with the principle of adapting other code.
I tested the approach at the News:Rewired conference. Can you teach scraping in 20 minutes? At a basic level, yes: it seemed you could.
After 20,000 words I realised that my book chapter was turning into a book. Meanwhile, a colleague had told me about Leanpub: a website that allowed people to publish books as they were being written, with readers able to download new updates as they came.
The platform suited the book perfectly: it meant I could stagger the publication of the book, Codecademy-style, with readers trying at least one scraper per week, but also having the time to experiment with trial and error before the next chapter was published. It meant that I could respond to feedback on the earlier chapters and adapt the rest of the book before it was published (in one case a Brazilian reader pointed out, after the first chapter was published, that the Portuguese-language version of Google Docs uses semicolons instead of commas). If examples used in the book changed, I could replace them. And it meant that if new tools or techniques emerged, I could incorporate them.
It is a programming-style approach to publishing – trial and error – which very much suits the spirit of the book. It’s extra work, but it makes for a much better writing experience. I hope the readers think so too.
Scraping for Journalists is available at Leanpub.com/ScrapingForJournalists