This is the final part of a series of blog posts. The first explains how using feeds and social bookmarking can make for a quicker data journalism workflow. The second looks at how to anticipate and prevent problems, and how collaboration can improve data work.
Workflow tip 5. Think like a computer
The final workflow tip is all about efficiency. Computers deal with processes in a logical way, and good programming is often about completing processes in the simplest way possible.
If you have any tasks that are repetitive, break them down and work out what patterns might allow you to do them more quickly – or for a computer to do them.
The website IFTTT makes it particularly easy to automate such processes. The acronym stands for ‘IF This Then That’: the service lets you select triggers (such as a new email from someone, or a particular time of day) and subsequent actions (such as adding a new row with those details to a Google spreadsheet, or publishing a particular tweet).
You can adopt the same approach anywhere. For example, if you’re cleaning some data, what’s the pattern in that process? Is there a way to automate it? Or is it a problem someone has already solved? Search for the problem and see if there are tools that already tackle it. For example, there is a tool called Mr People which will clean up names in a dataset.
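As a tiny illustration, suppose the same person appears in a dataset under inconsistent capitalisation and spacing. The pattern in the cleaning task – collapse the whitespace, normalise the case – is simple enough to hand to a computer (the names below are invented for the example, and the rule is deliberately crude; tools like Mr People handle trickier cases):

```python
# Invented example: the same person recorded three different ways.
raw_names = ["  john SMITH", "John Smith ", "JOHN  SMITH"]

def clean(name: str) -> str:
    # Collapse runs of whitespace, trim the ends, and normalise capitalisation.
    # (A crude rule: it would mangle names like "McDonald".)
    return " ".join(name.split()).title()

cleaned = {clean(n) for n in raw_names}
print(cleaned)  # {'John Smith'}
```

Three messy variants collapse to one canonical name – and the same function will clean a column of thousands of names as easily as three.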
Look for patterns in data to help you work with them. Codes might always begin with the same characters, or be the same length – that can help you sort them, clean them or understand them. Dates have a particular pattern; so do phone numbers and postcodes, web links and email addresses. Use that to your advantage.
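A few regular expressions are often enough to test whether values fit the patterns you expect. The patterns below are simplified sketches rather than complete validators (real UK postcode rules, for instance, are stricter), and the sample values are made up:

```python
import re

# Simplified sketches of common patterns - not complete validators.
DATE = re.compile(r"^\d{2}/\d{2}/\d{4}$")                     # e.g. 01/09/2023
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")  # e.g. B4 7ET
MONEY = re.compile(r"^£[\d,]+(?:\.\d{2})?$")                  # e.g. £4,250.00

for value in ["01/09/2023", "B4 7ET", "£4,250", "not-a-date"]:
    matched = [name for name, pattern in
               [("date", DATE), ("postcode", POSTCODE), ("money", MONEY)]
               if pattern.match(value)]
    print(value, "->", matched or ["no known pattern"])
```

Flagging values that match no known pattern is also a quick way to find the dirty rows in a dataset.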
Scrapers (programs that comb webpages for data and store it in a form which allows you to ask questions of it) take advantage of exactly the same patterns. If you see a series of webpages which display information in the same way, it should be possible to write a scraper which looks for that pattern and stores the information within it.
That pattern can range from the obvious structure of a webpage table, to a series of job listings that always have the job title in bold, and then the location in italics, and so on. In PDFs the pattern might range from the position of the information (always in the same place on a page) to the way it is expressed (dates are always three pairs of digits separated by slashes; money is always a series of numbers preceded by a “£”; a relevant sentence might always contain a key phrase such as “stop and search” or “special educational needs”).
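A minimal sketch of that job-listings idea, using Python’s standard-library HTML parser on a made-up page where each title sits in a bold tag and each location in an italic one:

```python
from html.parser import HTMLParser

# Invented page fragment: title in <b>, location in <i> - the repeated
# pattern is what makes it scrapable.
page = """
<p><b>Data reporter</b> <i>Birmingham</i></p>
<p><b>Graphics editor</b> <i>London</i></p>
"""

class ListingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag = None
        self.rows = []  # one [title, location] pair per listing

    def handle_starttag(self, tag, attrs):
        self.tag = tag
        if tag == "b":          # a bold tag starts a new listing
            self.rows.append(["", ""])

    def handle_data(self, data):
        if self.tag == "b":
            self.rows[-1][0] = data
        elif self.tag == "i":
            self.rows[-1][1] = data

    def handle_endtag(self, tag):
        self.tag = None

parser = ListingParser()
parser.feed(page)
print(parser.rows)  # [['Data reporter', 'Birmingham'], ['Graphics editor', 'London']]
```

In practice you would feed the parser HTML fetched from the live site, and a library like BeautifulSoup would make the extraction shorter still – but the principle is the same: find the repeating structure, then pull out whatever sits inside it.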
If you are scraping a series of pages the structure might be in the web addresses too: in one case, detailed in my book Scraping for Journalists, I was able to scrape data on free school meals in Scottish schools because I noticed that the hundreds of webpages they were published on had exactly the same URL apart from a six-digit ID code at the end. Find a list of those codes and you can cycle through them, adding each to the base URL and scraping the information on the resulting page.
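That cycling step can be sketched in a few lines. The base URL and ID codes below are invented for illustration, and the fetch-and-parse step is left as a comment:

```python
# Hypothetical base URL and ID codes, for illustration only.
BASE_URL = "https://example.gov.uk/school-meals?id="
id_codes = ["100234", "100235", "100301"]  # e.g. gathered from an index page

for code in id_codes:
    url = BASE_URL + code
    print(url)
    # In a real scraper you would now fetch and parse the page, e.g.:
    # html = urllib.request.urlopen(url).read()
    # ...then extract the figures you need from html...
```

The list of codes itself often comes from scraping an index page first – one pattern feeding another.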
Finally, remember that data often contains dozens of possible leads and stories, so be systematic and focused, and don’t get distracted.
Having flow charts and lists helps me in this regard. For example, my ‘Inverted pyramid of data journalism’ lists the processes involved in any data story: I need to compile the data, clean it, put it into context, combine it with other data, and then communicate the results.
The second part of the pyramid – on communicating the story – involves six possibilities.
The first three are traditional ways of telling stories in print and broadcast journalism. But the others are new: we can now create personalised stories (“How does the budget affect you?”); social stories (“Share your position in the world’s population with your friends!”; “Help us find out what expenses your MP claimed”); and stories-as-tools (“Click here to generate an FOI request to find out more about this spending”).
Another flow chart I created maps out the possible routes to compiling data: this ensures I don’t miss any options in getting hold of the data I need – or writing a story about the lack of data itself.
Finally, a ‘chart chooser’ by Andrew Abela details the different types of story you might want to tell with data – and the relevant charts to go for. A story about composition (“50% of charity money is going on marketing”) is likely to be told with a pie chart or treemap. If it’s over time (“Marketing spend has risen from 25% to 50% over the last three years”) then a stacked area chart is a good choice.
Other types of stories might be about comparison (“Ambulance service staff have a much higher sick rate than other parts of the health service”), relationships (“Young black men are much more likely to be stopped and searched”) or distribution (“These GP surgeries have prescription patterns which stick out as unusual compared with where most others lie”).
These models, flow charts and heuristics help you do jobs quickly. We often internalise them as we go about our jobs – but some fields, such as medicine, make them explicit to avoid missing key steps. Making them visible also helps you identify blind spots and think critically about your own processes.
Ultimately, that’s key – because we haven’t yet worked out what best practice is. Having good workflow habits to begin with makes it much easier to find out.
If you have any other tips about good data journalism workflows, please let us know in the comments.