Tag Archives: excel

Finding Stories in Spreadsheets – ebook now live!

Finding stories in spreadsheets book cover

Cover design by Matt Buck/Drawnalism

My latest ebook – Finding Stories in Spreadsheets – is now live on Leanpub.

As with Scraping for Journalists, I’m publishing the book week-by-week so the book can be updated based on reader feedback, user suggestions and topical developments.

Each week you can download a new chapter covering a different technique for finding stories, from calculating proportions and changes, to combining data, cleaning it up, testing it, and extracting specific details.

There’s also a downloadable spreadsheet at the end of each chapter with a series of exercises to practise that chapter’s technique and find particular stories.

Along the way I tackle some other considerations in telling the story, such as context and background, and the importance of being specific in the language that you use.

If there’s anything you’d like covered in the book let me know. You can also buy the book in a ‘bundle’ with its sister title Data Journalism Heist, which covers quick-turnaround techniques for finding stories in spreadsheets using pivot tables and advanced filters.

My next ebook: Finding Stories in Spreadsheets

Finding Stories In Spreadsheets ebook cover

In a few weeks I will begin publishing my new ebook: Finding Stories In Spreadsheets.

The book has been written in response to requests from journalists who need a book on Excel aimed at storytellers, not accountants.

Finding Stories In Spreadsheets will outline a range of techniques, including ways to find the ‘needle in the haystack’ in text data, number calculations to make stories clearer, and methods of cleaning and combining data to tell new stories, including getting data ready for maps and charts.

The book will be available for a discounted rate for the first couple of weeks. To be informed when it’s available, register your interest on the Leanpub page.

A sample dirty dataset for trying out Google Refine

I’ve created this spreadsheet of ‘dirty data‘ to demonstrate some typical problems that data cleaning tools and techniques can be used for:

  • Subheadings that are only used once (and you need them in each row where they apply)
  • Odd characters that stand for something else (e.g. a space or ampersand)
  • Different entries that mean the same thing, either because they are lacking pieces of information, or have been mistyped, or inconsistently formatted

It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore more simple techniques in spreadsheets like Find and replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and titlecase respectively).

Thanks to Eva Constantaras for suggesting the idea.

Is this an Excel killer? QueryTree app lowers the bar on data journalism

QueryTree

Sometimes the most impressive tools solve a problem you never knew you had. In the case of QueryTree, a new data analysis tool, that problem is something most people never question: spreadsheets.

For all the shiny-shiny copy-and-paste-click-and-drag-ness in new journalism tools, most data digging comes back to at least some simple spreadsheet work, and that represents a significant hurdle for many journalists used to working with simpler tools.

While interface design has undergone generations of improvement on the web, spreadsheet software interfaces have remained largely unchanged for decades.

So why did no one think to do this before?

QueryTree - how the drag and drop interface works

You only need 10 choices

Continue reading

Getting full addresses for data from an FOI response (using APIs)

heatfullcolour11-960x1024

Here’s an example of how APIs can be useful to journalists when they need to combine two sets of data.

I recently spoke to Lincoln investigative journalism student Sean McGrath who had obtained some information via FOI that he needed to combine with other data to answer a question (sorry to be so cryptic).

He had spent 3 days cleaning up the data and manually adding postcodes to it. This seemed a good example where using an API might cut down your work considerably, and so in this post I explain how you make a start on the same problem in less than an hour using Excel, Google Refine and the Google Maps API.

Step 1: Get the data in the right format to work with an API

APIs can do all sorts of things, but one of the things they do which is particularly useful for journalists is answer questions. Continue reading

Data journalism training – some reflections

OpenHeatMap - Percentage increase in fraud crimes in London since 2006_7

I recently spent 2 days teaching the basics of data journalism to trainee journalists on a broadsheet newspaper. It’s a pretty intensive course that follows a path I’ve explored here previously – from finding data and interrogating it to visualizing it and mashing – and I wanted to record the results.

My approach was both practical and conceptual. Conceptually, the trainees need to be able to understand and communicate with people from other disciplines, such as designers putting together an infographic, or programmers, statisticians and researchers.

They need to know what semantic data is, what APIs are, the difference between a database and open data, and what is possible with all of the above.

They need to know what design techniques make a visualisation clear, and the statistical quirks that need to be considered – or looked for.

But they also need to be able to do it.

The importance of editorial drive

The first thing I ask them to do (after a broad introduction) is come up with a journalistic hypothesis they want to test (a process taken from Mark E Hunter’s excellent ebook Story Based Inquiry). My experience is that you learn more about data journalism by tackling a specific problem or question – not just the trainees but, in trying to tackle other people’s problems, me as well.

So one trainee wants to look at the differences between supporters of David and Ed Miliband in that week’s Labour leadership contest. Another wants to look at authorization of armed operations by a police force (the result of an FOI request following up on the Raoul Moat story). A third wants to look at whether ethnic minorities are being laid off more quickly, while others investigate identity fraud, ASBOs and suicides.

Taking those as a starting point, then, I introduce them to some basic computer assisted reporting skills and sources of data. They quickly assemble some relevant datasets – and the context they need to make sense of them.

For the first time I have to use Open Office’s spreadsheet software, which turns out to be not too bad. The data pilot tool is a worthy free alternative to Excel’s pivot tables, allowing journalists to quickly aggregate & interrogate a large dataset.

Formulae like concatenate and ISNA turn out to be particularly useful in cleaning up data or making it compatible with similar datasets.

The ‘Text to columns’ function comes in handy in breaking up full names into title, forename and surname (or addresses into constituent parts), while find and replace helped in removing redundant information.

It’s not long before the journalists raise statistical issues – which is reassuring. The trainee looking into ethnic minority unemployment, for example, finds some large increases – but the numbers in those ethnicities are so small as to undermine the significance.

Scraping the surface of statistics

Still, I put them through an afternoon of statistical training. Notably, not one of them has studied a maths or science-related degree. History, English and Law dominate – and their educational history is pretty uniform. At a time when newsrooms need diversity to adapt to change, this is a little worrying.

But they can tell a mean from a mode, and deal well with percentages, which means we can move on quickly to standard deviations, distribution, statistical significance and regression analysis.

Even so, I feel like we’ve barely scraped the surface – and that there should be ways to make this more relevant in actively finding stories. (Indeed, a fortnight later I come across a great example of using Benford’s law to highlight problems with police reporting of drug-related murder)

One thing I do is ask one trainee to toss a coin 30 times and the others to place bets on the largest number of heads to fall in a row. Most plump for around 4 – but the longest run is 8 heads in a row.

The point I’m making is regarding small sample sizes and clusters. (With eerie coincidence, one of them has a map of Bridgend on her screen, which made the news after a cluster of suicides).

That’s about as engaging as this section got – so if you’ve any ideas for bringing statistical subjects to life and making them relevant to journalists, particularly as a practical tool for spotting stories, I’m all ears.

Visualisation – bringing data to life, quickly

Day 2 is rather more satisfying, as – after an overview of various chart types and their strengths and limitations – the trainees turn their hands to visualization tools – Many Eyes, Wordle, Tableau Public, Open Heat Map, and Mapalist.

Suddenly the data from the previous day comes to life. Fraud crime in London boroughs is shown on a handy heat map. A pie chart, and then bar chart, shows the breakdown of Labour leadership voters; and line graphs bring out new possible leads in suicide data (female suicide rates barely change in 5 years, while male rates fluctuate more).

It turns out that Mapalist – normally used for plotting points on Google Maps from a Google spreadsheet – now also does heat maps based on the density of occurrences. ManyEyes has also added mapping visualizations to its toolkit.

Looking through my Delicious bookmarks I rediscover a postcodes API with a hackable URL to generate CSV or XML files with the lat/long, ward and other data from any postcode (also useful on this front is Matthew Somerville’s project MaPit).

Still a print culture

Notably, the trainees bring up the dominance of print culture. “I can see how this works well online,” says one, “but our newsroom will want to see a print story.”

One of the effects of convergence on news production is that a tool traditionally left to designers after the journalist has finished their role in the production line is now used by the journalist as part of their newsgathering role – visualizing data to see the story within it, and possibly publishing that online to involve users in that process too.

A print news story – in this instance – may result from the visualization process, rather than the other way around.

More broadly, it’s another symptom of how news production is moving from a linear process involving division of labour to a flatter, more overlapping organization of processes and roles – which involves people outside of the organization as well as those within.

Mashups

The final session covers mashups. This is an opportunity to explore the broader possibilities of the technology, how APIs and semantic data fit in, and some basic tools and tutorials.

Clearly, a well-produced mashup requires more than half a day and a broader skillset than exists in journalists alone. But by using tools like Mapalist the trainees have actually already created a mashup. Again, like visualization, there is a sliding scale between quick and rough approaches to find stories and communicate them – and larger efforts that require a bigger investment of time and skill.

As the trainees are already engrossed in their own projects, I don’t distract them too much from that course.

You can see what some of the trainees produced at the links below:

Matt Holehouse:

Many Eyes _ Rate of deaths in industrial accidents in the EU (per 100k)

Rate of deaths in industrial accidents in the EU (per 100k)

Raf Sanchez:

Rosie Ensor

  • Places with the highest rates for ASBOs

Sarah Rainey