Tag Archives: excel

My next ebook: Finding Stories in Spreadsheets

Finding Stories In Spreadsheets ebook cover

In a few weeks I will begin publishing my new ebook: Finding Stories In Spreadsheets.

The book has been written in response to requests from journalists who need a book on Excel aimed at storytellers, not accountants.

Finding Stories In Spreadsheets will outline a range of techniques, including ways to find the ‘needle in the haystack’ in text data, number calculations to make stories clearer, and methods of cleaning and combining data to tell new stories, including getting data ready for maps and charts.

The book will be available for a discounted rate for the first couple of weeks. To be informed when it’s available, register your interest on the Leanpub page.

A sample dirty dataset for trying out Google Refine

I’ve created this spreadsheet of ‘dirty data’ to demonstrate some typical problems that data cleaning tools and techniques can be used for:

  • Subheadings that are only used once (and you need them in each row where they apply)
  • Odd characters that stand for something else (e.g. a space or ampersand)
  • Different entries that mean the same thing, either because they are lacking pieces of information, or have been mistyped, or inconsistently formatted

It’s best used alongside this post introducing basic features of Google Refine. But you can also use it to explore simpler techniques in spreadsheets like Find and Replace; the TRIM function (and alternative solutions); and the functions UPPER, LOWER, and PROPER (which convert text into all upper case, lower case, and title case respectively).
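For readers more comfortable in code than in a spreadsheet, the same cleaning functions have rough Python equivalents. This is just an illustrative sketch – the sample text is invented – but it shows what TRIM, UPPER, LOWER and PROPER actually do to a value:

```python
# Rough Python equivalents of the spreadsheet cleaning functions
# TRIM, UPPER, LOWER and PROPER. The sample text is made up.

def trim(text):
    """Like spreadsheet TRIM: strip leading/trailing spaces and
    collapse runs of internal spaces to a single space."""
    return " ".join(text.split())

dirty = "  birmingham   city  COUNCIL "
print(trim(dirty))          # stray spaces collapsed
print(trim(dirty).upper())  # like UPPER
print(trim(dirty).lower())  # like LOWER
print(trim(dirty).title())  # like PROPER (title case)
```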

Thanks to Eva Constantaras for suggesting the idea.

Is this an Excel killer? QueryTree app lowers the bar on data journalism

Sometimes the most impressive tools solve a problem you never knew you had. In the case of QueryTree, a new data analysis tool, that problem is something most people never question: spreadsheets.

For all the shiny-shiny copy-and-paste-click-and-drag-ness in new journalism tools, most data digging comes back to at least some simple spreadsheet work, and that represents a significant hurdle for many journalists used to working with simpler tools.

While interface design has undergone generations of improvement on the web, spreadsheet software interfaces have remained largely unchanged for decades.

So why did no one think to do this before?

QueryTree - how the drag and drop interface works

You only need 10 choices

Getting full addresses for data from an FOI response (using APIs)

Here’s an example of how APIs can be useful to journalists when they need to combine two sets of data.

I recently spoke to Lincoln investigative journalism student Sean McGrath who had obtained some information via FOI that he needed to combine with other data to answer a question (sorry to be so cryptic).

He had spent three days cleaning up the data and manually adding postcodes to it. This seemed a good example of where using an API might cut down your work considerably, so in this post I explain how to make a start on the same problem in less than an hour using Excel, Google Refine and the Google Maps API.

Step 1: Get the data in the right format to work with an API

APIs can do all sorts of things, but one of the things they do which is particularly useful for journalists is answer questions.

Data journalism training – some reflections

OpenHeatMap - Percentage increase in fraud crimes in London since 2006/7

I recently spent 2 days teaching the basics of data journalism to trainee journalists on a broadsheet newspaper. It’s a pretty intensive course that follows a path I’ve explored here previously – from finding data and interrogating it to visualizing it and mashing – and I wanted to record the results.

My approach was both practical and conceptual. Conceptually, the trainees need to be able to understand and communicate with people from other disciplines, such as designers putting together an infographic, or programmers, statisticians and researchers.

They need to know what semantic data is, what APIs are, the difference between a database and open data, and what is possible with all of the above.

They need to know what design techniques make a visualisation clear, and the statistical quirks that need to be considered – or looked for.

But they also need to be able to do it.

The importance of editorial drive

The first thing I ask them to do (after a broad introduction) is come up with a journalistic hypothesis they want to test (a process taken from Mark E Hunter’s excellent ebook Story Based Inquiry). My experience is that you learn more about data journalism by tackling a specific problem or question – not just the trainees but, in trying to tackle other people’s problems, me as well.

So one trainee wants to look at the differences between supporters of David and Ed Miliband in that week’s Labour leadership contest. Another wants to look at authorization of armed operations by a police force (the result of an FOI request following up on the Raoul Moat story). A third wants to look at whether ethnic minorities are being laid off more quickly, while others investigate identity fraud, ASBOs and suicides.

Taking those as a starting point, then, I introduce them to some basic computer assisted reporting skills and sources of data. They quickly assemble some relevant datasets – and the context they need to make sense of them.

For the first time I have to use OpenOffice’s spreadsheet software, which turns out to be not too bad. The DataPilot tool is a worthy free alternative to Excel’s pivot tables, allowing journalists to quickly aggregate and interrogate a large dataset.

Functions like CONCATENATE and ISNA turn out to be particularly useful in cleaning up data or making it compatible with similar datasets.

The ‘Text to columns’ function comes in handy in breaking up full names into title, forename and surname (or addresses into constituent parts), while find and replace helped in removing redundant information.
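The same steps translate directly into code. Here is a hedged Python sketch of splitting full names into parts (like Text to columns), joining fields back together (like CONCATENATE), and flagging failed lookups against a second dataset (the job ISNA does after a VLOOKUP). The names and postcode table are invented for illustration:

```python
# A sketch of the cleaning steps described above, in Python rather
# than a spreadsheet. Sample names and postcodes are made up.

full_names = ["Mr John Smith", "Ms Jane Doe"]

# 'Text to columns': split each full name into title, forename, surname
split_names = [name.split(" ", 2) for name in full_names]
print(split_names)

# CONCATENATE: join fields back together, e.g. to build a matching key
keys = [f"{fore} {sur}" for _, fore, sur in split_names]

# ISNA-style check: flag rows where a lookup in another dataset fails
postcodes = {"John Smith": "B42 2SU"}  # hypothetical second dataset
for key in keys:
    print(key, postcodes.get(key, "NO MATCH"))
```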

It’s not long before the journalists raise statistical issues – which is reassuring. The trainee looking into ethnic minority unemployment, for example, finds some large increases – but the numbers in those ethnicities are so small as to undermine the significance.

Scraping the surface of statistics

Still, I put them through an afternoon of statistical training. Notably, not one of them has studied for a maths or science-related degree. History, English and Law dominate – and their educational history is pretty uniform. At a time when newsrooms need diversity to adapt to change, this is a little worrying.

But they can tell a mean from a mode, and deal well with percentages, which means we can move on quickly to standard deviations, distribution, statistical significance and regression analysis.
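For anyone wanting to check their own grasp of those terms, the basic measures can all be computed with Python’s standard statistics module; the sample values below are invented:

```python
# The summary statistics mentioned above, computed with Python's
# standard statistics module on a made-up sample of values.
import statistics

values = [2, 3, 3, 5, 7, 10]

print(statistics.mean(values))    # arithmetic mean
print(statistics.median(values))  # middle value
print(statistics.mode(values))    # most common value
print(statistics.stdev(values))   # sample standard deviation
```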

Even so, I feel like we’ve barely scraped the surface – and that there should be ways to make this more relevant in actively finding stories. (Indeed, a fortnight later I come across a great example of using Benford’s law to highlight problems with police reporting of drug-related murder.)
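The Benford’s law check is simple enough to sketch: count the first digits of a set of figures and compare their share against the distribution Benford’s law predicts. The figures below are invented for illustration:

```python
# A minimal sketch of a Benford's law check: compare the first digits
# of a set of figures against the expected Benford distribution.
# The sample figures are invented.
import math
from collections import Counter

def first_digit_counts(figures):
    return Counter(str(abs(n))[0] for n in figures if n != 0)

figures = [123, 189, 204, 1450, 167, 2900, 31, 118, 95, 1023]
observed = first_digit_counts(figures)

for d in "123456789":
    expected = math.log10(1 + 1 / int(d))  # Benford's expected proportion
    share = observed.get(d, 0) / len(figures)
    print(d, round(expected, 3), round(share, 3))
```

A large, unexplained gap between the observed shares and the expected proportions is the signal that figures may have been made up or misreported.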

One thing I do is ask one trainee to toss a coin 30 times and the others to place bets on the largest number of heads to fall in a row. Most plump for around 4 – but the longest run is 8 heads in a row.

The point I’m making is about small sample sizes and clusters. (By eerie coincidence, one of them has a map on her screen of Bridgend, which made the news after a cluster of suicides.)
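The coin-toss exercise is easy to simulate, which makes the point about clusters without needing a coin. A quick sketch: repeat the 30-toss experiment many times and record the longest run of heads in each trial:

```python
# Simulating the coin-toss exercise above: toss a coin 30 times, many
# times over, and record the longest run of heads in each trial.
import random

def longest_run(tosses):
    best = current = 0
    for t in tosses:
        current = current + 1 if t == "H" else 0
        best = max(best, current)
    return best

random.seed(1)  # fixed seed so the simulation is repeatable
runs = [longest_run(random.choice("HT") for _ in range(30))
        for _ in range(1000)]
print(sum(runs) / len(runs))  # typically somewhere around 4 to 5
```

Runs of seven or eight heads turn up regularly across the trials – apparent ‘clusters’ produced by pure chance.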

That’s about as engaging as this section got – so if you’ve any ideas for bringing statistical subjects to life and making them relevant to journalists, particularly as a practical tool for spotting stories, I’m all ears.

Visualisation – bringing data to life, quickly

Day 2 is rather more satisfying, as – after an overview of various chart types and their strengths and limitations – the trainees turn their hands to visualization tools – Many Eyes, Wordle, Tableau Public, Open Heat Map, and Mapalist.

Suddenly the data from the previous day comes to life. Fraud crime in London boroughs is shown on a handy heat map. A pie chart, and then a bar chart, shows the breakdown of Labour leadership voters; and line graphs bring out new possible leads in suicide data (female suicide rates barely change in 5 years, while male rates fluctuate more).

It turns out that Mapalist – normally used for plotting points on Google Maps from a Google spreadsheet – now also does heat maps based on the density of occurrences. ManyEyes has also added mapping visualizations to its toolkit.

Looking through my Delicious bookmarks I rediscover a postcodes API with a hackable URL to generate CSV or XML files with the lat/long, ward and other data from any postcode (also useful on this front is Matthew Somerville’s project MaPit).
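The idea of a ‘hackable’ URL is simply that the postcode can be substituted straight into the address, and the response parsed as CSV. A sketch of the pattern – the URL and sample response below are hypothetical, not the real API:

```python
# A sketch of how a 'hackable' postcode API URL works: substitute the
# postcode into a URL pattern, then parse the CSV that comes back.
# The URL pattern and sample response here are hypothetical.
import csv
import io

URL_PATTERN = "http://example-postcodes.api/csv/{postcode}"

def build_url(postcode):
    return URL_PATTERN.format(postcode=postcode.replace(" ", ""))

print(build_url("B42 2SU"))

# Pretend this CSV came back from the API for that postcode
sample_response = "postcode,lat,lng,ward\nB422SU,52.52,-1.90,Perry Barr\n"
rows = list(csv.DictReader(io.StringIO(sample_response)))
print(rows[0]["ward"])
```

Run over a whole column of postcodes, this replaces days of manual lookups with a loop.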

Still a print culture

Notably, the trainees bring up the dominance of print culture. “I can see how this works well online,” says one, “but our newsroom will want to see a print story.”

One of the effects of convergence on news production is that a tool traditionally left to designers after the journalist has finished their role in the production line is now used by the journalist as part of their newsgathering role – visualizing data to see the story within it, and possibly publishing that online to involve users in that process too.

A print news story – in this instance – may result from the visualization process, rather than the other way around.

More broadly, it’s another symptom of how news production is moving from a linear process involving division of labour to a flatter, more overlapping organization of processes and roles – which involves people outside of the organization as well as those within.

Mashups

The final session covers mashups. This is an opportunity to explore the broader possibilities of the technology, how APIs and semantic data fit in, and some basic tools and tutorials.

Clearly, a well-produced mashup requires more than half a day and a broader skillset than exists in journalists alone. But by using tools like Mapalist the trainees have actually already created a mashup. Again, like visualization, there is a sliding scale between quick and rough approaches to find stories and communicate them – and larger efforts that require a bigger investment of time and skill.

As the trainees are already engrossed in their own projects, I don’t distract them too much from that course.

You can see what some of the trainees produced at the links below:

Matt Holehouse:

Rate of deaths in industrial accidents in the EU (per 100k)

Raf Sanchez:

Rosie Ensor

  • Places with the highest rates for ASBOs

Sarah Rainey

The New Online Journalists #6: Conrad Quilty-Harper

As part of an ongoing series on recent graduates who have gone into online journalism, The Telegraph’s new Data Mapping Reporter Conrad Quilty-Harper talks about what got him the job, what it involves, and what skills he feels online journalists need today.

I got my job thanks to Twitter. Chris Brauer, head of online journalism at City University, was impressed by my tweets and my experience, and referred me to the Telegraph when they said they were looking for people to help build the UK Political database.

I spent six weeks working on the database, at first manually creating candidate entries, and later mocking up design elements and cleaning the data using Freebase Gridworks, Excel and Dabble DB. At the time the Telegraph was advertising for a “data juggler” role, and I interviewed for the job and was offered it.

Data journalism pt4: visualising data – tools and publishing (comments wanted)

This is a draft from a book chapter on data journalism (here are parts one, two and three, which looks at the charts side of visualisation). I’d really appreciate any additions or comments you can make – particularly around tips and tools.

UPDATE: It has now been published in The Online Journalism Handbook.

Visualisation tools

So if you want to visualise some data or text, how do you do it? Thankfully there are now dozens of free and cheap pieces of software that you can use to quickly turn your tables into charts, graphs and clouds.

The best-known tool for creating word clouds is Wordle (wordle.net). Simply paste a block of text into the site, or the address of an RSS feed, and the site will generate a word cloud whose fonts and colours you can change to your preferences. Similar tools include Tagxedo (tagxedo.com) and Wordlings (http://wordlin.gs), both of which allow you to put your word cloud into a particular shape.
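Under the hood a word cloud is little more than a word-frequency count, with the commonest words set in the largest type. A minimal Python sketch of that counting step, on an invented snippet of text:

```python
# The counting step behind a word cloud: tally word frequencies, then
# size each word by its count. The sample text is invented.
import re
from collections import Counter

text = "data journalism is journalism done with data and with care"
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

# The top words would get the biggest fonts in the cloud
print(counts.most_common(3))
```

Tools like Wordle add stopword removal and layout on top, but the ranking itself is this simple.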

ManyEyes (manyeyes.alphaworks.ibm.com/manyeyes/) also allows you to create word clouds and tag clouds – as well as word trees and phrase nets that allow you to see common phrases. But it is perhaps most useful in allowing you to easily create scattergrams, bar charts, bubble charts and other forms. The site also contains a raft of existing data that you can play with to get a feel for the site. Similar tools that allow access to other data include Factual (factual.com), Swivel (swivel.com) [see comments], Socrata (socrata.com) and Verifiable.com (verifiable.com). And Google Fusion Tables (tables.googlelabs.com) is particularly useful if you want to collaborate on tables of data, as well as offering visualisation options.

More general visualisation tools include widgenie (widgenie.com), iCharts (icharts.net), ChartTool (onlinecharttool.com) and ChartGo (www.chartgo.com). FusionCharts is a piece of visualisation software with a Google Gadget service that publishers may find useful. You can find instructions on how to use it at www.fusioncharts.com/GG/Docs.

If you want more control over your visualisation – or want it to update dynamically when the source information is updated – Google Chart Tools (code.google.com/apis/charttools) is worth exploring. This requires some technical knowledge, but there is a lot of guidance and help on the site to get you started quickly.

Tableau Public is a piece of free software you can download (tableausoftware.com/public) with some powerful visualisation options. You will also find visualisation options on spreadsheet applications such as Excel or the free Google Docs spreadsheet service. These are worth exploring as a way to quickly generate charts from your data on the fly.

Publishing your visualisation

There will come a point when you’ve visualised your data and need to publish it somehow. The simplest way to do this is to take an image (screengrab) of the chart or graph. This can be done with a web-based screencapture tool like Kwout (kwout.com), a free desktop application like Skitch (skitch.com) or Jing (jingproject.com), or by simply using the ‘Print Screen’ button on a PC keyboard (cmd+shift+3 on a Mac) and pasting the screengrab into a graphics package such as Photoshop.

The advantage of using a screengrab is that the image can be easily distributed on social networks, image sharing websites (such as Flickr), and blogs – driving traffic to the page on your site where it is explained.

If you are more technically minded, you can instead choose to embed your chart or graph. Many visualisation tools will give you a piece of code which you can copy and paste into the HTML of an article or blog post in the place you wish to display it (this will not work on most third party blog hosting services, such as WordPress.com). One particular advantage of this approach is that the visualisation can update itself if the source data is updated.

Alternatively, an understanding of Javascript can allow you to build ‘progressively enhanced’ charts which allow users to access the original data or see what happens when it is changed.

Showing your raw data

It is generally a good idea to give users access to your raw data alongside its visualisation. This not only allows them to check it against your visualisation but add insights you may not otherwise gain. It is relatively straightforward to publish a spreadsheet online using Google Docs (see the sidebar on publishing a spreadsheet).

SIDEBAR: How to: publish a spreadsheet online

Google Docs (docs.google.com) is a free website which allows you to create and share documents. You can share them via email, by publishing them as a webpage, or by embedding your document in another webpage, such as a blog post. This is how you share a spreadsheet:

  1. Open your spreadsheet in Google Docs. You can upload a spreadsheet into Google Docs if you’ve created it elsewhere – there is a size limit, however, so if you are told the file is too big try removing unnecessary sheets or columns.
  2. Look for the ‘Share’ button (currently in the top right corner) and click on it.
  3. A drop-down menu should appear. Click on ‘Publish as a web page’
  4. A new window should appear asking which sheets you want to publish. Select the sheet you want to publish and click ‘Start publishing’ (you should also make sure ‘Automatically republish when changes are made’ is ticked if you want the public version of the spreadsheet to update with any data you add.)
  5. Now the bottom half of that window – ‘Get a link to the published data’ – should become active. In the bottom box should be a web address where you can now see the public version of your spreadsheet. If you want to share that, copy the address and test that it works in a web browser. You can now link to it from any webpage.
  6. Alternatively, you can embed your spreadsheet – or part of it – in another webpage. To do this click on the first drop-down menu in this area – it will currently say ‘Web page’ – and change it to ‘HTML to embed in a page’. Now the bottom box on this window should show some HTML beginning with an ‘iframe’ tag, which you can copy and paste into the HTML of your page.
  7. If you want to embed just part of a spreadsheet, in the box that currently says ‘All cells’ type the range of cells you wish to show. For example, typing A1:G10 will select all the cells in your spreadsheet from A1 (the first row of column A) to G10 (the 10th row of column G). Once again, the HTML below will change so that it only displays that section of your spreadsheet.
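If cell ranges like A1:G10 are unfamiliar, the mapping from a range to rows and columns can be made concrete with a tiny parser – for illustration only, and limited to simple single-letter columns:

```python
# How a range like 'A1:G10' maps to columns and rows -- a small parser
# for simple single-letter-column ranges, for illustration only.
import re

def parse_range(cell_range):
    m = re.match(r"([A-Z])(\d+):([A-Z])(\d+)", cell_range)
    c1, r1, c2, r2 = m.groups()
    cols = ord(c2) - ord(c1) + 1  # letters give the column span
    rows = int(r2) - int(r1) + 1  # numbers give the row span
    return cols, rows

print(parse_range("A1:G10"))  # 7 columns (A to G) by 10 rows
```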

Once again, I’d welcome any comments on things I may have missed or tips you can add. Part 5, on mashups, is now available here.

Data journalism pt3: visualising data – charts and graphs (comments wanted)

This is a draft from a book chapter on data journalism (the first, on gathering data, is here; the section on interrogating data is here). I’d really appreciate any additions or comments you can make – particularly around considerations in visualisation. A further section, on visualisation tools, can be found here.

UPDATE: It has now been published in The Online Journalism Handbook.

“At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers – even a very large set – is to look at pictures of those numbers.” (Edward Tufte, The Visual Display of Quantitative Information, 2001)

Visualisation is the process of giving a graphic form to information which is often otherwise dry or impenetrable. Classic examples of visualisation include turning a table into a bar chart, or a series of percentage values into a pie chart – but the increasing power of both computer analysis and graphic design software has seen the craft of visualisation develop with increasing sophistication. In larger organisations the data journalist may work with a graphic artist to produce an infographic that visualises their story – but in smaller teams, in the initial stages of a story, or when speed is of the essence they are likely to need to use visualisation tools to give form to their data.

Broadly speaking there are two typical reasons for visualising data: to find a story; or to tell one. Quite often, it is both.