Tag Archives: data

Charities data opened up – journalists: say thanks.

Having made significant inroads in opening up council and local election data, Chris Taggart has now opened up charities data from the less-than-open Charity Commission website. The result: a new website – Open Charities.

The man deserves a round of applause. Charity data is enormously important in all sorts of ways – and is likely to become more so as the government leans on the third sector to take on a bigger role in providing public services. Making it easier to join the dots between charitable organisations, the private and public sector, contracts and individuals – which is what Open Charities does – will help journalists and bloggers enormously.

A blog post by Chris describes the site and its background in more depth. In it he explains that:

“For now, it’s just the simplest of things: a web application with a unique URL for every charity based on its charity number, and with the basic information for each charity available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.

“The entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB for the compressed version, and should probably only be attempted by those familiar with manipulating large datasets (don’t try opening it up in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.”
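Based on that description, here’s a minimal sketch of grabbing a charity’s basic information as JSON – note that the URL pattern and the charity number below are assumptions for illustration, not taken from the Open Charities documentation:

import json
import urllib.request

# Assumed URL pattern: a unique URL per charity, keyed by charity number,
# with a .json variant, as the blog post describes. The charity number
# here is a placeholder, not a real example.
charity_number = "123456"
url = "http://opencharities.org/charities/%s.json" % charity_number

with urllib.request.urlopen(url) as resp:
    charity = json.load(resp)

print(charity)  # the charity's basic information, as a Python dict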

Chris promises to add more features “if there’s any interest”.

Well, go on…

When Open Public Data Isn’t…?

This year was always going to be an exciting year for open data. The launch of data.gov.uk towards the end of last year, along with commitments from both sides of the political divide before the election that continue to be acted upon now, means data is starting to be opened up – scruffily at first, but that’s okay – and commercial enterprises are maybe starting to get interested too…

…which was always the plan…

…but how is it starting to play out?

The story so far…

A couple of weeks ago, the first meeting of the Public Sector Transparency Board was convened, which discussed – and opened up for further discussion – a set of draft public data principles. (Papers relating to the meeting can be found here.)

In a letter to the responsible Minister prior to the meeting (commentable extracts here), Professor Nigel Shadbolt suggested that:

4. … The economic analysis, and the views we regularly hear from the business community themselves, are unequivocal: data must be released for free re-use so that the private sector can add new value and develop innovative new business services from government information. …

8. Transparency principles need to be extended to those who operate public services on a franchised, regulated or subsidised basis. If the state is controlling a service to the public or is franchising or regulating its delivery the data about that activity should be treated as public data and made available. …

11. We need to support the development of licences and supporting policies to ensure that data released by all public bodies can be freely re-used and is interoperable with the internationally recognised Creative Commons model. …

12. A key Government objective is to realise significant economic benefits by enabling businesses and non-profit organisations to build innovative applications and websites using public data. …

The business imperative is further reinforced by the second of three reasons given by the Open Government Data tracking project in Why Open Government Data?:

Releasing social and commercial value. In a digital age, data is a key resource for social and commercial activities. Everything from finding your local post office to building a search engine requires access to data, much of which is created or held by government. By opening up data, government can help drive the creation of innovative business and services that deliver social and commercial value.

So how has business been getting involved? As several local councils start to pick up a request contained in a letter from the Prime Minister published at the end of May that they open up their financial data, Chris Taggart/@countculture, developer of OpenlyLocal, posted a piece – The open spending data that isn’t… this is not good – in which he described how apparently privileged access to financial data from several councils was being used to drive Spikes Cavell’s SpotlightOnSpend website (for a related open equivalent, see Adrian Short’s Armchair Auditor). Downstream use of the data was hampered by a “personal use only” licence and a CAPTCHA that requires a human in the loop in order to access the data. The Public Sector Transparency Board promptly responded to Chris’s post (Work on Local Spending Data), quoting the principle that:

“Public data will be released under the same open licence which enables free reuse, including commercial reuse – all data should be under the same easy to understand licence. Data released under the Freedom of Information Act or the new Right to Data should be automatically released under that licence.”

and further commenting: “We have already reminded those involved of this principle and the existing availability of the ‘data.gov.uk’ licence which meets its criteria, and we understand that urgent measures are already taking place to rectify the problems identified by Chris.”

Spikes Cavell chief executive Luke Spikes responded via an interview with Information Age (SpotlightOnSpend reacts to open criticism):

[Spikes Cavell], he says, is first and foremost a spend analysis software and consultancy supplier, and it publishes data through SpotlightOnSpend as a free, optional and supplementary service for its local government customers. The hope is that this might help the company to win business, he explains, but it is not a money-spinner in itself.

“The contribution we’re making to transparency is less about what the purists would like to see, it’s simply putting the information out there in a form that is useful for the audience for which it is intended [i.e. citizens and small businesses]” he said. “But there are a few things we haven’t done right, and we’ll fix that.”

Following the criticism, Cavell says that SpotlightOnSpend will make the data available for download in its raw form. “That’s what we thought was the most sensible solution to overcoming this obstacle,” he told me.

Adrian Short, developer of the open Armchair Auditor, then picked up the baton in a comment to the Information Age article:

There is room for Spikes Cavell to develop their applications and I doubt that anyone has any objection to them offering their services to councils commercially just like thousands of other businesses. But they do not have a monopoly of ideas, talent and resources to build great applications with public spending data. Nor does anyone else.

The concerns that @CountCulture raised were not that Spikes Cavell were trading with councils or trying to attract their business but that they are doing so in a way that precludes anyone else developing applications with this data. By legally and technically locking the data into the Spotlight on Spend platform, everyone else is excluded.

It’s understandable that most councils have no understanding of the culture, legalities or technicalities of open data. This is new territory for nearly all of them. Those councils that have put their data straight onto Spotlight on Spend – bypassing the part where it is made genuinely open – cannot be criticised for not complying with what to them must be a very unusual requirement. But that’s why @CountCulture and I and others want to be very clear that the end result of this process is having effective scrutiny of council finances through multiple websites and applications, not just Spotlight on Spend or any other single website or application. The way we get there is with open data.

And Chris Taggart’s response? (Update on the local spending data scandal… the empire strikes back):

Lest we forget, Spikes Cavell is not an agent for change here, not part of those pushing to open public data, but in fact has a business model which seems to be predicated on data being closed, and the maintenance of the existing procurement model which has served us so badly.

(For recommendations on how councils might publish financial data in an open way, see Publishing itemised local authority expenditure – advice for comment (reiterated here: Open Government Data: Finances). The Office for National Statistics occasionally releases summary statistics (e.g. as republished in OpenlyLocal: Local spending data in OpenlyLocal, and some thoughts on standards), but at only a coarse resolution. As to how much it might cost to do that, some are claiming Cost of publishing ‘£500 and above’ council expenditure prohibitive.)

From my own perspective, I would also add that should consultants like Spikes Cavell create derived data works from open public data, there should be some transparency in declaring how the derived work was created (see for example: So Where Do the Numbers in Government Reports Come From? and Data is not Binary).

Another example of how once-open data is becoming “closed” behind a paywall comes from Paul Geraghty (“Closed Data Now” SOCITM does a “Times”):

If my memory serves me well the e-Gov Register (eGR) hosted by Brent has been showing every IT supplier sortable by product type, supplier, local government type and even on maps for about 6 or 7 years (some links below if you hurry up).

I am aware that there are problems with this data, in my own past employer I know that some of the data is out of date.

But it is there, it is useful and informative and it is OPEN to all, even SMEs like me researching on niche markets in local government.

The latest move by SOCITM (and presumably with the knowledge of the LGA and the IDeA) means all that data is going to go behind the SOCITM paywall.

And the response from Socitm, via a comment from Vicky Sargent:

First of all, I’d like to clear up some points of fact. No local authority or other public sector service provider that provides data to the Applications Register will have to pay for their subscription, and for them access to the data will be free, regardless of whether they subscribe to Socitm Insight (as 95% of local authorities do). Anyone who is employed in an organisation that is an Applications Register subscriber – f-o-c or paid – will be able to access the data.
Then there is who pays. Clearly an information service like this that adds value, has to cover the costs of development and delivery. Unlike government departments, LGA, IDeA and local councils, Socitm is not directly funded by the taxpayer, and needs to fund the services it delivers from money raised from fees, subscriptions, events and other services.
The business model we use for the Applications Register is that public bodies that contribute should not pay to use the service, but those that do not contribute pay in cash. Private sector bodies can only pay in cash.

Your article also suggests that Socitm’s support for the move towards open data is hypocritical, set against our business model for the Applications Register. I think this misunderstands the thinking behind ‘open data’, which is to get raw data out of government systems for transparency purposes, also so that it can be re-used. Socitm has been a long-term strong supporter of this.
The open data agenda explicitly acknowledges that ‘re-use’ includes adding value and selling on. If councils were to routinely publish the sort of data we will collect for the Applications Register, there would still be work to be done aggregating and manipulating and re-publishing the information to make it useful, and that is what we do, recovering our costs in the way described.

Adrian Short (can you see how it’s the same few players engaging in these debates?!;-) develops the “keep it free” argument in a further comment:

Your argument presupposes your conclusion, which is that Socitm is the best organisation to be managing/publishing the applications register. Because, as you correctly say, you don’t receive any direct funding from the taxpayer, you have to find other ways of paying for that work. Inevitably this means charging non-contributing users.

What you’re missing is that millions of pounds of public money is spent every year supporting businesses, helping to create new markets and generally oiling the parts of the economy that don’t easily oil themselves. That’s what BIS and the economic development departments of local authorities do. The public interest and private benefit aren’t easily distinguishable unless you contrive that private benefit for a small group to the exclusion of others. But as Paul rightly points out, the potential market for this information is enormous — essentially every business and individual that works for, supplies or wants to work for the public sector, from the individual IT worker to the massive global consultancies, manufacturers and software firms.

Currently it’s a small number of incumbent suppliers that benefit from this relatively inefficient market. Other businesses lose. Public sector buyers lose. The taxpayer loses.

Keeping this information free for everyone to use and enabling it to be used in future when combined with the enormous amount of data that will be released soon will be likely to produce economic benefits to the public through market efficiencies that outstrip its cost by several orders of magnitude. If Socitm can’t publish this data in the most useful, non-discriminatory way then it’s not the best organisation for the job. I can see no reason in principle or practice why it shouldn’t be fully funded by the taxpayer and free at the point of use for everyone. To do otherwise would be an extremely false economy.

(Note that “free vs. open” debates have also been played out in the open source software arena. Maybe it’s worth revisiting them…?)

The previously quoted comment from Vicky Sargent also contains what might be described as an example case study:

This brings me to Better Connected, the annual survey of council websites carried out by Socitm. You say:
Just about every council in the UK has little option but to pay SOCITM hundreds of pounds annually to join their club to find out the exact details of how their website is being ranked.

The data for Better connected only exists because Socitm has devised a methodology for evaluating websites, pays for a team of reviewers to collect the data each year, and then analyses and publishes the results. No one has to subscribe; they choose to do so because the information is valuable to them.
Information about how we do the evaluation and ranking is freely available on our website, in our press releases and in our free-to-join Website Usage and Improvement community. The 2010 headline results for all councils are published on socitm.net as open data under a creative commons licence and are linked from data.gov.uk.
If the Better connected report has become a ‘must read’, that is because the investment Socitm has made in the product has led to it being a more cost-effective investment for councils than alternative sources of advice on improving their website. Many users have told us Better connected (cover price £415 for non-subscribers or free as part of the Socitm Insight subscription that starts at £670 pa for a small district council) is worth many days’ consultancy, even when that consultancy is purchased from lower cost SME providers.

As these examples show, the licence under which data is originally released can have significant consequences for its downstream use and commercialisation. The open source software community has known this for years, of course, which is why organisations like GNU have two different licences – GPL, which keeps software open by tainting other software that includes GPL libraries, and LGPL, which allows libraries to be used in closed/proprietary code. There is a good argument that by combining data from different open sources in a particular way, valuable results may be created, but it should also be recognised that work may be expended doing this and a financial return may need to be generated (so maybe companies shouldn’t have to open up their aggregated datasets?). Just how we balance commercial exploitation with ongoing openness and access to raw public data is yet to be seen.

(The academic research area – which also has its own open data movement (e.g. Panton Principles) – also suggests a different sort of tension arising from the “potential value” of a data set or aggregated data set. For example, research groups analysing data in one particular way may be loath to release it to others because they want to analyse it in another, value-creating way at a later date.)

Getting the licensing right is particularly important if councils become obliged to use third party services to publish their data. The grand vision of the Public Sector Transparency Board, for example, is captured in this paragraph from Shadbolt’s letter to Maude:

13. We must promote and support the development and application of open, linked data standards for public data, including the development of appropriate skills in the public services. …

But as a recent report, again from Chris Taggart, on Publishing Local Open Data – Important Lessons from the Open Election Data project suggests, there are certain challenges associated with web related development in local authorities, and in particular a significant lack of experience and expertise in dealing with Linked Data (which is not surprising – it is a relatively new, and so far arcane, technology). Here are the first four lessons, for example:

– There is a lack of ‘corporate’ awareness/understanding of open data issues, and this will inhibit take up of open, linked data publishing unless it is addressed
– There is a lack of even basic web skills at some councils
– Many councils lack web publishing resources, never mind the resources to implement open, linked data publishing
– The understanding of even the basics of linked data and the steps to publishing public data in this way is very, very limited

What this suggests to me is that it is likely that in the short term at least, the capability for publishing Linked Data will reside in specialist third party companies, possibly one of only a few companies. As Paul Geraghty discovers from the eGovernment Register in If #localgovweb supplier says “RDF WTF?” Sack em #opendata #spending:

[I]t seems to me that of 450 or so local government organisations, 357 are listed as having a “Financials” supplier **.

There are only 18 suppliers listed, and of those there are 6 Big Ones.

Between them the 6 Big Ones supply “Financials” to 326 Councils.

Don’t you think that the first one of those 6 Big Ones who natively supports LOD [Linked Open Data] as an export option (or agrees to within, say, 8 months) really ought to be favoured when bidding for new business?

Let’s go further, let’s say that it should be mandated that all new contracts with “Financials” suppliers include an LOD clause.

Perhaps Mr Pickles could dispatch someone to have a chat with one or two of these suppliers, or that he should have someone check that future contracts for Financial products being sold to Local Government all contain the necessary wording to make this happen?

So instead of trying to train and cajole 450 councils to FTP assorted CSV files into localdata.gov.uk (FFS) all the way through to grokking RDF, namespaces, LOD et al – why does the government not get on and make a strategy to bully and coerce 6 suppliers instead – and potentially get 326 councils teed up to produce useful LOD a bit sharpish?
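To make the idea of an LOD export concrete, here is a rough sketch of what a single itemised council payment might look like as Linked Open Data, generated with Python’s rdflib. The vocabulary, URIs and values are invented for illustration – this is not any supplier’s actual export format or an agreed standard:

# A rough sketch of one council payment as RDF. The namespace, URIs and
# values below are hypothetical, invented purely for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/spending/")  # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

payment = URIRef("http://example.org/council/anytown/payment/0001")
g.add((payment, RDF.type, EX.Payment))
g.add((payment, EX.supplier, URIRef("http://example.org/supplier/acme")))
g.add((payment, EX.amount, Literal("1250.00", datatype=XSD.decimal)))
g.add((payment, EX.date, Literal("2010-06-30", datatype=XSD.date)))

print(g.serialize(format="turtle"))  # emit the payment as Turtle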

Another technology option is for councils to publish their own linked data to a commercially hosted datastore. At the moment, the two companies I know of that offer “datastore” services for publishing Linked Data, at scale, are Talis and the Stationery Office (in partnership with Garlik). It is, of course, open knowledge that one Professor Nigel Shadbolt is a director of Garlik Limited.

Music journalism and data (MA Online Journalism multimedia projects pt1)

I’ve just finished looking at the work from the Diploma stage of my MA in Online Journalism, and – if you’ll forgive the effusiveness – boy is it good.

The work includes data visualisation, Flash, video, mapping and game journalism – in short, everything you’d want from a group of people who are not merely learning how to do journalism but exploring what journalism can become in a networked age.

But before I get to the detail, a bit of background…

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES

(The original post reproduces each of these tables as an image from the report.)

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report publishes fragments of the released dataset as information, in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs.

There are also columns headed SEO/SIO and HEO, for example, that may or may not relate to costs. (To see all the headings, see the CSV doc on Google spreadsheets.)
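If you want to inspect the headings yourself, a quick sketch along the following lines does the job – assuming you have downloaded the CSV locally (the filename here is a placeholder, not the name of the released file):

import pandas as pd

# Load the released dataset; the filename is a hypothetical placeholder.
df = pd.read_csv("coi-websites-2009-10.csv")

print(list(df.columns))  # every column heading in the dataset
print([c for c in df.columns if "cost" in c.lower()])  # cost-related headings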

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)
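By way of a worked example, a “cost per visit” figure is presumably just total reported cost divided by total visits – something like the following sketch, where the filename and column names are hypothetical placeholders rather than the released CSV’s actual headings:

import pandas as pd

# The filename and column names below are placeholders; the point is the
# audit trail, not the specific headings, which would need checking
# against the released CSV.
df = pd.read_csv("coi-websites-2009-10.csv")
df["cost_per_visit"] = df["total_cost"] / df["visits"]

print(df[["website", "cost_per_visit"]].head())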

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT”, etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

select A,B,EL+EN+EP+ER+ET
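If you want to run such queries outside the browser, the sketch below fires the same query language at the Google Visualization API’s spreadsheet endpoint and reads the result back as CSV. The spreadsheet key is a placeholder, not the COI sheet’s key, and the endpoint URL is the current form of the API, which has changed over the years:

import csv
import io
import urllib.parse
import urllib.request

SHEET_KEY = "YOUR_SPREADSHEET_KEY"  # placeholder spreadsheet key
QUERY = "select A,B,EL+EN+EP+ER+ET"  # sum the cost columns, as above

url = ("https://docs.google.com/spreadsheets/d/" + SHEET_KEY
       + "/gviz/tq?tqx=out:csv&tq=" + urllib.parse.quote(QUERY))

with urllib.request.urlopen(url) as resp:
    for row in csv.reader(io.TextIOWrapper(resp, encoding="utf-8")):
        print(row)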

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… 😉

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… 😉 If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below 🙂

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it (you also need to be logged in to a Google account to view the data, both as a spreadsheet and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a value-add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, just pasting the spreadsheet key and authorisation token into the cut-down Guardian datastore explorer script I used in Using CSV Docs As a Database generates an explorer for the expenses data.

So for example, we can run a report to group expenses by category and MP:

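(The underlying query is something along the lines of the following – assuming, hypothetically, that the MP’s name is in column A, the claim category in column C and the claim amounts in column E; the actual column letters would need checking against the spreadsheet:)

select A,C,sum(E) group by A,C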

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – select A,sum(E) where E>0 group by A order by sum(E) asc – viewing the result as a column chart.


We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with lowest claims to date (the datastore blog post explains why the data is incomplete and to be treated warily).
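Again assuming claim amounts in column E and MP names in column A, that query looks something like:

select A,sum(E) where E>0 group by A order by sum(E) asc limit 10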


Changing the asc order to desc in the above query gives possibly a more interesting result: the MPs who have the largest claims to date (presumably because they have got round to filing their claims!;-)


Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data, does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of the data that is otherwise publicly available that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of licence with which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap licence conditions over the data that I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?

Get used to reading this…

“We have a team of developers going through the data now – and we’ll let you know here what we learn as and when we learn it.”

If you had any doubt over the concept of ‘programmer as journalist’, that quote above from The Guardian’s liveblog of the opening of the COINS database gives you a preview of things to come. While you’re at it, you might as well add in ‘statistician as journalist’ and ‘information designer as journalist’ – or look at my post from 2008 on New Journalists for New Information Flows. Are we there yet?

The Great Government Data Rush – what does it mean for journalists?

Earlier this week I posted briefly on what I consider to be the most significant move for journalism by the UK government since the Freedom of Information Act. But I wanted to look more systematically at what is likely to be a huge change in the information landscape that journalists deal with…

So. In the spirit of data journalism, here is an embedded spreadsheet of the timetable of data to be released by national government, local government, and other bodies. I’ve added notes on how I feel each piece of data could be important, and any useful links – but I’d like you to add any thoughts on other possibilities. Here it is:

[Embedded spreadsheet: timetable of data releases]

Meanwhile, over at Data.gov.uk, the Local Data Panel has published a post inviting comment on the format that data might be supplied in, and fields it might contain.

  • As a first stage, publish the raw data and any lookup table needed to interpret it in a spreadsheet as a CSV or XML file as soon as possible. This should be put on the council’s website as a document for anyone to download. Or even published in a service such as Google Docs
  • There is not yet a national approach for publishing local authority expenditure data. This should not stop publication of data in its raw, machine-readable form. Observing such raw data being used is the only route to a national approach, should one be required
  • Publishing raw data will allow the panel and others to assess how that data could/should be presented to users. Sight of the data is worth a hundred meetings. Members of the panel will study the data, take part in the discussion and revise this advice.
  • As a second stage, informed by the discussion, the panel and users can then give feedback about publishing data (RDF, CSV, etc) in a way that can be consistent across all local authorities involving structured, regularly updated data published on the Web using open standards.

Help Me Investigate contributor and all-round good guy Neil Houston has already responded with some very interesting points.

“You’d be surprised how many times there are some systems where it’s not totally easy to identify the payment back to the relevant invoice (apart from a manual reconciliation); you need to know the invoice side of the transactions, as that is where the cost will be booked to (as the payment details will just be crediting cash, debiting Accounts Payable).”

Local and national government open up data – starting now

Yesterday saw the publication of an incredible letter by David Cameron to government departments, including local government. It sets out a whole range of areas where data is to be released – some of it scheduled for January 2011, but some of it straight away.

You can find my thoughts about the release in this article by Laura Oliver, along with those of the likes of David Higgerson. This is probably as important an event as the passing of the FOI Act – it is more important than the launch of data.gov.uk. Note it.

5 data visualisation tips from David McCandless

Here’s another snippet from my data journalism book chapter (now published). As part of my research, David McCandless, author of the very lovely book and website Information is Beautiful, gave his 5 tips for visualising data:

  1. Double source data wherever possible – even the UN and WorldBank can make mistakes
  2. Take information out – there’s a long tradition among statistical journalists of showing everything. All data points. The whole range. Every column and row. But stories are about clear threads with extraneous information fuzzed out. And journalism is about telling stories. You can only truly do that when you mask out the irrelevant or the minor data. The same applies to design which is about reducing something to its functional essence.
  3. Avoid standard abstract units – tons of carbon, billions of dollars – these kinds of units are over-used and impossible to imagine or relate to. Try to rework or process units down to ‘everyday’ measures. Try to give meaningful context for huge figures whenever possible.
  4. Self-sufficiency – all graphs, charts and infographics should be self-sufficient. That is, you shouldn’t require any other information to understand them. They’re like interfaces. So each should have a clear title, legend, source, labels etc. And credit yourself. I’ve seen too many great visuals with no credit or name at the bottom.
  5. Show your workings – transparency seems like a new front for journalists. Google Docs makes it incredibly easy to share your data and thought processes with readers. Who can then participate.

Data journalism pt2: Interrogating data

This is a draft from a book chapter on data journalism (the first, on gathering data, is here). I’d really appreciate any additions or comments you can make – particularly around ways of spotting stories in data, and mistakes to avoid.

UPDATE: It has now been published in The Online Journalism Handbook.

“One of the most important (and least technical) skills in understanding data is asking good questions. An appropriate question shares an interest you have in the data, tries to convey it to others, and is curiosity-oriented rather than math-oriented. Visualizing data is just like any other type of communication: success is defined by your audience’s ability to pick up on, and be excited about, your insight.” (Fry, 2008, p4)

Once you have the data, you need to see if there is a story buried within it. The great advantage of computer processing is that it makes it easier to sort, filter, compare and search information in different ways to get to the heart of what – if anything – it reveals.