Tag Archives: opengov

Aggregated Local Government Verticals Based on LocalGov Service IDs

(Punchy title, eh?!) If you’re a researcher interested in local government initiatives or service provision across the UK on a particular theme, such as air quality, or you’re looking to start pulling together an aggregator of local council consultation exercises, where would you start?

Really – where would you start? (Please post a comment saying how you’d make a start on this before reading the rest of this post… then we can compare notes;-)

My first thought would be to use a web search engine and search for the topic term using a site:gov.uk search limit, maybe along with intitle:council, or at least council. This would generate a list of pages on (hopefully) local gov websites relating to the topic or service I was interested in. That approach is a bit hit or miss though, so next up I’d probably go to DirectGov, or the new gov.uk site, to see if they had a single page on the corresponding resource area that linked to appropriate pages on the various local council websites. (The gov.uk site takes a different approach to the old DirectGov site, I think, trying to find a single page for a particular council given your location rather than providing a link for each council to a corresponding service page?) If I was still stuck, OpenlyLocal, the site set up several years ago by Chris Taggart/@countculture to provide a single point of reference for looking up common adminsitrivia details relating to local councils, would be the next thing that came to mind. For a data related query, I would probably have a trawl around data.gov.uk, the centralised (but far form complete) UK index of open public datasets.

How much more convenient it would be if there was a “vertical” search or resource site relating to just the topic or service you were interested in, that aggregated relevant content from across the UK’s local council websites in a single place.

(Erm… or maybe it wouldn’t?!)

Anyway, here are a few notes for how we might go about constructing just such a thing out of two key ingredients. The first ingredient is the rather wonderful Local directgov services list:

This dataset is held on the Local Directgov platform which provides the deep links into Local council websites for a number of services in Directgov. The Local Authority Service details holds the local council URLS for over 240 services where the customer can directly transfer to the appropriate service page on any council in England.

The date on the dataset post is 16/09/2011, although I’m not sure if the data file itself is more current (which is one of the issues with data.gov.uk, you could argue…). Presumably, gov.uk runs off a current version of the index? (Share…. 😉 Each item in the local directgov services list carries with it a service identifier code that describes the local government service or provision associated with the corresponding web page. That it, each URL has associated with it a piece of metadata identifying a service or provision type.

Which leads to the second ingredient: the esd standards Local Government Service List. This list maps service codes onto a short key phrase description of the corresponding service. So for example, Council – consultation and community engagement is has service identifier 366, and Pollution control – air quality is 413. (See the standards page for the actual code/vocabulary list in a variety of formats…)

As a starter for ten, I’ve pulled the Directgov local gov URL listing and local gov service list into scraperwiki (Local Gov Web Pages). Using the corresponding scraper API, we can easily run a query looking up service codes relating to pollution, for example:

select * from `serviceDesc` where ToName like '%pollution%'

From this, we can pick up what service code we need to use to look up pages related to that service (413 in the case of air pollution):

select * from `localgovpages` where LGSL=413

We can also get a link to an HTML table (or JSON representation, etc) of the data via a hackable URI:


(Hackable in the sense we can easily change the service code to generate the table for the service with that code.)

So that’s the starter for 10. The next step that comes to my mind is to generate a dynamic Google custom search engine configuration file that defines a search engine that will search over just those URLs (or maybe those URLs plus the pages they link to). This would then provide the ability to generate custom search engines on the fly that searched over particular service pages from across localgov in a single, dynamically generated vertical.

A second thought is to grab those page, index them myself, crawl them/scrape them to find the pages they link to, and index those pages also (using something like tf-idf within each local council site to identify and remove common template elements from the index). (Hmmm… that could be an interesting complement to scraperwiki… SolrWiki, a site for compiling lists of links, indexing them, crawling them to depth N, and then configuring search ranking algorithms over the top of them… Hmmm… It’s a slightly different approach to generating custom search engines as a subset of a monolithic index, which is how the Google CSE and (previously) the Yahoo BOSS engines worked… Not scaleable, of course, but probably okay for small index engines and low thousands of search engines?)


So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]) , a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

COI web report 2009-10 table 1

COI web report 2009-10 table 2

COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is a still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings


COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context free way, (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbiests or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived. (“According to the Guardian/Times/Telegraph/FT, etc etc etc”. To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lays in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… 😉

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… 😉 If you manage to generate queries, in a bookmarkable form (e.g. using the COI website data explorer (see also this for more hints), please feel free to share the links in the comments below 🙂