Tag Archives: data

Merging Datasets with Common Columns in Google Refine

It’s an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here’s a way of doing it in Google Refine…

Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy of the Guardian data blog (pulled through the Google Spreadsheets to Yahoo pipes proxy mentioned here):

- University fees data (CSV via pipes proxy)

- University HESA stats, 2010 (CSV via pipes proxy)

We can now merge data from the two projects by creating a new column from values in an existing column within one project that are used to index into a similar column in the other project. Looking at the two datasets, both HESA Code and institution/University look like candidates for merging the data. Which should we go with? I’d go with the unique identifier (i.e. HESA code in this case) every time…

First, create a new column:

Now do the merge, using the cell.cross GREL (Google Refine Expression Language) command. Trivially, and pinching wholesale from the documentation example, we might use the following command to bring in Average Teaching Score data from the second project into the first:

cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0]

Note that there is a null entry and an error entry. It’s possible to add a bit of logic to tidy things up a little:

if (value!='null',cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0],'')

Here’s the result:

Coping with not quite matching key columns

Another situation that often arises is that you have two columns that almost but don’t quite match. For example, this dataset has a different name representation than the above datasets (Merge Test C):

There are several text processing tools that we can use to try to help us match columns that differ in well-structured ways:

In the above case, where I’m creating a new column based on the contents of the Institution column in Merge Test C, I’m using a couple of string processing tricks… The GREL expression may look complicated, but if you build it up in a stepwise fashion it makes more sense.

For example, the command replace(value,"this", "that") will replace occurrences of "this" in the string defined by value with "that". If we replace "this" with an empty string ('' (two single quotes next to each other) or "" (two double quotes next to each other)), we delete it from value: replace(value,"this", "")

The result of this operation can be embedded in another replace statement: replace(replace(value,"this", "that"),"that","the other"). In this case, the first replace will replace occurrences of "this" with "that"; the result of this operation is passed to the second (outer) replace function, which replaces "that" with "the other". Try building up the expression in realtime, and see what happens. First use:
toLowercase(value)
(what happens?); then:
replace(toLowercase(value),'the','')
and then:
replace(replace(toLowercase(value),'the',''),'of','')

The fingerprint() function then separates out the individual words that are left, orders them, and returns the result (more detail). Can you see how this might be used to transform a column that originally contains “The University of Aberdeen” to “aberdeen university”, which might be a key in another project dataset?
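Just to make the idea concrete outside Refine, here’s a rough Python sketch of that sort of key normalisation (the stopword list is purely illustrative; as I understand it, Refine’s own fingerprint() also strips punctuation and deduplicates tokens):

def normalise_key(name, stopwords=("the", "of")):
    # Lowercase, drop the noise words, split into tokens, sort and rejoin.
    tokens = [t for t in name.lower().split() if t not in stopwords]
    return " ".join(sorted(tokens))

print(normalise_key("The University of Aberdeen"))  # -> "aberdeen university"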

When trying to reconcile data across two different datasets, you may find you need to try to minimise the distance between almost common key columns by creating new columns in each dataset using the above sorts of technique.

Be careful not to create false positive matches though; and also be mindful that not everything will necessarily match up (you may get empty cells when using cell.cross; to mitigate this, filter rows using a crossed column to find ones where there was no match and see if you can correct them by hand). Even if you can’t completely successfully cross data from one project to another, you might manage to automate the crossing of most of the rows, minimising the amount of hand crafted copying you might have to do to tidy up the real odds and ends…
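If you wanted to prototype this sort of reconciliation outside Refine, something like the following pandas sketch captures the recipe: normalise a key column in each table, merge on it, and pull out the rows that didn’t match so you can fix them by hand. (The filenames and column names here are just made up for the example.)

import pandas as pd

fees = pd.read_csv("merge_test_a.csv")    # hypothetical file with a "University" column
hesa = pd.read_csv("merge_test_c.csv")    # hypothetical file with an "Institution" column

def normalise_key(name, stopwords=("the", "of", "university")):
    tokens = [t for t in str(name).lower().split() if t not in stopwords]
    return " ".join(sorted(tokens))

fees["key"] = fees["University"].map(normalise_key)
hesa["key"] = hesa["Institution"].map(normalise_key)

merged = fees.merge(hesa, on="key", how="left", indicator=True)

# The rows that didn't cross - the ones to go back and correct by hand.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["University", "key"]])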

So for example, here’s what I ended up using to create a “Pure key” column in Merge Test C:
fingerprint(replace(replace(replace(toLowercase(value),'the',''),'of',''),'university',''))

And in Merge Test A I create a “Complementary Key” column from the University column using fingerprint(value)

From the Complementary Key column in Merge Test A we call out to Merge Test C: cell.cross("Merge Test C", "Pure key").cells["UCAS ID"].value[0]

Obviously, this approach is far from ideal (and there may be more “correct” and/or efficient ways of doing this!), and the process described above is admittedly rather clunky. But it does start to reveal some of what’s involved in trying to bring data across to one Google Refine project from another using columns that don’t quite match in the original datasets, even though they (nominally) refer to the same thing, and it does provide a useful introductory exercise to some of the really quite powerful text processing commands in Google Refine…

First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between the datasets in Anscombe’s quartet, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 Grand Prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)

The release notes for the latest version of R-Studio suggest that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
library(manipulate)  # the manipulate package is installed along with R-Studio
manipulate(
  plot(cars, xlim=c(0,x.max)),
  x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(
  boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
  maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(
  boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval), outline=outline),
  maxval=slider(100,140),
  outline=checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plotting done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches that don’t break any of the assumptions the particular statistical method relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldn't stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]

Fragments: Glueing Different Data Sources Together With Google Refine

I’m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I’m not sure how it’ll play out in the end, but here are some fragments….

Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes)

Firstly, getting data into Google Refine… I had hoped to be able to pull a subset of data from a Google Spreadsheet into Google Refine by importing CSV data obtained from the spreadsheet via a query generated using my Google Spreadsheet/Guardian datastore explorer (see Using Google Spreadsheets as a Database with the Google Visualisation API Query Language for more on this) but it seems that Refine would rather pull the whole of the spreadsheet in (or at least, the whole of the first sheet (I think?!)).

Instead, I had to create a proxy to run the query via a Yahoo Pipe (Google Spreadsheet as a database proxy pipe), which runs the spreadsheet query, gets the data back as CSV, and then relays it forward as JSON:

Here’s the interface to the pipe – it requires the Google spreadsheet public key id, the sheet id, and the query… The data I’m using is a spreadsheet maintained by the Guardian datastore containing UK university fees data (spreadsheet).

You can get the JSON version of the data out directly, or a proxied version of the original CSV via the More options menu…

Using the Yahoo Pipes CSV output URL, I can now get a subset of data from a Google Spreadsheet into Google Refine…
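For context, what the pipe is wrapping is just a Google Visualisation API query against the spreadsheet; a rough Python sketch of fetching that sort of query result as CSV directly (outside Refine) might look like the following. (The /gviz/tq URL pattern is the current form of the endpoint rather than necessarily the one in use at the time, and the key and query are placeholders.)

import urllib.parse
import urllib.request

key = "SPREADSHEET_KEY"                      # placeholder spreadsheet key
query = "select A, B, C where D > 6000"      # placeholder Visualisation API query

url = ("https://docs.google.com/spreadsheets/d/" + key +
       "/gviz/tq?tqx=out:csv&gid=0&tq=" + urllib.parse.quote(query))

csv_text = urllib.request.urlopen(url).read().decode("utf-8")
print(csv_text[:500])   # first few rows of the CSV result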

Here’s the result – a subset of data as defined by the query:

We can now augment this data with data from another source using Google Refine’s ability to import/fetch data from a URL. In particular, I’m going to use the Yahoo Pipe described above to grab data from a different spreadsheet and pass it back to Google Refine as a JSON data feed. (Google spreadsheets will publish data as JSON, but the format is a bit clunky…)

To test out my query, I’m going to create a test query in my datastore explorer using the Guardian datastore HESA returns (2010) spreadsheet URL (http://spreadsheets1.google.com/spreadsheet/ccc?hl&key=tpxpwtyiYZwCMowl3gNaIKQ#gid=0) which also has a column containing HESA numbers. (Ultimately, I’m going to generate a URL that treats the Guardian datastore spreadsheet as a database that lets me get data back from the row with a particular HESA code column value. By using the HESA number column in Google Refine to provide the key, I can generate a URL for each institution that grabs its HESA data from the Datastore HESA spreadsheet.)

Hit “Preview Table Headings”, then scroll down to try out a query:

Having tested my query, I can now try the parameters out in the Yahoo pipe. (For example, my query is select D,E,H where D=21 and the key is tpxpwtyiYZwCMowl3gNaIKQ; this grabs data from columns D, E and H where the value of D (HESA Code) is 21). Grab the JSON output URL from the pipe, and use this as the URL template in Google Refine. Here’s the JSON output URL I obtained:

http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381
&_render=json&key=tpxpwtyiYZwCMowl3gNaIKQ
&q=select+D%2CE%2CH+where+D%3D21

Remember, the HESA code I experimented with was 21, so this is what we want to replace in the URL with the value from the HESA code column in Google Refine…

Here’s how we create the URLs built around/keyed by an appropriate HESA code…

Google Refine does its thing and fetches the data…

Now we process the JSON response to generate some meaningful data columns (for more on how to do this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure).

First, we say we want to create a new column based on the imported JSON data:

Then parse the JSON to extract the data field required in the new column.

For example, from the HESA data we might extract the Expenditure per student /10:

value.parseJson().value.items[0]["Expenditure per student / 10"]

or the Average Teaching Score (value.parseJson().value.items[0]["Average Teaching Score"]):

And here’s the result:
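For anyone who prefers to see the mechanics spelled out, here’s a rough Python sketch of what that fetch-and-parse step amounts to; the pipe URL and the value.items[...] structure follow the examples above, and everything else is just illustrative scaffolding:

import json
import urllib.parse
import urllib.request

PIPE = ("http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381"
        "&_render=json&key=tpxpwtyiYZwCMowl3gNaIKQ&q=")

def hesa_lookup(hesa_code, field="Average Teaching Score"):
    # Build the per-row query URL keyed by HESA code (as Refine's
    # "Add column by fetching URLs" step does), then pull out one field.
    query = "select D,E,H where D=" + str(hesa_code)
    data = json.loads(urllib.request.urlopen(PIPE + urllib.parse.quote(query)).read())
    # Mirrors value.parseJson().value.items[0][field] in GREL.
    return data["value"]["items"][0][field]

print(hesa_lookup(21))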

So to recap:

- we use a Yahoo Pipe to query a Google spreadsheet and get a subset of data from it;
- we take the CSV output from the pipe and use it to create a new Google Refine database;
- we note that the data table in Google Refine has a HESA code column; we also note that the Guardian datastore HESA spreadsheet has a HESA code column;
- we realise we can treat the HESA spreadsheet as a database, and further that we can create a query (prototyped in the datastore explorer) as a URL keyed by HESA code;
- we create a new column based on HESA codes from a generated URL that pulls JSON data from a Yahoo pipe that is querying a Google spreadsheet;
- we parse the JSON to give us a couple of new columns.

And there we have it – a clunky, but workable, route for merging data from two different Google spreadsheets using Google Refine.

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google visualisation API? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah - forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such as 4,500-6,000) in the Fees column and replacing them with minimum fee values. Sorting the fees column as a numerical type with errors at the top made the cells that needed tweaking easy to find:
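If you wanted to automate that sort of tidy-up rather than fix the cells by hand, a quick (and admittedly crude) Python sketch of pulling the minimum value out of a fee range might look like this:

def min_fee(cell):
    # "4,500-6,000" -> 4500.0, "9,000" -> 9000.0; anything else will raise an
    # error, which at least flags the rows that still need looking at by hand.
    return float(str(cell).split("-")[0].replace(",", "").strip())

print(min_fee("4,500-6,000"))   # 4500.0
print(min_fee("9,000"))         # 9000.0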

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?), so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs)): rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column, using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0].geometry.location.lat.

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for your course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can’t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE - apparently, HESA codes are in the spreadsheet.... ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)

Twitter & DataSift launch live social data services for under £1 (useful)

Journalists with an interest in realtime data should keep an eye on a forthcoming service from DataSift which promises to allow users to access a feed of Twitter tweets filtered along any combination of over 40 qualities.

In addition – and perhaps more interestingly – the service will also offer extra context:

“from services including Klout (influence metrics), PeerIndex (influence), Qwerly (linked social media accounts) and Lexalytics (text and sentiment analysis). Storage, post-processing and historical snapshots will also be available.”

The pricing puts this well within the reach of not only professional journalists but student ones too: for less than 20p per hour (30 cents) you will be able to apply as many as 10,000 keyword filters.

ReadWriteWeb describe a good example of how this may work out journalistically:

“Want a feed of negative Tweets written by C-level execs about any of 10,000 keywords? Trivial! Basic level service, Halstead says! Want just the Tweets that fit those criteria and are from the North Eastern United States? That you’ll have to pay a little extra for.”

Getting Started With Local Council Spending Data

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB large, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and that “clumped” transaction dates appear to be in use? Also note that each row, column and cell is commentable upon.

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe be worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council website were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this may be something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)
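Just to illustrate how trivial that combining step could be once the column headings line up, here’s a rough pandas sketch; the filenames and the “Supplier”/“Amount” column names are hypothetical, since real releases vary:

import pandas as pd

# Hypothetical monthly spending files, one per council, with matching headings.
files = {
    "Isle of Wight": "iow_spending.csv",
    "Some Other Council": "other_spending.csv",
}

frames = []
for council, path in files.items():
    df = pd.read_csv(path)
    df["Council"] = council          # add a column identifying the council
    frames.append(df)

spending = pd.concat(frames, ignore_index=True)

# e.g. total spend by supplier, compared across councils
totals = (spending.groupby(["Supplier", "Council"])["Amount"]
          .sum().unstack(fill_value=0))
print(totals.head())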

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog, has teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and release the OnTheWight Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
- compare the offerings with each other in more detail, e.g. the columns each view provides;
- compare the offerings with the guidance on release of council spending data;
- see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
- explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard

Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers

Where can I find a list of hospitals in the UK along with their location data? Or historical weather data for the UK? Or how do I find the county from a postcode, or a book title from its ISBN? And is there any way you can give me RDF Linked Data in a format I can actually use?!

With increasing amounts of data available, it can still be hard to:

- find the data you want;
- query a datasource to return just the data you want;
- get the data from a datasource in a particular format;
- convert data from one format to another (Excel to RDF, for example, or CSV to JSON);
- get data into a representation that means it can be easily visualised using a pre-existing tool.

In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of. Or maybe you know the data is in a Linked Data store on data.gov.uk, but you just can’t figure out how to get it out?

This is where GetTheData.org comes in. Get The Data arose out of a conversation between myself and Rufus Pollock at the end of last year, which resulted in Rufus setting up the site now known as getTheData.org.

getTheData.org

The idea behind the site is to field questions and answers relating to the practicalities of working with public open data: from discovering data sets, to combining data from different sources in appropriate ways, getting data into formats you can happily work with, or that will play nicely with visualisation or analysis tools you already have, and so on.

At the moment, the site is in its startup/bootstrapping phase, although there is already some handy information up there. What we need now are your questions and answers…

So, if you publish data via some sort of API or queryable interface, why not consider posting self-answered questions using examples from your FAQ?

If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel, and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

If you’re looking for data as part of a research project, but can’t find it or can’t get it in an appropriate form that lets you link it to another data set, post a question to GetTheData.

If you want to do some graphical analysis on a data set, but don’t know what tool to use, or how to get the data in the right format for a particular tool, that’d be a good question to ask too.

Which is to say: if you want to GetTheData, but can’t do so for whatever reason, just ask… GetTheData.org

A portal for European government data: PublicData.eu plans

The Open Knowledge Foundation have published a blog post with notes on a site they’re developing to gather together data from across Europe. The post notes the growth of data catalogues both at a national level (mentioning the Digitalisér.dk data portal run by the Danish National IT and Telecom Agency) and in “countless city level initiatives across Europe as well – from Helsinki to Munich, Paris to Zaragoza”, with many more initiatives “in the pipeline with plans to launch in the next 6 to 12 months.”

PublicData.eu will, it says:

“Provide a single point of access to open, freely reusable datasets from numerous national, regional and local public bodies throughout Europe.

“[It] will harvest and federate this information to enable users to search, query, process, cache and perform other automated tasks on the data from a single place. This helps to solve the “discoverability problem” of finding interesting data across many different government websites, at many different levels of government, and across the many governments in Europe.”

What is perhaps even more interesting for journalists is that the site plans to:

“Capture (proposed) edits, annotations, comments and uploads from the broader community of public data users.”

That might include anything from cleaner versions of data, to instances where developers match datasets together, or where users add annotations that add context to a particular piece of information.

Finally there’s a general indication that the site hopes to further lower the bar for data and collaborative journalism by:

“Providing basic data analysis and visualisation tools together with more in-depth resources for those looking to dig deeper into the data. Users will be able to personalise their data browsing experience by being able to save links and create notes and comments on datasets.”

More in the post itself. Worth keeping an eye on.

Games, systems and context in journalism at News Rewired

I went to News Rewired on Thursday, along with dozens of other journalists and folk concerned in various ways with news production. Some threads that ran through the day for me were discussions of how we publish our data (and allow others to do the same), how we link our stories together with each other and the rest of the web, and how we can help our readers to explore context around our stories.


Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put something of a query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and use that feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datastore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million
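The app itself uses Google’s chart tools for this, but just to make the “label column plus amount column” idea concrete, here’s an offline Python sketch of the same sort of chart; the filename and the “Supplier”/“Amount” column names are made up for the example:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical two-column query result: a label column and an amount column.
df = pd.read_csv("education_spend_query.csv")
big = df[df["Amount"] > 250000000].sort_values("Amount")

big.plot.barh(x="Supplier", y="Amount", legend=False)
plt.xlabel("Spend (£)")
plt.tight_layout()
plt.show()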

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)