Quick and Dirty Recipe: Merging (Concatenating) Multiple CSV files (ODA Spending)

There’s been a flurry of tweets over the last few days about LOCOG’s exemption from FOI (example LOCOG response to an FOI request), but the Olympic Delivery Authority (ODA, one of the owner stakeholders) is rather more open, and publishes its spends over £25k: ODA Finance: Transparency Reports.

CSV files containing monthly spend are available from the site, and appear to use a consistent format each time (I think…). For what it’s worth, here’s a pragmatic, though not ideal, Mac/Linux/Unix command-line recipe for generating a single file containing all this data.

  1. Right-click and download each of the CSV files on the page to the same directory (eg odaSpending) on your local machine. (There are easier ways of doing this – I tried wget on the command line, but got an Access Denied response (workaround anyone?); there are probably more than a few browser extensions/plugins that will also download all the files linked to from a page. If you use one of those, you only need the csv files; if it grabs everything, copy just the csv files to a new directory from the command line: eg mkdir csvfiles; cp *.csv csvfiles)
  2. On the command line, change directory to the files directory – eg cd odaSpending/csvfiles; then join all the files together: cat *.csv > ../odaspending.csv (writing the output to the parent directory means that if you run the command again, odaspending.csv won’t be picked up as one of its own inputs)
  3. You should now have a big file, odaspending.csv, containing all the data, although it will also contain multiple header rows (each csv file had its own header row). Open the file in a text editor (I use TextWrangler) and copy from the start of the first line to the start of the second (ie the header row, including its end-of-line character). Do a Find on that header text and a global Replace with nothing. If that also removed the header from the first row, paste it back in at the top of the file.
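For anyone who would rather stay on the command line, steps 2 and 3 could be sketched roughly as below. This is only a sketch, assuming every file starts with exactly the same header line and that no data row is identical to that header; the function name merge_same_header is made up for illustration and is not part of the recipe above.

```shell
#!/bin/sh
# Sketch: join every CSV in a directory and keep a single header row.
# ASSUMPTIONS: all files share an identical first line, and no data row
# is exactly equal to that header. merge_same_header is a hypothetical name.
merge_same_header() {
    dir=$1    # directory holding the downloaded CSV files
    out=$2    # output file; keep it outside $dir so it is not re-read as input

    # Take the header from the first CSV file the shell glob returns.
    for f in "$dir"/*.csv; do
        header=$(head -n 1 "$f")
        break
    done

    # Write the header once, then every line that is not exactly the header.
    # grep: -v inverts the match, -x matches whole lines, -F treats the
    # header as a fixed string rather than a regular expression.
    {
        printf '%s\n' "$header"
        cat "$dir"/*.csv | grep -vxF -e "$header"
    } > "$out"
}
```

It might be called as merge_same_header odaSpending/csvfiles odaspending.csv. Like the editor-based find and replace, it will misbehave if a file is missing its final end-of-line character, because cat then glues that file's last row onto the next file's header.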

To turn the data file into something you can explore more interactively, upload it to something like Google Fusion Tables, as I did here (data to May 2012): ODA Spending in Google Fusion Tables

Note that this recipe is a pragmatic one. Unix gurus would surely be able to work out far more efficient scripts that concatenate the files after stripping out the header in all but the first file, for example, or that maybe even check the columns are the same etc etc. But if you want something quick and dirty, this is one way of doing it… (Please feel free to add alternative recipes for achieving the same thing in the comments…)
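As one possible take on the "strip the header from all but the first file, and check the columns match" idea, here is a hedged awk sketch. It assumes "same columns" can be approximated by "the first line of every file is character-for-character identical", which is stricter than strictly necessary; merge_csv_checked is a hypothetical name chosen for this example.

```shell
#!/bin/sh
# Sketch: concatenate CSV files, keeping only the first file's header and
# refusing to continue if a later file has a different header line.
# ASSUMPTION: identical first lines stand in for "the columns are the same".
# merge_csv_checked is a hypothetical helper name.
merge_csv_checked() {
    out=$1    # output file (first argument)
    shift     # the remaining arguments are the input CSV files

    awk '
        FNR == 1 {                      # first line of each input file
            if (NR == 1) {              # very first line overall: keep it
                header = $0
                print
            } else if ($0 != header) {  # later header differs: give up
                printf "header mismatch in %s\n", FILENAME > "/dev/stderr"
                exit 1
            }
            next                        # never print a repeated header
        }
        { print }                       # every other line is data
    ' "$@" > "$out" || { rm -f "$out"; return 1; }
}
```

Usage might look like merge_csv_checked ../odaspending.csv *.csv from inside the csvfiles directory. On a mismatch the partial output is deleted and the function returns a non-zero status, so a script can stop rather than upload a half-merged file.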

PS here’s an example of one sort of report you can then create in Fusion Tables – ODA spend with G4S; here’s another: Seconded staff
