There have been quite a few scraping-related stories that I’ve been meaning to blog about, so many that I’ve decided to write a round-up instead. Together they demonstrate the increasing role that scraping is playing in journalism, and the possibilities for those who don’t yet know about it:
Scraping company information
Chris Taggart explains how he built a database of corporations, which will be particularly useful to journalists and anyone looking at public spending:
“Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia) … In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.”
OpenCorporates are also offering a bounty for programmers who can scrape company information from other jurisdictions.
ScraperWiki on the front page of The Guardian…
“James Ball’s story is helped and supported by a ScraperWiki script that took data from registers across parliament that is located on different servers and aggregates them into one source table that can be viewed in a spreadsheet or document. This is now a living source of data that can be automatically updated. http://scraperwiki.com/scrapers/all_party_groups/
“Journalists can put down markers that run and update automatically and they can monitor the data over time with the objective of holding ‘power and money’ to account. The added value of this technique is that in one step the data is represented in a uniform structure and linked to the source thus ensuring its provenance. The software code that collects the data can be inspected by others in a peer review process to ensure the fidelity of the data.”
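To make the idea in that quote concrete, here is a minimal sketch in Python of pulling records from several registers into one uniform table, with each row linked back to its source. It is not the ScraperWiki script itself: the URLs, field names and sample rows are all invented for illustration, and the inline data stands in for pages that a real scraper would download.

```python
import sqlite3

# Hypothetical rows standing in for pages fetched from two separate registers.
# The URLs, column names and values are invented for illustration.
SOURCES = {
    "https://example.org/commons/register": [
        {"Group": "Example Group A ", "Benefactor": "Acme Ltd"},
    ],
    "https://example.org/lords/register": [
        {"group_name": "Example Group B", "donor": " Widget plc"},
    ],
}

# Map each source's own column names onto one shared schema.
FIELD_MAPS = {
    "https://example.org/commons/register": {"Group": "group_name", "Benefactor": "donor"},
    "https://example.org/lords/register": {"group_name": "group_name", "donor": "donor"},
}


def normalise(row, field_map):
    """Rename a source row's fields to the shared schema and trim whitespace."""
    return {target: row[source].strip() for source, target in field_map.items()}


def build_table(conn, sources, field_maps):
    """Load every source into one table, recording where each row came from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS groups ("
        "group_name TEXT, donor TEXT, source_url TEXT, "
        "UNIQUE(group_name, donor, source_url))"
    )
    for url, rows in sources.items():
        for row in rows:
            rec = normalise(row, field_maps[url])
            # INSERT OR IGNORE lets the scraper be re-run without duplicating rows.
            conn.execute(
                "INSERT OR IGNORE INTO groups VALUES (?, ?, ?)",
                (rec["group_name"], rec["donor"], url),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_table(conn, SOURCES, FIELD_MAPS)
build_table(conn, SOURCES, FIELD_MAPS)  # a second run adds nothing new
rows = conn.execute(
    "SELECT group_name, donor, source_url FROM groups ORDER BY group_name"
).fetchall()
```

Keeping the source URL in every row is what gives the data its provenance, and the uniqueness constraint is one simple way to let the same script run repeatedly as the registers change.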
…and on Channel 4’s Dispatches
“ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns … The first is a bubble chart of what central Government owns. The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.
“The second is a map of brownfield land owned by local councils in England … The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. It’s quite interesting that a dataset gathered for purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist of use of data that really requires understanding of the source of the data.”
Which chiropractors were making “bogus” claims?
This is an example from last summer. Following the Simon Singh case, Simon Perry wrote a script to check which chiropractors were making the same “bogus claims” that Singh was being sued over:
“The BCA web site lists all its 1029 members online, including for many of them, about 400 web site URLs. I wrote a quick computer program to download the member details, record them in a database and then download the individual web sites. I then searched the data for the word ‘colic’ and then manually checked each site to verify that the chiropractors were either claiming to treat colic, or implying that chiropractic was an efficacious treatment for it. I found 160 practices in total, with around 500 individual chiropractors.
“The final piece in the puzzle was a simple mail-merge. Not wanting to simultaneously report several quacks to the same Trading Standards office, I limited the mail-merge to one per authority and sent out 84 letters.
“On the 10th, the science blogs went wild when Le Canard Noir published a very amusing email from the McTimoney Chiropractic Association, advising their members to take down their web site. It didn’t matter, I had copies of all the web sites.”
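The two automated steps Perry describes, searching the saved sites for a keyword and then limiting the mail-merge to one letter per authority, can be sketched roughly as follows. This is not his program: the practice names, authorities and page text are made up, and the saved HTML is held inline rather than downloaded.

```python
import re

# Hypothetical saved copies of practice web sites, with the Trading Standards
# authority each practice falls under. All values are invented for illustration.
PRACTICES = [
    {"name": "Practice A", "authority": "Authority 1", "html": "<p>We treat infant colic.</p>"},
    {"name": "Practice B", "authority": "Authority 1", "html": "<p>Help with Colic and sleep.</p>"},
    {"name": "Practice C", "authority": "Authority 2", "html": "<p>Back pain clinic.</p>"},
    {"name": "Practice D", "authority": "Authority 2", "html": "<p>Advice on colic.</p>"},
]

# Whole-word, case-insensitive match for the search term.
KEYWORD = re.compile(r"\bcolic\b", re.IGNORECASE)


def candidates(practices):
    """Return practices whose saved page mentions the keyword.

    This only flags pages; each one still needs checking by hand.
    """
    return [p for p in practices if KEYWORD.search(p["html"])]


def one_per_authority(practices):
    """Keep the first flagged practice for each authority, in input order."""
    chosen = {}
    for p in practices:
        chosen.setdefault(p["authority"], p)
    return list(chosen.values())


flagged = candidates(PRACTICES)
letters = one_per_authority(flagged)
flagged_names = [p["name"] for p in flagged]
letter_names = [p["name"] for p in letters]
```

A keyword match is only a filter to narrow down what needs reading, which is why the manual verification step in the quote still matters. Working from saved copies also means the evidence survives if the live sites are later taken down.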