Tag Archives: opencorporates

How we did it: investigating Nigerian football agents

Last year I was part of a team — with Yemisi Akinbobola and  Ogechi Ekeanyawu — that won a CNN MultiChoice African Journalist of the Year award for an investigation into Nigerian football agents. The project, funded by Journalismfund.eu, and also available in an immersive longform version, combined data journalism and networked production with on-the-ground reporting. Here are some of the lessons we drew from the project… Continue reading


Come help dig into data on football agents – and eat pizza


Pizzas and footballs are both round. That’s enough justification for us. Image: Adam Kuban

Next Friday (November 13) I’ll be holding a day full of activities in Birmingham for anyone interested in digging into the world of football agents.

You won’t need any special skills – you can take part by doing basic internet searches, or you can bring your data journalism mojo and play with the data we’ve already gathered.

And we’ll also have people on hand to show you a range of useful techniques if you’re interested.

The special hackday is part of an investigation I’m doing, and we’re also working with OpenCorporates as part of their #FlashHacks series to build open data on football agents, identifying the biggest operators and their ownership structures.

If you can’t make it for the whole day feel free to drop in for part of it. Oh, and there’ll be pizza, stickers and t-shirts.

You can sign up and find more details about location and timings here.

5 tips for a data journalism workflow: part 1 – data newswires and archiving

Earlier this year I spoke at the BBC’s Data Fusion Day (you can find a liveblog of the event on Help Me Investigate) about data journalism workflows. The presentation slides are embedded below (the title is firmly tongue-in-cheek), but I thought I’d explain a bit more in a series of posts – beginning here.

Data journalism workflow 1: Set up data newswires

Most newsrooms take a newswire of some sort – national and international news from organisations like the Press Association, Reuters, and Associated Press.

Data journalism is no exception. If you want to find stories in data, it helps to know what data is coming out, when it comes out.

Continue reading

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key [private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))

#note - the opencorporates api also offers a search:  companies/search

def getOCcompanyData(ocid):
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            director=desc.replace('DIRECTOR APPOINTED ','')
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')

for entity in entities['result']:

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)


for row in data:
    if row['fdirector'] not in directors:
    if row['ocname'] not in companies:

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note the parts of the whole graph are disconnected.

In the fragment below, we see Tesco Property Nominees are only losley linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is arbitrary?

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data: http://api.opencorporates.com/companies/gb/00102498/data

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form http://api.opencorporates.com/data/2601371, which we can generate as follows:

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the “representative_name_lines (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!)

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

  • value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
  • value.parseJson().datum.attributes.mark_text
  • value.parseJson().datum.attributes.representative_address_lines
  • value.parseJson().datum.attributes.representative_name_lines
  • value.parseJson().datum.attributes.international_registration_number

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if( or(value.parseJson().datum.attributes.mark_image_type==’JPG’, value.parseJson().datum.attributes.mark_image_type==’GIF’), ‘http://www.wipo.int/romarin/images/&#8217; + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + ‘/’ + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + ‘/’ + value.parseJson().datum.attributes.international_registration_number + ‘.’ + toLowercase (value.parseJson().datum.attributes.mark_image_type), ”)

The first part checks that we have a GIF or JPG image type identified, and if it does, then we construct the URL path, and finally cast the filetype to lower case, else we return an empty string.

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the fie in a browser, or I can cut and paste the html into this page…

null null
null null
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
“[“Murgitroyd & Company”]“ “[“Scotland House,”,”165-169 Scotland Street”,”Glasgow G5 8PL”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
“[“BP Group Trade Marks”]“ “[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“BP Group Trade Marks”]“ “[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
“[“ROBERT WILLIAM BOAD”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad number of legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere (so for example, could all through steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer of virtual machine for the script to make use of?)

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.

Making it easier to join the dots of government: publicbodies.org


publicbodies.org - jargon translation: this could be very useful

If you deal with information on government departments you may want to offer your help in improving a new project that aims to make it easier to combine government data.

Publicbodies.org is attempting to do for government data what OpenCorporates does for company data: create unique resources that allow you to distinguish between similar-sounding departments, and grab extra contextual data along the way.

Created at last week’s Open Government Data Camp in Warsaw, the project currently contains basic data on German, UK and EU public bodies.

In a post introducing the project, Friedrich Lindenberg explains how the seed data for the site was compiled from freedom of information request sites such as WhatDoTheyKnow and AskTheEU.

The project still needs data on government departments in other countries, however.

This is a promising service which already includes a reconciliation service for Google Refine (in other words, if you have a spreadsheet that mentions government departments, you can relatively easily bring in extra data in just a few clicks).

And news organisations wanting to steal a march on their rivals on this front should seriously consider contributing some time to making it better.

Anyone wanting to help can comment on the blog post or find Friedrich @pudo on Twitter.

All the news that’s fit to scrape

Channel 4/Scraperwiki collaboration

There have been quite a few scraping-related stories that I’ve been meaning to blog about – so many I’ve decided to write a round up instead. It demonstrates just the increasing role that scraping is playing in journalism – and the possibilities for those who don’t know them:

Scraping company information

Chris Taggart explains how he built a database of corporations which will be particularly useful to journalists and anyone looking at public spending:

“Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia) … In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.”

OpenCorporates are also offering a bounty for programmers who can scrape company information from other jurisdictions.

Scraperwiki on the front page of The Guardian…

The Scraperwiki blog gives the story behind a front page investigation by James Ball on lobbyist influence in the UK Parliament: Continue reading