Monthly Archives: March 2012

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data: http://api.opencorporates.com/companies/gb/00102498/data

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form http://api.opencorporates.com/data/2601371, which we can generate as follows:
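Outside Refine, the same URL construction is just string concatenation on the datum id. A minimal Python sketch, where the function name `data_item_url` is a made-up helper rather than part of any OpenCorporates library, and the base URL is the one quoted above:

```python
# Base URL as quoted in the post; the live API may have changed since.
BASE_URL = "http://api.opencorporates.com/data/"


def data_item_url(datum_id):
    """Build the API URL for a single data item from its datum id."""
    return BASE_URL + str(datum_id).strip()


print(data_item_url(2601371))
```

Applied to every value in the datum – id column, this gives one URL per trademark record to fetch.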

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the representative_name_lines (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and, most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!).

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
value.parseJson().datum.attributes.mark_text
value.parseJson().datum.attributes.representative_address_lines
value.parseJson().datum.attributes.representative_name_lines
value.parseJson().datum.attributes.international_registration_number
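For comparison, here is roughly what the expressions above do, sketched in Python. The only thing taken from the post is the path datum.attributes.<field>; the sample record is invented for illustration, and `extract_attributes` is a hypothetical helper name.

```python
import json

# The five fields pulled out into separate columns in the post.
FIELDS = [
    "mark_image_type",
    "mark_text",
    "representative_address_lines",
    "representative_name_lines",
    "international_registration_number",
]


def extract_attributes(raw_json):
    """Parse one data item response and return the trademark fields.

    Fields that are absent from the record come back as None.
    """
    attributes = json.loads(raw_json).get("datum", {}).get("attributes", {})
    return {field: attributes.get(field) for field in FIELDS}


# Invented sample record, shaped like datum.attributes.<field>.
sample = json.dumps(
    {
        "datum": {
            "attributes": {
                "mark_image_type": "JPG",
                "mark_text": "EXAMPLE",
                "international_registration_number": "123456",
            }
        }
    }
)
row = extract_attributes(sample)
print(row)
```

Parsing once and reading several keys avoids re-parsing the JSON for every column, which is what repeating value.parseJson() in each expression amounts to.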

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
http://www.wipo.int/romarin/images/XX/YY/XXYYNN.typ
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if(or(value.parseJson().datum.attributes.mark_image_type=='JPG', value.parseJson().datum.attributes.mark_image_type=='GIF'), 'http://www.wipo.int/romarin/images/' + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + '/' + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + '/' + value.parseJson().datum.attributes.international_registration_number + '.' + toLowercase(value.parseJson().datum.attributes.mark_image_type), '')

The first part checks that we have a GIF or JPG image type identified; if we do, we construct the URL path from the first two pairs of digits of the registration number and cast the filetype to lower case for the file extension; otherwise we return an empty string.
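The same logic can be sketched in Python, which may be easier to read than the one-line GREL. This assumes the URL pattern guessed at above (first two digits, next two digits, then the full number) and uses a made-up registration number, 123456, purely for illustration; `wipo_image_url` is a hypothetical helper name.

```python
def wipo_image_url(reg_number, image_type):
    """Return the guessed WIPO image URL for a trademark.

    Returns an empty string when the mark is not a GIF/JPG image
    or when there is no registration number to build a path from.
    """
    if image_type not in ("JPG", "GIF") or not reg_number:
        return ""
    reg_number = str(reg_number)
    return "http://www.wipo.int/romarin/images/{}/{}/{}.{}".format(
        reg_number[0:2],     # XX: first two characters
        reg_number[2:4],     # YY: next two characters
        reg_number,          # full registration number as the file name
        image_type.lower(),  # jpg or gif
    )


print(wipo_image_url("123456", "JPG"))
```

As with the GREL version, nothing here checks that the image actually exists at the generated address.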

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:
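A rough Python equivalent of this kind of template, one table row per record with the image in the first cell, might look like the sketch below. It is not the Refine template itself; the row keys `image_url`, `name_lines` and `address_lines` are hypothetical names for the columns created earlier, and the sample rows are invented.

```python
from html import escape


def preview_html(rows):
    """Render the rows that have an image URL as a simple HTML table."""
    lines = ["<table>"]
    for row in rows:
        if not row.get("image_url"):
            continue  # skip trademarks without an image
        lines.append(
            '<tr><td><img src="{}"/></td><td>{}</td><td>{}</td></tr>'.format(
                escape(row["image_url"], quote=True),
                escape(str(row.get("name_lines"))),
                escape(str(row.get("address_lines"))),
            )
        )
    lines.append("</table>")
    return "\n".join(lines)


# Invented rows: one with an image URL, one without.
rows = [
    {
        "image_url": "http://www.wipo.int/romarin/images/12/34/123456.jpg",
        "name_lines": ["EXAMPLE & CO"],
        "address_lines": ["1 Example Street"],
    },
    {"image_url": "", "name_lines": None, "address_lines": None},
]
page = preview_html(rows)
print(page)
```

Escaping the cell values matters here because names such as "MURGITROYD & COMPANY" contain characters that are special in HTML.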

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the file in a browser, or I can cut and paste the html into this page…

	null	null
	null	null
	“[“MURGITROYD & COMPANY”]“	“[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
	“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
	“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
	“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
	“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
	“[“BP GROUP TRADE MARKS”]“	“[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
	“[“Murgitroyd & Company”]“	“[“Scotland House,”,”165-169 Scotland Street”,”Glasgow G5 8PL”]“
	“[“BP GROUP TRADE MARKS”]“	“[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
	“[“BP Group Trade Marks”]“	“[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
	“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
	“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
	“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
	“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
	“[“MURGITROYD & COMPANY”]“	“[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
	“[“MURGITROYD & COMPANY”]“	“[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
	“[“MURGITROYD & COMPANY”]“	“[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
	“[“MURGITROYD & COMPANY”]“	“[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
	“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
	“[“BP Group Trade Marks”]“	“[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
	“[“ROBERT WILLIAM BOAD”,”GROUP TRADE MARKS”]“	“[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
	“[“BP GROUP TRADE MARKS”]“	“[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move on to code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere. So, for example, could all the steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer or virtual machine for the script to make use of?

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.
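One way this might be approached outside Refine is sketched below in Python. It assumes, without having checked the real data, that goods_and_services_classifications is a list of objects that each carry an "en" key; the sample record and the helper name `english_classifications` are invented for illustration.

```python
def english_classifications(attributes):
    """Collect the 'en' value from each classification item that has one.

    Assumes attributes["goods_and_services_classifications"] is a list of
    dicts; items of any other shape are skipped.
    """
    items = attributes.get("goods_and_services_classifications") or []
    return [
        item["en"]
        for item in items
        if isinstance(item, dict) and "en" in item
    ]


# Invented attributes block in the assumed shape.
sample_attributes = {
    "goods_and_services_classifications": [
        {"en": "Fuels and lubricants", "fr": "Combustibles et lubrifiants"},
        {"fr": "Sans traduction"},
        {"en": "Retail services"},
    ]
}
print(english_classifications(sample_attributes))
```

If the real field turns out to be keyed differently (for example, a dictionary keyed by class number), the list comprehension would need adjusting accordingly.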

FAQ: Trusting ‘the blogosphere’

Note: for those coming from Poynter’s summary of part of this post, the phrase ‘don’t have to be trained’ has an ambiguity that could be misunderstood. I’ve expanded on the relevant section to clarify.

Another set of answers to another set of questions (FAQs). These are posed by a UK university student:

How would you define the blogosphere?

The blogosphere is, technically, all blogs – but those don’t often have much connection to each other. I think it’s better to talk of many ‘blogospheres’ around different topics, e.g. the political blogosphere and so on. Continue reading →

VIDEO: How to track people online

Guardian to act as platform for arts organisations

Teaching entrepreneurial journalism: the elephant in the room – editorial independence

How many journalism students see editorial's encounter with commerce. Image by Scot A. Harvest

There’s a wonderfully written post on Sean Blanda’s blog about fixing entrepreneurial journalism courses. Unusually, the post demonstrates a particularly acute understanding of the dynamics involved in teaching (Lesson One, based on my experience of teaching ‘strategic learners’, strikes me as a particularly effective tactic*, while Lesson Two addresses the most common problem in students’ ideas: vagueness, or ‘mass marketism’).

But it also reminded me of a conversation I had recently about journalism students’ reactions to being taught entrepreneurialism – and the one lesson that’s missing from Sean’s list.

It’s this lesson: “Why?” Continue reading →

Where chasing traffic meets “important” journalism: Gawker’s experiment

Nieman reports on a fascinating experiment in traffic-chasing content from Gawker which provides all sorts of insights into just how valuable that content is, and where it sits in the wider editorial mix. Here’s what they did:

“Each day for two weeks, a [different] single staff writer would be assigned “traffic-whoring duty.”” Continue reading →

How to investigate Wikipedia edits

Ian Silvera (@ianjsilvera) gives a step-by-step guide on how to find out who’s behind changes on a Wikipedia page. Cross-posted from the Help Me Investigate blog.

First, click on the ‘view history’ tab at the top right of the Wikipedia entry you are interested in. You should then be directed to a page that lists all the edits that have occurred on that entry. It looks like this: http://en.wikipedia.org/w/index.php?title=Paul_Bradshaw_(journalist)&action=history

Second, to identify if someone has been deleting unhelpful criticisms of an organisation or person on their Wikipedia entry, you could read through each edit, but with large Wikipedia entries this exercise would be too time-consuming. Instead, look for large redactions. Continue reading →

ITV News’s new website – welcome to the news stream