Category Archives: online journalism

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key [private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OCKEY"]
except:
    ockey=''

rurl='http://opencorporates.com/reconcile/gb?query=tesco'
#note - the opencorporates api also offers a search:  companies/search
entities=simplejson.load(urllib.urlopen(rurl))

def getOCcompanyData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/data'+'?api_token='+ockey
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?per_page=100&api_token='+ockey
    tmpdata=simplejson.load(urllib.urlopen(ocurl))
    ocdata=tmpdata['filings']
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        page=str(tmpdata['page']+1)
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
        ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?page='+page+'&per_page=100&api_token='+ockey
        tmpdata=simplejson.load(urllib.urlopen(ocurl))
        ocdata=ocdata+tmpdata['filings']
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    ddata={}
    ddata['ocname']=ocname
    ddata['ocid']=ocid
    ddata['fdesc']=ffiling["description"]
    ddata['fdirector']=director
    ddata['fdate']=ffiling["date"]
    ddata['fid']=ffiling["id"]
    ddata['ftyp']=ffiling["filing_type"]
    ddata['fcode']=ffiling["filing_code"]
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            desc=filing["filing"]["description"]
            director=desc.replace('DIRECTOR APPOINTED ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            desc=filing["filing"]["description"]
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)

for entity in entities['result']:
    ocid=entity['id']
    ocname=entity['name']
    filings=getOCfilingData(ocid)
    logDirectors(ocname,ocid,filings)

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)

DG=nx.DiGraph()

directors=[]
companies=[]
for row in data:
    if row['fdirector'] not in directors:
        directors.append(row['fdirector'])
        DG.add_node(directors.index(row['fdirector']),label=row['fdirector'],name=row['fdirector'])
    if row['ocname'] not in companies:
        companies.append(row['ocname'])
        DG.add_node(row['ocid'],label=row['ocname'],name=row['ocname'])   
    DG.add_edge(directors.index(row['fdirector']),row['ocid'])

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(DG)

print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note the parts of the whole graph are disconnected.

In the fragment below, we see Tesco Property Nominees are only losley linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is arbitrary?

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});

    google.setOnLoadCallback(drawTable);

    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?
    //json_table.setColumns([0,1,2,3,4,5,6,7])

    var formatter = new google.visualization.PatternFormat('<a href="http://www.bbc.co.uk/programmes/{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'
      }
    });

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);

    }

The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for sometime whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL='http://opencorporates.com/reconcile?query='+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    try:
        recData=simplejson.load(urllib.urlopen(ocrecURL))
    except:
        recData={'result':[]}
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:
            record['ocData']=recData['result'][0]
            record['ocID']=recData['result'][0]['uri']
            record['ocName']=recData['result'][0]['name']

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)

As @stuartbrown suggeted in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn LEarning Outcomes Search

Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API

In Visualising Networks in Gephi via a Scraperwiki Exported GEXF File I gave an example of how we can publish arbitrary serialised output file formats from Scraperwiki using the GEXF XML file format as a specific example. Of more general use, however, may be the ability to export Scraperwiki data using the Google visualisation API DataTable format. Muddling around the Google site last night, I noticed the Google Data Source Python Library that makes it easy to generate appropriately formatted JSON data that can be consumed by the (client side) Google visualisation library. (This library provides support for generating line charts, bar charts, sortable tables, etc, as well as interactive dashboards.) A tweet to @frabcus questioning whether the gviz_api Python library was available as a third party library on Scraperwiki resulted in him installing it (thanks, Francis:-), so this post is by way of thanks…

Anyway, here are a couple of examples of how to use the library. The first is a self-contained example (using code pinched from here) that transforms the data into the Google format and then drops it into an HTML page template that can consume the data, in this case displaying it as a sortable table (GViz API on scraperwiki – self-contained sortable table view [code]):

Of possibly more use in the general case is a JSONP exporter (example JSON output (code)):

Here’s the code for the JSON feed example:

import scraperwiki
import gviz_api

#Example of:
## how to use the Google gviz Python library to cast Scraperwiki data into the Gviz format and export it as JSON

#Based on the code example at:
#http://code.google.com/apis/chart/interactive/docs/dev/gviz_api_lib.html

scraperwiki.sqlite.attach( 'openlearn-units' )
q = 'parentCourseCode,name,topic,unitcode FROM "swdata" LIMIT 20'
data = scraperwiki.sqlite.select(q)

description = {"parentCourseCode": ("string", "Parent Course"),"name": ("string", "Unit name"),"unitcode": ("string", "Unit Code"),"topic":("string","Topic")}

data_table = gviz_api.DataTable(description)
data_table.LoadData(data)

json = data_table.ToJSon(columns_order=("unitcode","name", "topic","parentCourseCode" ),order_by="unitcode")

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
print 'ousefulHack('+json+')'

I hardcoded the wraparound function name (ousefulHack), which then got me wondering: is there a safe/trusted/approved way of grabbing arguments out of the URL in Scraperwiki so this could be set via a calling URL?

Anyway, what this shows (hopefully) is an easy way of getting data from Scraperwiki into the Google visualisation API data format and then consuming either via a Scraperwiki view using an HTML page template, or publishing it as a Google visualisation API JSONP feed that can be consumed by an arbitrary web page and used direclty to drive Google visualisation API chart widgets.

PS as well as noting that the gviz python library “can be used to create a google.visualization.DataTable usable by visualizations built on the Google Visualization API” (gviz_api.py sourcecode), it seems that we can also use it to generate a range of output formats: Google viz API JSON (.ToJSon), as a simple JSON Response (. ToJSonResponse), as Javascript (“JS Code”) (.ToJSCode), as CSV (.ToCsv), as TSV (.ToTsvExcel) or as an HTML table (.ToHtml). A ToResponse method (ToResponse(self, columns_order=None, order_by=(), tqx=””)) can also be used to select the output response type based on the tqx parameter value (out:json, out:csv, out:html, out:tsv-excel).

PPS looking at eg https://spreadsheets.google.com/tq?key=rYQm6lTXPH8dHA6XGhJVFsA&pub=1 which can be pulled into a javascript google.visualization.Query(), it seems we get the following returned:
google.visualization.Query.setResponse({"version":"0.6","status":"ok","sig":"1664774139","table":{ "cols":[ ... ], "rows":[ ... ] }})
I think google.visualization.Query.setResponse can be a user defined callback function name; maybe worth trying to implement this one day?

Visualising Networks in Gephi via a Scraperwiki Exported GEXF File

How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import the a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.

A quick bit of backstory first…

A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue

So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:

The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).

As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:

The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).

From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, netwrokx will create it for you.)

I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry an newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)

To keep gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)

Here’s how I built the graph:

import scraperwiki
import urllib
import networkx as nx

scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)

G=nx.Graph()

topics=[]
for row in data:
    G.add_node(row['unitcode'],label=row['unitcode'],name=row['name'],parentCC=row['parentCourseCode'])
    topic=row['topic']
    if topic not in topics:
        topics.append(topic)
    tID=topics.index(topic)
    topicID='topic_'+str(tID)
    G.add_node(topicID,label=topic,name=topic)     
    G.add_edge(topicID,row['parentCourseCode'])
    G.add_edge(row['unitcode'],row['parentCourseCode'])

Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for scraperwiki, we want to just print out a representation of the file, not actually save the printed representation of the graph to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.

Here’s the code I hacked out as a result of that little investigation:

import networkx.readwrite.gexf as gf

writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(G)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

Note the use of the scraperwiki.utils.httpresponseheader to set the MIMEtype of the view. If we don’t do this, scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.

Here’s the full code for the view.

And here’s the GEXF view:

Save this file with a .gexf suffix and you can then open the file directly into Gephi.

Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.

PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.

University Funding – A Wider View

A post on the Guardian Datablog yesterday (Higher education funding: which institutions will be affected?) alerted me to the release of HEFCE’s “provisional allocations of recurrent funding for teaching and research, and the setting of student number control limits for institutions, for academic year 2012-13″ (funding data).

Here are the OU figures for teaching:

Funding for old-regime students (mainstream) Funding for old-regime students (co-funding) High cost funding for new-regime students Widening participation Teaching enhancement and student success Other targeted allocations Other recurrent teaching grants Total teaching funding
59,046,659 0 2,637,827 23,273,796 17,277,704 22,619,320 3,991,473 128,846,779

HEFCE preliminary teaching funding allocations to the Open University, 2012-13

Of the research funding for 2012-13, mainstream funding was 8,030,807, the RDP supervision fund came in at 1,282,371, along with 604,103 “other”, making up the full 9,917,281 research allocation.

Adding Higher Education Innovation Funding of 950,000, the OU’s total allocation was 139,714,060.

So what other funding comes into the universities from public funds?

Open Spending publishes data relating to spend by government departments to named organisations, so we can search that for data spent by government departments with the universities (for example, here is a search on OpenSpending.org for “open university”:

Given the amounts spent by public bodies on consultancy (try searching OpenCorporates for mentions of PriceWaterhouseCoopers, or any of EDS, Capita, Accenture, Deloitte, McKinsey, BT’s consulting arm, IBM, Booz Allen, PA, KPMG (h/t @loveitloveit)), university based consultancy may come in reasonably cheaply?

The universities also receive funding for research via the UK research councils (EPSRC, ESRC, AHRC, MRC, BBSRC, NERC, STFC) along with innovation funding from JISC. Unpicking the research council funding awards to universities can be a bit of a chore, but scrapers are appearing on Scraperwiki that make for easier access to individual grant awards data:

  • AHRC funding scraper; [grab data using queries of the form select * from `swdata` where organisation like "%open university%" on scraper arts-humanities-research-council-grants]
  • EPSRC funding scraper; [grab data using queries of the form select * from `grants` where department_id in (select distinct id as department_id from `departments` where organisation_id in (select id from `organisations` where name like "%open university%")) on scraper epsrc_grants_1]
  • ESRC funding scraper; [grab data using queries of the form select * from `grantdata` where institution like "%open university%" on scraper esrc_research_grants]
  • BBSRC funding [broken?] scraper;
  • NERC funding [broken?] scraper;
  • STFC funding scraper; [grab data using queries of the form select * from `swdata` where institution like "%open university%" on scraper stfc-institution-data]

In order to get a unified view over the detailed funding of the institutions from these different sources, the data needs to be reconciled. There are several ID schemes for identifying universities (eg UCAS or HESA codes; see for example GetTheData: Universities by Mission Group) but even official data releases tend not make use of these, preferring instead to rely solely on insitution names, as for example in the case of the recent HEFCE provisional funding data release [DOh! This is not the case – identifiers are there, apparently (I have to admit, I didn’t check and was being a little hasty… See the contribution/correction from David Kernohan in the comments to this post…]:

For some time, I’ve been trying to put my finger on why data releases like this are so hard to work with, and I think I’ve twigged it… even when released in a spreadsheet form, the data often still isn’t immediately “database-ready” data. Getting data from a spreadsheet into a database often requires an element of hands-on crafting – coping with rows that contain irregular comment data, as well as handling columns or rows with multicolumn and multirow labels. So here are a couple of things that would make life easier in the short term, though they maybe don’t represent best practice in the longer term…:

1) release data as simple CSV files (odd as it may seem), because these can be easily loaded into applications that can actually work on the data as data. (I haven’t started to think too much yet about pragmatic ways of dealing with spreadsheets where cell values are generated by formulae, because they provide an audit trail from one data set to derived views generated from that data.)

2) have a column containing regular identifiers using a known identification scheme, for example, HESA or UCAS codes for HEIs. If the data set is a bit messy, and you can only partially fill the ID column, then only partially fill it; it’ll make life easier joining those rows at least to other related datasets…

As far as UK HE goes, the JISC monitoring unit/JISCMU has a an api over various administrative data elements relating to UK HEIs (eg GetTheData: Postcode data for HE and FE institutes, but I don’t think it offers a Google Refine reconciliation service, (ideally with some sort of optional string similarity service)…? Yet?! 😉 maybe that’d make for a good rapid innovation project???

PS I’m reminded of a couple of related things: Test Your RESTful API With YQL, a corollary to the idea that you can check your data at least works by trying to use it (eg generate a simple chart from it) mapped to the world of APIs: if you can’t easily generate a YQL table/wrapper for it, it’s maybe not that easy to use? 2) the scraperwiki/okf post from @frabcus and @rufuspollock on the need for data management systems not content management systems.

PPS Looking at the actual Guardian figures reveals all sorts of market levers appearing… Via @dkernohan, FT: A quiet Big Bang in universities

Reduced Relevance – the downside of social, mobile news

Facebook Activity Plugin

News moves so quickly that your Facebook ‘friends’ just can’t keep up.

In a guest post for OJB, Neil Thurman highlights a new research report that suggests the increased availability of news on mobile platforms, and its harnessing of social networks—like Facebook—to power recommendations, comes at a price: stories that are less relevant to readers’ interests than those recommended by editors and found on news providers’ traditional websites.

Continue reading

6 ways to get started in community management

Following on from my previous post on the network journalist role, and as part of a wider experiment around the 5 roles in an investigations team, I wanted to flesh out what exactly a community editor role means when adopted as part of a journalism project.

First I need to add a disclaimer: the terms “community editor” (CE) and “community manager” (CM) are used to refer to a very wide range of jobs in a number of industries. I’m not sure what distinction there is – if any – but my hunch is that the title ‘community editor’ has been overtaken by its ‘manager’ variation because it rightly places the focus more on the community than its content.

Even within journalism, the role can vary enormously. This is partly because the communities themselves, and the challenges that they represent, differ so much. For example:

  • A mass market – anonymous and diverse, which a CM must try to somehow ‘convert’ into one or more healthy niche communities
  • No community – the CM is asked to ‘build it from scratch’
  • The CM’s website(s) already have active communities. The CM’s role is to maintain, support, and further develop those.
  • Communities exist, but not on the website(s) of the CM’s employer – the CM is asked to engage with those (this is more of a Community Editor role)

This post will be dealing with the last situation, which is the one in which most journalists first find themselves: with neither a platform nor a community.

With that out of the way, here are 6 things I think an individual can do as part of their first foray into community management/editing as a journalist:

1. Know where the communities are

This seems like a no-brainer but it’s all too easy to miss communities because you can’t find any evidence of them on Twitter or Facebook.

Along with specialist social networks (LinkedIn for professionals; MySpace for musicians; even profession-specific networks such as Doctors.net), there are forums, wikis, mailing lists, and various other places where people gather to share information and support.

The social media prism by Brian Solis (shown below) is one useful tool for checking if you’ve covered every possible angle on this front. For example: have you thought of looking for your community on Flickr? Last.fm? Digg? A locally popular platform? (LiveJournal dominates in eastern Europe, for example, while QQ is China’s answer to Twitter, Mixi is Japan‘s answer to Facebook, and Orkut and Hi5 have a healthy userbase in places like Brazil and India)

social media prism

Look for the communities in every corner of the net – don’t expect them to come pre-labelled. For example, the forums of local football clubs and local newspapers often contain corners devoted to topics other than football and news. And follow people as well as topics: if you find someone in your field, search for their username across the web and see what other places they contribute to.

One other place to look: the physical world. Live events, conferences, meetups and other gatherings are ideal places to build relationships with members of the community – as well as a great opportunity for providing live coverage online that will lead you to others, and others to you.

2. Look for problems to solve

Once you’ve identified and are following the community, try to find your best role within it. Remember that the community is not here to serve you: barging in and asking for case studies will get you the same response as if you did that in any physical social space: blank stares and muttered insults.

The simplest way to find a place in a community is through solving problems.

Listen for questions that people are asking, or complaints that they make. A key skill of a journalist is to find the answers to questions, or get responses to complaints – so that’s likely to be one way you can contribute.

Those answers and responses, of course, also make for good evergreen content (which can help you attract other members of the community), so cross-post them on your blog as well as on the platforms where the community gathers.

You might also see the need for physical meetups or other events – don’t be afraid to get stuck in and organise one.

3. Be interested – listen and ask questions

You will be both a better journalist and community editor if you listen as much as possible, and ask when you want to hear more about something.

It doesn’t have to be newsworthy – in fact, it’s sometimes better when it’s not – because often it’s an understanding of the small details and complex context which makes for better journalism and, by extension, better – and more – relationships with contacts.

4. Create content out of the process of discovery

As you explore a community a good practice to adopt is to record your research in ways that make it easier for others to engage with the community too. This helps you see what is interesting about a community, as well as creating content which can help contacts find you.

Examples of typical content created from the process of community research include:

  • ‘Top 20 people in [an industry] to follow on Twitter’
  • ‘The best forums for [your field or issue]‘
  • ‘The hottest discussions about the [issue] right now’
  • ‘Where do [profession] go for advice on [problem]?’
  • ‘The best blogs about [your field/issue]‘
  • ‘Forums roundup: what people are saying about [issue/question]‘

You may need to make a choice on where to post this content, especially if the community is not a big user of blogs. Don’t publish in a way that is disconnected from the community that you are supposed to be serving: at the very least cross-publish to the platforms where discussion is healthiest. Don’t spam shared spaces with links to external content.

You might also profile members of the community, or – at a later stage – create something that pulls together profiles, points of view, or experiences. For example: this Times Educational Supplement piece collects excluded pupils’ experiences (incomplete version online); We Are The 99 Percent uses Tumblr to pull together the experiences that inspired a protest movement; and Spitalfields Life seeks to document the places and people of the area, while this Guardian interactive allows you to explore the voices of 100 NHS workers on health reforms.

5. Link, retweet, attribute and comment

Finally, it’s important to link to content from your community as often as possible. This does two things: firstly, it demonstrates good attribution and demonstrates that you are not looking to take credit for yourself which belongs to others. And secondly, it makes other people aware of your work: a link to another blog generates a ‘pingback’ which alerts the author to your piece. Twitter users are notified if their tweet is retweeted by you, Facebook users if you ‘like’ their update, and so on. Comments are an extension of the same principle of acknowledgement.

Linking – or ‘linkblogging’ – is also the simplest way to begin engaging with a field and its communities, and a good habit to get into if you want to understand an area and get in the habit of keeping up to date with it. For more on that, 7 ways to follow a field… is a good guide.

6. Read about community management

As you gain in confidence and reputation, you may find yourself doing more and more in your community. Community management is, to my mind, one of the hardest roles in online journalism to do well, and the more insights you can gather from others, the better prepared you will be.

This list of resources from FeverBee is as good as they come. You should also follow blogs in the field – that list contains a section on those, but if you just want 5 to start with, here’s a bundle to subscribe to.

PS: If you want to see explanations of job descriptions of the CM, and other roles such as social media manager, this post by Blaise Grimes-Viort does a very good job of trying to unpick the subtle differences and links to typical job descriptions. More on traits of community managers at ReadWriteWebThe Constant Observer and Business2Community.

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data: http://api.opencorporates.com/companies/gb/00102498/data

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form http://api.opencorporates.com/data/2601371, which we can generate as follows:

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the “representative_name_lines (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!)

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

  • value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
  • value.parseJson().datum.attributes.mark_text
  • value.parseJson().datum.attributes.representative_address_lines
  • value.parseJson().datum.attributes.representative_name_lines
  • value.parseJson().datum.attributes.international_registration_number

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
http://www.wipo.int/romarin/images/XX/YY/XXYYNN.typ
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if( or(value.parseJson().datum.attributes.mark_image_type==’JPG’, value.parseJson().datum.attributes.mark_image_type==’GIF’), ‘http://www.wipo.int/romarin/images/&#8217; + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + ‘/’ + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + ‘/’ + value.parseJson().datum.attributes.international_registration_number + ‘.’ + toLowercase (value.parseJson().datum.attributes.mark_image_type), ”)

The first part checks that we have a GIF or JPG image type identified, and if it does, then we construct the URL path, and finally cast the filetype to lower case, else we return an empty string.

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the fie in a browser, or I can cut and paste the html into this page…

null null
null null
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
“[“Murgitroyd & Company”]“ “[“Scotland House,”,”165-169 Scotland Street”,”Glasgow G5 8PL”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“
“[“BP Group Trade Marks”]“ “[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“ROBERT WILLIAM BOAD”,”BP p.l.c. – GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“MURGITROYD & COMPANY”]“ “[“17 Lansdowne Road”,”Croydon, Surrey CRO 2BX”]“
“[“A.C. CHILLINGWORTH”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON EC2M 7BA”]“
“[“BP Group Trade Marks”]“ “[“20 Canada Square, Canary Wharf”,”London E14 5NJ”]“
“[“ROBERT WILLIAM BOAD”,”GROUP TRADE MARKS”]“ “[“Britannic House,”,”1 Finsbury Circus”,”LONDON, EC2M 7BA”]“
“[“BP GROUP TRADE MARKS”]“ “[“20 Canada Square,”,”Canary Wharf”,”London E14 5NJ”]“

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad number of legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere (so for example, could all through steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer of virtual machine for the script to make use of?)

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.

Where chasing traffic meets “important” journalism: Gawker’s experiment

Nieman reports on a fascinating experiment in traffic-chasing content from Gawker which provides all sorts of insights into just how valuable that content is, and where it sits in the wider editorial mix. Here’s what they did:

“Each day for two weeks, a [different] single staff writer would be assigned “traffic-whoring duty.”” Continue reading