Monthly Archives: April 2012

From Paywalls and Attention Walls to Data Disclosure Walls and Survey Walls

Is it really only a couple of years since the latest, widely quoted, iteration of the idea that “If you are not paying for it, you’re not the customer; you’re the product being sold” first did the rounds in the context of web economics?

[Notes for folk visiting this site from a referral thread on metafilter]

Prompted by the recent release of a new Google product that presents site visitors with a paid-for, revenue-generating survey before they can see the site’s content, here are a few observations around that idea…

First, let’s just consider the paywall for a moment. Paywalls on the web prevent you from accessing content without payment or some other form of financial subscription. I’m guessing the term was originally coined as a corruption of the term “firewall”, which in a network sense is a component that either allows or prevents network traffic from passing from one device to another based on a set of rules. For example, a firewall might block traffic from a .xxx domain or particular IP address. [OpenLearn: What are firewalls?]

If a user can be tracked across pageviews within a single visit to a site, or across multiple visits to the site, the paywall may be configured to allow the user to see so many items for free per visit, or per month, before they are required to pay.
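As a minimal sketch of that metering logic – the limits and names below are made up for illustration, not any particular publisher’s implementation – the rule boils down to a per-user counter with a monthly cap:

#Hypothetical sketch of a metered paywall rule - not any real publisher's code
from collections import defaultdict
from datetime import date

FREE_VIEWS_PER_MONTH = 10        #anonymous visitors
REGISTERED_VIEWS_PER_MONTH = 20  #visitors who have registered some data

views = defaultdict(int)  #(user_id, year, month) -> articles viewed so far

def can_view(user_id, registered=False, subscriber=False):
    #Subscribers always get the content; everyone else is metered per calendar month
    if subscriber:
        return True
    today = date.today()
    key = (user_id, today.year, today.month)
    limit = REGISTERED_VIEWS_PER_MONTH if registered else FREE_VIEWS_PER_MONTH
    if views[key] >= limit:
        return False
    views[key] += 1
    return True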

Paywalls can come in a literal form – you pays your money and you gets your content – or at one remove: you hand over your data, and it’s used to charge an advertiser a premium rate for selling ads to you as a known entity, or it’s sold on to a third party. This is the sense in which you are the product. So how does it work?

If you’ve watched an online video recently, whether on a site such as Youtube, or a (commercial) watch-again TV service such as ITV Player or 4od, you may well have been exposed to a pre-roll advert before the video you want to watch begins. Many commercial media websites, too, load first with an ad-containing lightbox that overlays the article you actually want to read, often with a “Skip Ad” action required if you want to bail out of the ad early.

These ads are one of the ways these sites generate income, of course, income that at the end of the day helps pay to keep the site running.

The price paid for these ads typically depends on the size and the “quality”, or specificity, of the audience the site delivers to the advertiser (that is, the audience segment: [OpenLearn: Market segmentation and targeting]). Sites (and magazines, and TV programmes) all have audiences with particular demographics and sets of interests, and these specialist or well defined audience groups are what the publisher sells to the advertiser.

(From years ago, I remember a bid briefing for a science outreach funding programme where we were told we would be marked down severely if we said the intended audience for our particular projects was “the general public”. What they wanted to know was what audience we were specifically going to hit, and how we were going to tune our projects to engage and inform that particular audience. Same story.)

At the end of the day, adverts are used to persuade audiences to purchase product. So you give data to a publisher, they use that to charge an advertiser a higher rate for being able to put ads in front of particular audiences who are presumably likely to buy the advertiser’s wares if nudged appropriately, and you buy the product. With cash that pays the advertiser who bought the ad from the publisher who sold your details to them. So you still paid to access that content. With a “free gift” in the form of the goods you bought from the advertiser who bought the ads from the publisher that were placed in front of a particular audience on the basis of the data you gave to the publisher.

Let’s reconsider the paywall mediated sites, for a moment, where for example you get 10 free articles a month, 20 if you register, unlimited if you pay. The second option requires that you register some personal information with the site, such as an email address, or date of birth. You get +x article views on the site “for free” in exchange for giving the website y pieces of data: you have bought those extra “free” views with your data. The money the site would have got from you if you had paid with cash is replaced by income generated from your data. For example, if the publisher sells adverts at a high price to audiences in the 17-25 range, and you are in that age range, the disclosure of your birthdate allows you to be put into that audience group which is sold to advertisers as such. If you handed over your email address, that can also be sold on to email marketers; if you had to verify that address by clicking on a link emailed to it, it becomes more valuable because it’s more likely to be a legitimate email address. More value can be added to the email address if it is sold as a verified email address belonging to a 17-25 year old, and so on.

Under the assumption that by paying attention to an ad you become more likely to buy a product, or tell someone about the product who is likely to buy it, the paywall essentially becomes replaced by an “attention based, indirect paywall”.

A new initiative by Google ramps up the data-exchange based paywall even further: Google Consumer Surveys. Marketing magazine describes it as follows (Google’s new survey tool: DIY research tool and pay wall alternative):

‘Google Consumer Surveys’ is a survey tool which blocks sections of webpages or articles until the reader answers a question, paying the website owner five cents per response when they do. The service is being billed as an alternative revenue model for publishers considering a pay wall strategy, launching with a handful of news partners last week.

The service works as a DIY research tool, charging users 10 cents per response to questions of their choice. Buyers of the research have the option to pay an extra 40 cents per response to target sub-populations based on gender, age and location and can target more specific audiences, such as dog owners, with a screening and follow-up question option that costs an additional 50 cents per response.

So let’s unpick that: rather than running ads, the publisher runs a survey. They essentially get paid (via Google) for running the survey by someone who pays Google to run the survey. You hand over your data to the survey company who pays Google who pays the publisher for delivering you, the survey subject. Rather than targeting ads at you, Google targets you as a survey subject, mediated by the publisher who delivers a particular audience demographic (rather than using sites to target particular audiences, I guess Google will end up using knowledge about audiences to ensure that surveys are displayed to a wide range of subjects, thus ensuring a fair sample. Which means, as Marketing mag suggests, “the questions [will] potentially having nothing to do with the site’s content…”). Rather than trying to influence you as a purchaser by presenting you with an ad, in the hope that you will return cash to the person who originally paid for the ad by buying their wares, disclosure about your beliefs is now the currency. (I need to check about the extent to which: a) Google can in principle and in fact reconcile survey results with a user ID; b) the extent to which Google provides detailed information back to the survey commissioner about the demographics and identity of the survey subjects. Marketing mag suggests “[t]o pre-empt any privacy fears, the search giant is emphasising that all surveys will be completely anonymous and that Google will not use any data collected for its own ad targeting.” So that’s all right then. But Google will presumably know that it has served you x ads and y surveys, if not what answers you gave to survey questions.)

As well as productising yourself, as sold by publishers to advertisers, by virtue of handing over your data, you’ve also paid in a couple of other senses too – with your attention and with your time. Your attention and your demographic details (that is, your propensity to buy and, at the end of the day, your purchasing power, i.e. your cash) are what you exchange for the “free” content; and if your time could otherwise be spent generating your own income, there is also an opportunity cost to you: you have not earned an hour’s income from paid work because you have spent that hour watching ads.

A couple of the missing links in advertising, of course, are reliable feedback about: 1) whether anyone actually pays attention to your ad; 2) whether they act on it. Google cracked part of the action puzzle, at least in terms of ad payments, by coming up with an advertising charging model that got advertisers to bid for ad placements and then only pay if someone clicked through on the ad (pay-per-click, PPC advertising) rather than using the original display oriented, “impression based” advertising, where advertisers would pay for so many impressions of their advert (CPM, cost per mille, i.e. cost per thousand impressions).
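By way of a toy, back-of-the-envelope comparison – every rate below is invented for illustration, not a real ad price – the difference between the two charging models is just arithmetic:

#Toy comparison of CPM and CPC charging - all of these rates are invented for illustration
impressions = 100000        #number of times the ad is shown
cpm_rate = 2.00             #assumed cost per thousand impressions
clickthrough_rate = 0.002   #assume 0.2% of impressions result in a click
cpc_rate = 0.50             #assumed cost per click

cpm_cost = (impressions / 1000.0) * cpm_rate
clicks = impressions * clickthrough_rate
cpc_cost = clicks * cpc_rate

print 'CPM model: pay for impressions:', cpm_cost
print 'CPC model: pay for', int(clicks), 'clicks:', cpc_cost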

It seems that Google are now trying to put CPM based metrics on a firmer footing with a newly announced metric, Active View (Making the Web Work for Brand Marketers).

Advertisers have long looked for insight into whether consumers saw an ad on page 145 of a magazine, or switched the channel during a TV commercial break. It’s similar online, so we’re rolling out a technology [Active View], … that can count “viewed” impressions (as defined by the IAB’s proposed standard, this is a display ad that is at least 50% viewable on the screen for at least one second).

… Active View data will be immediately actionable — advertisers will be able to pay only for viewed impressions.

They’re also looking to improve feedback on the demographics of users who actually view an advert:

Active GRP: GRP, or a gross rating point, is at the heart of offline media measurement. For example, when a fashion brand wants their TV campaign to reach 2 million women with two ads each, they use GRP to measure that. We’re introducing a new version of this for the web: Active GRP. …

… Active GRP is calculated by a statistical model that combines aggregated panel data and anonymous user data (either inferred or user-provided), and will work in conjunction with Active View to measure viewed impressions. This approach overcomes problems of potential panel skewing and reliance on a single data source. This approach also has the advantage of never using personally identifiable information, not sharing user data with third parties, and enabling users, through Google’s Ads Preferences Manager, to opt-out.
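For what it’s worth, the traditional GRP arithmetic behind that fashion brand example is simple enough to sketch; the target population figure below is purely an assumption for illustration:

#Toy GRP calculation - the target population size is an assumption for illustration only
target_population = 20000000.0   #assumed number of women in the target audience
reach = 2000000                  #women reached, from the fashion brand example
frequency = 2                    #average number of exposures each

reach_percent = 100.0 * reach / target_population
grp = reach_percent * frequency
print 'Reach: %.1f%%, frequency: %d, GRP: %.1f' % (reach_percent, frequency, grp)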

Both these announcements were made in the context of Google’s Brand Activate initiative.

Facebook, too, is looking to improve its reporting – and maybe its ad targeting? – to advertisers. Although I can’t offhand find an original Facebook source, TechCrunch (Facebook Ads Can Now Be Optimized To Drive Any On-Facebook Action, Such As In-App Purchases, Shares, Offer Claims), Mashable (Facebook’s Analytics Tool for Ads Will Soon Measure Actions Other Than ‘Likes’) et al are reporting on a Facebook briefing that described how advertisers will be able to view reports describing the downstream actions taken by people who have viewed a particular advert. The TechCrunch article also suggests that the likelihood of a user performing a particular action might form part of the targeting criteria (“today Facebook begins allowing advertisers using its API to ask it to show their ads to people most likely to take any specific post-click action on the social network, such as sharing a brand’s content to the news feed, buying virtual goods in their apps, or redeeming one of the new Facebook Offers at a local brick-and-mortar store”).

So now, it seems that the you that is the product may well soon include your (likely) actions…

See also: Expectations Matter, Even If You’re Not ‘A Customer’ which links in to a discussion about what reasonable expectations you might have as a user of a “free” service.

And this: Contextual Content Delivery on Higher Ed Websites Using Ad Servers, on something of Google’s ad targeting capacity as of a couple of years ago…

[Notes: I would reply in the thread but I don’t want to have to pay cash for the, erm, privilege of doing so… I also appreciate that none of these ideas are necessarily original, and I recognise that the model applies to TV, radio, print or whatever other content carrier and container you care to talk about… I suspect that Blue Beetle isn’t actually the source of the “you are the product” slogan this time round, anyway, (in recent months, Wired probably is) although many search engines lead that way. (So for example, it’s easy to find earlier, similarly pithy, expressions of the same sentiment in the web context all over the place… For example, this 2009 post; or this one). And not that you’ll care, this blog is my notebook, and these notes are just me scribbling down some context around the Google survey product (the post construction/writing style reflects that) #trollFeeding PS Since everybody knows that 1+1=2, I figure we probably don’t need to teach it anymore #deadHorseFlogging #gettingChildishNow #justLikeAMetaFilterThread]

When data goes bad


Image by Lauren York

Data is so central to the decision-making that shapes our countries, jobs and even personal lives that an increasing amount of data journalism involves scrutinising the problems with the very data itself. Here’s an illustrative list of when bad data becomes the story – and the lessons they can teach data journalists:

Deaths in police custody unrecorded

This investigation by the Bureau of Investigative Journalism demonstrates an important question to ask about data: who decides what gets recorded?

In this case, the BIJ identified “a number of cases not included in the official tally of 16 ‘restraint-related’ deaths in the decade to 2009 … Some cases were not included because the person has not been officially arrested or detained.”

As they explain:

“It turns out the IPCC has a very tight definition of ‘in custody’ –  defined only as when someone has been formally arrested or detained under the mental health act. This does not include people who have died after being in contact with the police.

“There are in fact two lists. The one which includes the widely quoted list of sixteen deaths in custody only records the cases where the person has been arrested or detained under the mental health act. So, an individual who comes into contact with the police – is never arrested or detained – but nonetheless dies after being restrained, is not included in the figures.

“… But even using the IPCC’s tightly drawn definition, the Bureau has identified cases that are still missing.”

Cross-checking the official statistics against wider reports was a key technique. As was using the Freedom of Information Act to request the details behind them and the details of those “who died in circumstances where restraint was used but was not necessarily a direct cause of death”.

Cooking the books on drug-related murders

Drug related murders in Mexico

Cross-checking statistics against reports was also used in this investigation by Diego Valle-Jones into Mexican drug deaths:

“The Acteal massacre committed by paramilitary units with government backing against 45 Tzotzil Indians is missing from the vital statistics database. According to the INEGI there were only 2 deaths during December 1997 in the municipality of Chenalho, where the massacre occurred. What a silly way to avoid recording homicides! Now it is just a question of which data is less corrupt.”

Diego also used the Benford’s Law technique to identify potentially fraudulent data – a technique that has also been used to highlight relationships between dodgy company data and real world events such as the dotcom bubble and deregulation.
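If you’ve not come across Benford’s Law before, the first-digit check is easy enough to run yourself – here’s a minimal sketch, with placeholder numbers standing in for a real series of counts:

#Minimal Benford's Law first-digit check - the sample values are placeholders, not real figures
import math
from collections import Counter

def first_digit_distribution(values):
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = float(len(digits))
    return dict((d, counts.get(d, 0) / total) for d in range(1, 10))

benford = dict((d, math.log10(1 + 1.0 / d)) for d in range(1, 10))
observed = first_digit_distribution([23, 112, 9, 187, 34, 56, 1203, 78, 15, 240])  #placeholder data

for d in range(1, 10):
    print d, 'observed: %.3f' % observed[d], 'expected: %.3f' % benford[d]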

Poor records mean no checks

Detective Inspector Philip Shakesheff exposed a “gap between [local authority] records and police data”, reported The Sunday Times in a story headlined ‘Care home loses child 130 times‘:

“The true scale of the problem was revealed after a check of records on police computers. For every child officially recorded by local authorities as missing in 2010, another seven were unaccounted for without their absence being noted.”

Why is it important?

“The number who go missing is one of the indicators on which Ofsted judges how well children’s homes are performing and the homes have a legal duty to keep accurate records.

“However, there is evidence some homes are failing to do so. In one case, Ofsted gave a good report to a private children’s home in Worcestershire when police records showed 1,630 missing person reports in five years. Police stationed an officer at the home and pressed Ofsted to look closer. The home was downgraded to inadequate and it later closed.

“The risks of being missing from care are demonstrated by Zoe Thomsett, 17, who was Westminster council’s responsibility. It sent her to a care home in Herefordshire, where she went missing several times, the final time for three days. She had earlier been found at an address in Hereford, but because no record was kept, nobody checked the address. She died there of a drugs overdose.

“The troubled life of Dane Edgar, 14, ended with a drugs overdose at a friend’s house after he repeatedly went missing from a children’s home in Northumberland. Another 14-year-old, James Jordan, was killed when he absconded from care and was the passenger in a stolen car.”

Interests not registered

When there are no formal checks on declarations of interest, how can we rely on them? In Chile, the Ciudadano Inteligente Fundacion decided to check the Chilean MPs’ register of assets and interests by building a database:

“No-one was analysing this data, so it was incomplete,” explained Felipe Heusser, executive president of the Fundacion. “We used technology to build a database, using a wide range of open data and mapped all the MPs’ interests. From that, we found that nearly 40% of MPs were not disclosing their assets fully.”

The organisation has now launched a database that “enables members of the public to find potential conflicts of interest by analysing the data disclosed through the members’ register of assets.”

Data laundering

Tony Hirst’s post about how dodgy data was “laundered” by Facebook in a consultant’s report is a good illustration of the need to ‘follow the data’.

“We have some dodgy evidence, about which we’re biased, so we give it to an “independent” consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as a finding in the Deloitte report, so that it can make its way into a policy briefing.”

“Things just don’t add up”

In the video below Ellen Miller of the Sunlight Foundation takes the US government to task over the inconsistencies in its transparency agenda, and the flawed data published on its USAspending.gov site – so flawed that the foundation launched the Clearspending website to automate and highlight the discrepancy between two sources of the same data:

Key budget decisions made on useless data

Sometimes data might appear to tell an astonishing story, but this turns out to be a mistake – and that mistake itself leads you to something much more newsworthy, as Channel 4′s FactCheck found when it started trying to find out if councils had been cutting spending on Sure Start children’s centres:

“That ought to be fairly straightforward, as all councils by law have to fill in something called a Section 251 workbook detailing how much they are spending on various services for young people.

“… Brent Council in north London appeared to have slashed its funding by nearly 90 per cent, something that seemed strange, as we hadn’t heard an outcry from local parents.

“The council swiftly admitted making an accounting error – to the tune of a staggering £6m.”

And they weren’t the only ones. In fact, the Department for Education admitted the numbers were “not very accurate”:

“So to recap, these spending figures don’t actually reflect the real amount of money spent; figures from different councils are not comparable with each other; spending in one year can’t be compared usefully with other years; and the government doesn’t propose to audit the figures or correct them when they’re wrong.”

This was particularly important because the S251 form “is the document the government uses to reallocate funding from council-run schools to its flagship academies”:

“The Local Government Association (LGA) says less than £250m should be swiped from council budgets and given to academies, while the government wants to cut more than £1bn, prompting accusations that it is overfunding its favoured schools to the detriment of thousands of other children.

“Many councils’ complaints, made plain in responses to an ongoing government consultation, hinge on DfE’s use of S251, a document it has variously described as “unaudited”, “flawed” and “not fit for purpose”.

No data is still a story

Sticking with education, the TES reports on the outcome of an FOI request on the experience of Ofsted inspectors:

“[Stephen] Ball submitted a Freedom of Information request, asking how many HMIs had experience of being a secondary head, and how many of those had led an outstanding school. The answer? Ofsted “does not hold the details”.

““Secondary heads and academy principals need to be reassured that their work is judged by people who understand its complexity,” Mr Ball said. “Training as a good head of department or a primary school leader on the framework is no longer adequate. Secondary heads don’t fear judgement, but they expect to be judged by people who have experience as well as a theoretical training. After all, a working knowledge of the highway code doesn’t qualify you to become a driving examiner.”

“… Sir Michael Wilshaw, Ofsted’s new chief inspector, has already argued publicly that raw data are a key factor in assessing a school’s performance. By not providing the facts to back up its boasts about the expertise of its inspectors, many heads will remain sceptical of the watchdog’s claims.”

Men aren’t as tall as they say they are

To round off, here’s a quirky piece of data journalism by dating site OkCupid, which looked at the height of its members and found an interesting pattern:

Male height distribution on OKCupid

“The male heights on OkCupid very nearly follow the expected normal distribution—except the whole thing is shifted to the right of where it should be.

“Almost universally guys like to add a couple inches. You can also see a more subtle vanity at work: starting at roughly 5′ 8″, the top of the dotted curve tilts even further rightward. This means that guys as they get closer to six feet round up a bit more than usual, stretching for that coveted psychological benchmark.”

Do you know of any other examples of bad data forming the basis of a story? Please post a comment – I’m collecting examples.

UPDATE (April 20 2012): A useful addition from Simon Rogers: Named and shamed: the worst government annual reports explains why government department spending reports fail to support the Government’s claimed desire for an “army of armchair auditors”, with a list of the worst offenders at the end.


Tales of a Video Blogger

In a guest post for OJB, cross-posted from Putney Debater, Michael Chanan explores his experiences of video blogging for the New Statesman and how it differs from conventional documentary.

Written for presentation at ‘Marx at the Movies’, these notes address the topic from an angle rarely treated in film and video scholarship: that of the peculiar labour process and mode of production involved.

When I started video blogging on the New Statesman, I don’t know if either the NS or myself quite knew what to expect. The main reason for not knowing: it was December 2010, it was clear that something momentous was going on, that the protest movement was building, and the idea I had, which the NS agreed to go with, was simple enough: to go out and film stuff that was happening from a sympathetic point of view, and thus, almost week by week, build up a kind of ongoing documentary record of the events. I was thinking in terms of Glauber Rocha’s formula for Cinema Novo in Brazil—to go and make films with a camera in the hand and an idea in the head. I also had the idea from the outset of bringing these blogs together sometime later into a single long documentary (which duly appeared as Chronicle of Protest).

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates, of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson, urllib

import networkx as nx

#Keep the API key private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OCKEY"]
except:
    ockey=''

rurl='http://opencorporates.com/reconcile/gb?query=tesco'
#note - the opencorporates api also offers a search:  companies/search
entities=simplejson.load(urllib.urlopen(rurl))

def getOCcompanyData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/data'+'?api_token='+ockey
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?per_page=100&api_token='+ockey
    tmpdata=simplejson.load(urllib.urlopen(ocurl))
    ocdata=tmpdata['filings']
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        page=str(tmpdata['page']+1)
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
        ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?page='+page+'&per_page=100&api_token='+ockey
        tmpdata=simplejson.load(urllib.urlopen(ocurl))
        ocdata=ocdata+tmpdata['filings']
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    ddata={}
    ddata['ocname']=ocname
    ddata['ocid']=ocid
    ddata['fdesc']=ffiling["description"]
    ddata['fdirector']=director
    ddata['fdate']=ffiling["date"]
    ddata['fid']=ffiling["id"]
    ddata['ftyp']=ffiling["filing_type"]
    ddata['fcode']=ffiling["filing_code"]
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            desc=filing["filing"]["description"]
            director=desc.replace('DIRECTOR APPOINTED ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            desc=filing["filing"]["description"]
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)

for entity in entities['result']:
    ocid=entity['id']
    ocname=entity['name']
    filings=getOCfilingData(ocid)
    logDirectors(ocname,ocid,filings)

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)

DG=nx.DiGraph()

directors=[]
companies=[]
for row in data:
    if row['fdirector'] not in directors:
        directors.append(row['fdirector'])
        DG.add_node(directors.index(row['fdirector']),label=row['fdirector'],name=row['fdirector'])
    if row['ocname'] not in companies:
        companies.append(row['ocname'])
        DG.add_node(row['ocid'],label=row['ocname'],name=row['ocname'])   
    DG.add_edge(directors.index(row['fdirector']),row['ocid'])

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(DG)

print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note that parts of the whole graph are disconnected.

In the fragment below, we see Tesco Property Nominees are only loosely linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is it arbitrary?

The future of open journalism: how journalists need to step up their game

Wolf blowing down the pig's house

Illustration by Leonard Leslie Brooke, from Wikimedia Commons

Cross-posted from XCity Magazine

The future of journalism, according to The Guardian’s ‘3 Little Pigs’ film, is “open journalism”. Users are becoming part of every element of news production. The newsroom no longer has walls.

If that is going to happen then journalists need to huff, and puff, and blow down three particular houses of our own: our preconceptions around the sources that we use online; around why people contribute to the news process; and about how we protect our sources.

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});

    google.setOnLoadCallback(drawTable);

    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?
    //json_table.setColumns([0,1,2,3,4,5,6,7])

    var formatter = new google.visualization.PatternFormat('<a href="http://www.bbc.co.uk/programmes/{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply the formatter to the programme ID column (index 1).

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'
      }
    });

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);

    }

The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for some time whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL='http://opencorporates.com/reconcile?query='+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
try:
    recData=simplejson.load(urllib.urlopen(ocrecURL))
except:
    recData={'result':[]}
print ocrecURL,[recData]
if len(recData['result'])>0:
    if recData['result'][0]['score']>=0.7:
        record['ocData']=recData['result'][0]
        record['ocID']=recData['result'][0]['uri']
        record['ocName']=recData['result'][0]['name']

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)
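Here’s the sort of thing I have in mind for that domain comparison – a sketch only: companyURL stands for whatever website address we might pull back from OpenCorporates, and bottomLineLink for the link given on the programme page (both names are mine, not fields from either API):

#Sketch of the domain comparison check - the two arguments are just illustrative names,
#not fields guaranteed by either the OpenCorporates API or the programme page data
from urlparse import urlparse

def _domain(url):
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith('www.') else netloc

def domains_match(companyURL, bottomLineLink):
    #A match between the two domains suggests the reconciliation hit is probably the right company
    if not companyURL or not bottomLineLink:
        return False
    d1, d2 = _domain(companyURL), _domain(bottomLineLink)
    return d1 != '' and d1 == d2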

As @stuartbrown suggested in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn LEarning Outcomes Search

Presentations translated into Arabic: guides for citizen journalists

Late last year I was asked to put together some presentations giving advice on verifying information, finding people and stories online, ethics, and news values. These were translated by Anas Qtiesh into Arabic as part of CheckDesk, a project to support Middle East citizen journalists created by Meedan at Birmingham City University.

The materials are collected at ArabCitizenMedia.org. I’ve linked to each presentation above.

Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API

In Visualising Networks in Gephi via a Scraperwiki Exported GEXF File I gave an example of how we can publish arbitrary serialised output file formats from Scraperwiki using the GEXF XML file format as a specific example. Of more general use, however, may be the ability to export Scraperwiki data using the Google visualisation API DataTable format. Muddling around the Google site last night, I noticed the Google Data Source Python Library that makes it easy to generate appropriately formatted JSON data that can be consumed by the (client side) Google visualisation library. (This library provides support for generating line charts, bar charts, sortable tables, etc, as well as interactive dashboards.) A tweet to @frabcus questioning whether the gviz_api Python library was available as a third party library on Scraperwiki resulted in him installing it (thanks, Francis:-), so this post is by way of thanks…

Anyway, here are a couple of examples of how to use the library. The first is a self-contained example (using code pinched from here) that transforms the data into the Google format and then drops it into an HTML page template that can consume the data, in this case displaying it as a sortable table (GViz API on scraperwiki – self-contained sortable table view [code]):

Of possibly more use in the general case is a JSONP exporter (example JSON output (code)):

Here’s the code for the JSON feed example:

import scraperwiki
import gviz_api

#Example of:
## how to use the Google gviz Python library to cast Scraperwiki data into the Gviz format and export it as JSON

#Based on the code example at:
#http://code.google.com/apis/chart/interactive/docs/dev/gviz_api_lib.html

scraperwiki.sqlite.attach( 'openlearn-units' )
q = 'parentCourseCode,name,topic,unitcode FROM "swdata" LIMIT 20'
data = scraperwiki.sqlite.select(q)

description = {"parentCourseCode": ("string", "Parent Course"),"name": ("string", "Unit name"),"unitcode": ("string", "Unit Code"),"topic":("string","Topic")}

data_table = gviz_api.DataTable(description)
data_table.LoadData(data)

json = data_table.ToJSon(columns_order=("unitcode","name", "topic","parentCourseCode" ),order_by="unitcode")

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
print 'ousefulHack('+json+')'

I hardcoded the wraparound function name (ousefulHack), which then got me wondering: is there a safe/trusted/approved way of grabbing arguments out of the URL in Scraperwiki so this could be set via a calling URL?
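One candidate approach – no idea whether it counts as approved, so treat this as a sketch – is the same os.getenv("QUERY_STRING")/cgi.parse_qsl trick used in the Tesco scraper above, swapped in for the last two lines of the view (the "callback" parameter name is just my own choice):

#Sketch: pull the callback name from the view's query string, falling back to the hardcoded one;
#this reuses the QUERY_STRING/cgi.parse_qsl pattern from the Tesco scraper above, and assumes
#the scraperwiki import and the json variable from the view code above
import os, cgi

try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    callback = qsenv.get("callback", "ousefulHack")
except:
    callback = "ousefulHack"

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
print callback + '(' + json + ')'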

Anyway, what this shows (hopefully) is an easy way of getting data from Scraperwiki into the Google visualisation API data format, and then either consuming it via a Scraperwiki view using an HTML page template, or publishing it as a Google visualisation API JSONP feed that can be consumed by an arbitrary web page and used directly to drive Google visualisation API chart widgets.

PS as well as noting that the gviz python library “can be used to create a google.visualization.DataTable usable by visualizations built on the Google Visualization API” (gviz_api.py sourcecode), it seems that we can also use it to generate a range of output formats: Google viz API JSON (.ToJSon), as a simple JSON Response (.ToJSonResponse), as Javascript (“JS Code”) (.ToJSCode), as CSV (.ToCsv), as TSV (.ToTsvExcel) or as an HTML table (.ToHtml). A ToResponse method (ToResponse(self, columns_order=None, order_by=(), tqx=””)) can also be used to select the output response type based on the tqx parameter value (out:json, out:csv, out:html, out:tsv-excel).

PPS looking at eg https://spreadsheets.google.com/tq?key=rYQm6lTXPH8dHA6XGhJVFsA&pub=1 which can be pulled into a javascript google.visualization.Query(), it seems we get the following returned:
google.visualization.Query.setResponse({"version":"0.6","status":"ok","sig":"1664774139","table":{ "cols":[ ... ], "rows":[ ... ] }})
I think google.visualization.Query.setResponse can be a user defined callback function name; maybe worth trying to implement this one day?

Visualising Networks in Gephi via a Scraperwiki Exported GEXF File

How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.

A quick bit of backstory first…

A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue

So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:

The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).

As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:

The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).

From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, networkx will create it for you.)

I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry a newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)

To keep gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)

Here’s how I built the graph:

import scraperwiki
import urllib
import networkx as nx

scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)

G=nx.Graph()

topics=[]
for row in data:
    G.add_node(row['unitcode'],label=row['unitcode'],name=row['name'],parentCC=row['parentCourseCode'])
    topic=row['topic']
    if topic not in topics:
        topics.append(topic)
    tID=topics.index(topic)
    topicID='topic_'+str(tID)
    G.add_node(topicID,label=topic,name=topic)     
    G.add_edge(topicID,row['parentCourseCode'])
    G.add_edge(row['unitcode'],row['parentCourseCode'])

Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for a Scraperwiki view, we want to print out an XML representation of the graph, not save it to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.

Here’s the code I hacked out as a result of that little investigation:

import networkx.readwrite.gexf as gf

writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(G)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

Note the use of the scraperwiki.utils.httpresponseheader to set the MIMEtype of the view. If we don’t do this, scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.

Here’s the full code for the view.

And here’s the GEXF view:

Save this file with a .gexf suffix and you can then open it directly in Gephi.

Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.

PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.

Video: how a local website helped uncover police surveillance of Muslim neighbourhoods

Cross-posted from Help Me Investigate

The Stirrer was an independent news website in Birmingham that investigated a number of local issues in collaboration with local people. One investigation in particular – into the deployment of CCTV cameras in largely Muslim areas of the city without consultation – was picked up by The Guardian’s Paul Lewis, who discovered its roots in anti-terrorism funds.

The coverage led to an investigation into claims of police misleading councillors, and the eventual halting of the scheme.

As part of a series of interviews for Help Me Investigate, founder Adrian Goldberg – who now presents ‘5 live Investigates‘ and a daily show on BBC Radio WM – talks about his experiences of running the site and how the story evolved from a user’s tip-off.