Category Archives: online journalism

Why your mark doesn’t matter (and why it does)

It’s that time of year when students get their marks and with them, sometimes, disappointment, frustration or outright confusion. These emotions tend to arise, I think, because students and academics often have very different perceptions of what marks mean.

So here are four reasons why your mark does not matter in the way you think it does – as well as some pointers to making sure things are kept in perspective. Continue reading

Postcards from a Text Processing Excursion

It never ceases to amaze me how I lack even the most basic computer skills, but that’s one of the reasons I started this blog: to demonstrate and record my fumbling learning steps so that others maybe don’t have to spend so much time being as dazed and confused as I am most of the time…

Anyway, I spent a fair chunk of yesterday trying to find a way of getting started with grappling with CSV data text files that are just a bit too big to comfortably manage in a text editor or simple spreadsheet (so files over 50,000 or so rows, up to the low millions) and that should probably be dumped into a database if that option was available, but for whatever reason, isn’t… (Not feeling comfortable with setting up and populating a database is one such reason… But I doubt I’ll get round to blogging my SQLite 101 for a bit yet…)

Note that the following tools are Unix tools – so they work on Linux and on a Mac, but probably not on Windows unless you install a Unix tools package (such as GnuWin – the CoreUtils and Sed packages look good for starters…). Another alternative would be to download the Data Journalism Developer Studio and run it either as a bootable CD/DVD, or as a virtual machine using something like VMWare or VirtualBox.

All the tools below relate to the basic mechanics of wrangling text files, which include CSV (comma separated) and TSV (tab separated) files. Your average Unix jockey will look at you with sympathetic eyes if you rave about them, but for us mere mortals, they can make life easier than you ever thought possible…

[If you know of simple tricks in the style of what follows that I haven’t included here, please feel free to add them in as a comment, and I’ll maybe try to work them into a continual updating of this post…]

If you want to play along, why not check out this openurl data from EDINA (data sample; a more comprehensive set is also available if you’re feeling brave: monthly openurl data).

So let’s start at the beginning and imagine you’re faced with a large CSV file – 10MB, 50MB, 100MB, 200MB large – and when you try to open it in your text editor (the file’s too big for Google spreadsheets and maybe even for Google Fusion Tables) the whole thing just grinds to a halt, if it doesn’t actually fall over.

What to do?

To begin with, you may want to take a deep breath and find out just what sort of beast you have to contend with. You know the file size, but what else might you learn? (I’m assuming the file has a csv suffix, L2sample.csv say, so for starters we’re assuming it’s a text file…)

The wc (word count) command is a handy little tool that will give you a quick overview of how many rows there are in the file:

wc -l L2sample.csv

I get the response 101 L2sample.csv, so there are presumably 100 data rows and 1 header row.

We can learn a little more by taking the -l linecount switch off, and getting a report back on the number of words and characters in the file as well:

wc L2sample.csv

Another thing that you might consider doing is just having a look at the structure of the file, by sampling the first few rows of it and having a peek at them. The head command can help you here.

head L2sample.csv

By default, it returns the first 10 rows of the file. If we want to change the number of rows displayed, we can use the -n switch:

head -n 4 L2sample.csv

As well as the head command, there is the tail command; this can be used to peek at the lines at the end of the file:

tail L2sample.csv
tail -n 15 L2sample.csv
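As an aside, tail can also count from the top: -n +K means “start from line K and keep going”. That gives us a handy one-liner for skipping a header row (a sketch, using the same test file):

```shell
tail -n +2 L2sample.csv    # everything from line 2 onwards, i.e. skip the header row
```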

When I look at the rows, I see they have the form:

logDate	logTime	encryptedUserIP	institutionResolverID	routerRedirectIdentifier ...
2011-04-04	00:00:03	kJJNjAytJ2eWV+pjbvbZTkJ19bk	715781	ukfed ...
2011-04-04	00:00:14	/DAGaS+tZQBzlje5FKsazNp2lhw	289516	wayf ...
2011-04-04	00:00:15	NJIy8xkJ6kHfW74zd8nU9HJ60Bc	569773	athens ...

So, not comma separated then; tab separated…;-)

If you upload a tab separated file to something like Google Fusion Tables – which I think currently only parses CSV text files, for some reason – it will happily spend the time uploading the data, and then shove it all into a single column.

I’m not sure if there are column splitting tools available in Fusion Tables – there weren’t last time I looked, though maybe we might expect a fuller range of import tools to appear at some point; many applications that accept text based data files allow you to specify the separator type, as for example in Google spreadsheets:

I’m personally living in hope that some sort of integration with the Google Refine data cleaning tool will appear one day…

If you want to take a sample of a large data file and put into another smaller file that you can play with or try things out with, the head (or tail) tool provides one way of doing that thanks to the magic of Unix redirection (which you might like to think of as a “pipe”, although that has a slightly different meaning in Unix land…). The words/jargon may sound confusing, and the syntax may look cryptic, but the effect is really powerful: take the output from a command and shove it into a file.

So, given a CSV file with a million rows, suppose we want to run a few tests in an application using a couple of hundred rows. This trick will help you generate the file containing the couple of hundred rows.

Here’s an example using L2sample.csv – we’ll create a file containing the first 20 rows, plus the header row:

head -n 21 L2sample.csv > subSample.csv

See the > sign? That says “take the output from the command on the left, and shove it into the file on the right”. (Note that if subSample.csv already exists, it will be overwritten, and you will lose the original.)
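Relatedly, a double arrow >> appends to the end of a file rather than overwriting it, which is handy if you want to build a file up in stages (a sketch):

```shell
head -n 21 L2sample.csv > subSample.csv     # '>' creates (or overwrites) the file
tail -n 5 L2sample.csv >> subSample.csv     # '>>' appends to the end of it
```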

There’s probably a better way of doing this, but if you want to generate a CSV file (with headers) containing the last 10 rows, for example, of a file, you can use the cat command to join a file containing the headers with a file containing the last 10 rows:

head -n 1 L2sample.csv > headers.csv
tail -n 10 L2sample.csv > subSample.csv
cat headers.csv subSample.csv > subSampleWithHeaders.csv

(Note: don’t try to cat a file into itself, or Ouroboros may come calling…)

Another very powerful concept from the Unix command line is the notion of | (the pipe). This lets you take the output from one command and direct it to another command (rather than directing it into a file, as > does). So for example, if we want to extract rows 10 to 15 from a file, we can use head to grab the first 15 rows, then tail to grab the last 6 rows of those 15 rows (count them: 10, 11, 12, 13, 14, 15):

head -n 15 L2sample.csv | tail -n 6 > middleSample.csv

Try to read it as an English phrase (the | and > are punctuation): take the first [head] 15 rows [-n 15] of the file L2sample.csv and use them as input [|] to the tail command; take the last [tail] 6 lines [-n 6] of the input data and save them [>] as the file middleSample.csv.
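As an aside, the same row range can be grabbed in one go with sed, which understands line addresses (a sketch):

```shell
sed -n '10,15p' L2sample.csv > middleSample.csv    # print only lines 10 to 15
```

The -n switch suppresses sed’s default behaviour of printing every line; the p command then prints just the addressed lines.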

If we want to add in the headers, we can use the cat command:

cat headers.csv middleSample.csv > middleSampleWithHeaders.csv

We can use a pipe to join all sorts of commands. If our file only uses a single word for each column header, we can count the number of columns (single words) by grabbing the header row and sending it to wc, which will count the words for us:

head -n 1 L2sample.csv | wc

(Take the first row of L2sample.csv and count the lines/words/characters. If there is one word per column header, the word count gives us the column count…;-)
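If the column headings contain more than one word, the word count trick breaks down; in that case awk can count the fields directly, given the right separator (a sketch for a tab separated file):

```shell
head -n 1 L2sample.csv | awk -F'\t' '{print NF}'    # NF = number of fields in the line
```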

Sometimes we just want to split a big file into a set of smaller files. The split command is our friend here, and lets us split a file into smaller files containing up to a known number of rows/lines:

split -l 15 L2sample.csv subSamples

This will generate a series of files named subSamplesaa, subSamplesab, …, each containing 15 lines (except for the last one, which may contain fewer…).

Note that the first file will contain the header and 14 data rows, and the other files will contain 15 data rows but no column headings. To get round this, you might want to split a version of the file that doesn’t contain the header. (So maybe use wc -l to find the number of rows in the original file, create a header-free version of the data by using tail on one less than the number of rows in the file, then split the header-free version. You might then want to use cat to put the header back into each of the smaller files…)
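Pulling those steps together – and using tail -n +2 (start from line 2) as a shortcut for the wc -l counting – the recipe might look something like this (file names made up for the example):

```shell
head -n 1 L2sample.csv > headers.csv        # save the header row
tail -n +2 L2sample.csv > noHeaders.csv     # everything except the header
split -l 15 noHeaders.csv subSamples        # split the header-free data
for f in subSamples*; do                    # put the header back into each chunk
  cat headers.csv "$f" > "withHeaders_$f"
done
```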

A couple of other Unix text processing tools let us use a CSV file as a crude database. The grep command searches a file for a particular term or text pattern (known as a regular expression, which I’m not going to cover much in this post… suffice to note for now that you can do real text processing voodoo magic with regular expressions…;-)

So for example, in our test file, I can search for rows that contain the word mendeley:

grep mendeley L2sample.csv

We can also redirect the output into a file:

grep EBSCO L2sample.csv > rowsContainingEBSCO.csv
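A couple of grep switches worth knowing about: -c counts the matching rows rather than printing them, and -i ignores case (a sketch):

```shell
grep -c EBSCO L2sample.csv    # how many rows mention EBSCO?
grep -i ebsco L2sample.csv    # match EBSCO, ebsco, Ebsco, etc.
```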

If the text file contains columns that are separated by a unique delimiter (that is, some symbol that is only ever used to separate the columns), we can use the cut command to pull out particular columns. The cut command assumes a tab delimiter (we can specify other delimiters explicitly if we need to), so we can use it to pull out data from the third column of our test file:

cut -f 3 L2sample.csv
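For a comma separated file, the delimiter can be set explicitly with the -d switch (a sketch, using a made-up file name):

```shell
cut -d ',' -f 3 someFile.csv    # third column of a comma separated file
```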

We can also pull out multiple columns and save them in a file:

cut -f 1,2,14,17 L2sample.csv > columnSample.csv

If you pull out just a single column, you can sort the entries to see what different entries are included in the column using the sort command:

cut -f 40 L2sample.csv | sort

(Take column 40 of the file L2sample.csv and sort the items.)

We can also take this sorted list and identify the unique entries using the uniq command; so here are the different entries in column 40 of our test file:

cut -f 40 L2sample.csv | sort | uniq

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values.)

(The uniq command appears to make comparisons between consecutive lines, hence the need to sort first.)

The uniq command will also count the repeat occurrence of unique entries if we ask it nicely (-c):

cut -f 40 L2sample.csv | sort | uniq -c

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values along with how many times they appear in the column as a whole.)
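Adding one more pipe stage sorts that summary numerically, largest count first – a poor man’s frequency table:

```shell
cut -f 40 L2sample.csv | sort | uniq -c | sort -rn    # most common values at the top
```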

The final command I’m going to mention here is the magic search-and-replace operator called sed. I’m aware that this post is already overlong, so I’ll maybe return to it in a later post, aside from giving you a taste of some scary voodoo… how to convert a tab delimited file to a comma separated file. One recipe is given by Kevin Ashley as follows:

sed 's/"/\"/g; s/^/"/; s/$/"/; s/ctrl-V<TAB>/","/g;' origFile.tsv > newFile.csv

(See also this related question on #getTheData: Converting large-ish tab separated files to CSV.)
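If you know the data values themselves don’t contain any commas or quotes, there’s a far less scary conversion using tr, which swaps one character for another (a sketch – note it does no quoting at all, so embedded commas will break it):

```shell
tr '\t' ',' < origFile.tsv > newFile.csv    # swap every tab for a comma
```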

Note: if you have a small amount of text and need to wrangle it in some way, the Text Mechanic site might have what you need…

This lecture note on Unix Tools provides a really handy cribsheet of Unix command line text wrangling tools, though the syntax doesn’t appear to work for me for some of the commands as given there (the important thing is the idea of what’s possible…).

If you’re looking for regular expression helpers (I haven’t really mentioned these at all in this post, suffice to say they’re a mechanism for doing pattern based search and replace, and which in the right hands can look like real voodoo text processing magic!), check out txt2re and Regexpal (about regexpal).

TO DO: this is a biggie – the join command will join rows from two files with common elements in specified columns. I can’t get it working properly with my test files, so I’m not blogging it just yet, but here’s a starter for 10 if you want to try… Unix join examples
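For what it’s worth, the usual gotcha is that join insists both files are sorted on the join column first; here’s a minimal sketch with made-up file names (untested against the EDINA data):

```shell
sort -k1,1 fileA.tsv > fileA.sorted                   # join insists both inputs are
sort -k1,1 fileB.tsv > fileB.sorted                   #   sorted on the join field
join -t "$(printf '\t')" fileA.sorted fileB.sorted    # join on column 1, tab separated
```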

FAQ: Self-regulation of online media

Another series of questions I’ve been asked – with my answers, published here because I don’t want to repeat myself…

1. You have written that people read blogs and other user generated content because they trust the person not the brand; they link or contribute to that content because ‘a journalist invested social capital’ and trust is related to their reputation, knowledge and connections. In a sense I guess reputation, knowledge and connections is what also underpins trust in curated content (e.g. Storyful). My question is whether this informal reliance on the dynamics of social capital is enough to differentiate what Storyful refers to as the ‘news from the noise’ both for ‘curated’ news and online journalism more widely. Or would, for example, a self-regulatory Code be helpful in making this differentiation transparent?

I think a more sophisticated way of looking at this is that people draw on a range of ‘signals’ to make a judgement about the trustworthiness of content: that includes their personal history of interactions with the author; the author’s formal credentials, including qualifications and employer; the author’s network; the author’s behaviour with others; numerous other factors including for example ratings by strangers, friends and robots; and of course the content itself – is the evidence transparent, the argument /narrative logical, etc.

A self-regulatory Code would add another signal – but one that could be interpreted either way: some will see it as a badge of credibility; others as a badge of ‘sell-out’ or pretence to authority. The history of previous attempts suggests it would not particularly succeed.

I’m by no means an expert on trust – would be interesting to explore more literature on that.

2. The PCC has welcomed online journalists to join it – is there interest among the online journalism community? I appreciate the level of antipathy expressed in the blogging community when this was raised late 2009, but I wonder whether the current consultation in relation to live electronic communications for court reporting by accredited journalists could raise the significance of notions of accreditation (through the PCC or independent to it) for online journalists.

Many people blog precisely because they feel that the press does not self-regulate effectively: in a sense they are competing with the PCC (one academic described bloggers as ‘Estate 4.5’). It has a bad reputation – to stand any chance of attracting bloggers it would have to give them a genuine voice and incorporate many of the ethical concerns that they have about journalism. I don’t see that happening.

Accreditation is a key issue, however – or at least respectability to advertisers; some bloggers are moving away from calling themselves ‘blogs’ as they look for more ad revenue. But they are a minority: most do not rely on ad revenue. As access to courts and council meetings is explicitly widened, accreditation is less of a problem.

3. If online journalists were to consider self-regulation what would be the key principles that might inform it: accuracy, fairness, privacy? What of harm and in particular protection of the under-18s? Or are such notions irrelevant in the world of social capital?

This post – http://paulbradshaw.wpengine.com/2011/03/07/culture-clash-journalisms-ideology-vs-blog-culture/ – is a good summary of how I think the two compare. Most bloggers see themselves as fiercely ethical and you will frequently see them rise up to defend the vulnerable – sometimes unaware of their intimidatory power in attacking those they see as responsible. You will also see them correcting inaccurate reporting.

If anything these notions are more important in a world of social capital: my accuracy, fairness and treatment of the vulnerable dictates part of my social capital. If I make a mistake, my social capital can be damaged for a long time to come – we don’t yet have any concept of our social crimes ‘expiring’ online.

‘Dead’ Osama Bin Laden photos – why have so many news sites published them?

Daily Mail leads with fake dead Bin Laden photo

Both the Daily Mail and the Daily Mirror today – along with several others in the US (including the New York Post, which credits the image to AP) and other countries – published an image purporting to be that of the dead Osama Bin Laden.

It clearly wasn’t.

Any journalist with a drop of cynicism would have questioned the source of the images – even if they did appear on Pakistan television.

It certainly passed the ‘Too good to be true’ test.

Instead, it was users of Reddit and Twitter who first highlighted the dodgy provenance of the image, and the image it was probably based on. Knight News and MSNBC’s Photoblog followed soon after.

It took me all of 10 seconds to verify that it is a fake – by using TinEye to find other instances of the image, I found this example from last April.

But instead of owning up that the image was a fake, both the Daily Mail and Mirror appear to have simply removed it from their sites, leaving the image to circulate amongst their users. Ego, pure and simple.

PS: More on verifying images and other hoax material here.

The law, ethics & effectiveness of PR firms offering bloggers prizes-to-post

A PR firm recently invited me to review their client’s product, saying that if I did review it I would be entered into a prize draw with other ‘qualifying’ bloggers to win an iPad 2.

It was a product I might ordinarily have covered, but this approach made me reluctant.

Here’s reason number 1: I asked myself whether the PR firm will have made the same approach to print journalists. I doubt it. Why? Because it would have raised obvious ethical issues, and questioned the journalists’ professionalism.

So were they assuming that bloggers had different ethics? I doubt they thought that hard – more likely, some bright spark thought that eager, amateur bloggers would jump at the chance to get anything for their hard work.

Here’s reason number 2: other bloggers will have been approached with the same offer. If they saw me review the product they would assume that I had done so in exchange for this prize draw ticket. They would see me as unprofessional, unethical, or both.

In PR terms, then, the approach was counter-productive: it actually made me less likely to give their client coverage.
Continue reading

Guest post: visualising mobile phone data – the data retention app


In a guest post Lorenz Matzat, editor of ZEIT Online’s Open Data Blog, writes about the background to their online app exploring the issues around data retention by mobile phone companies.

It’s not very often that one can follow the direct impact of an article, let alone a piece of data journalism. But the visualization of the cellphone data of Malte Spitz from the Green party in Germany led to visible repercussions in the US.

Following a piece in the New York Times about Spitz and the data app, two senators wrote a letter some days ago to the four main US carriers asking for information about their data retention policies.

After we published the app in German one month ago (with the English version following 20 days later), the feedback was overwhelming. We didn’t think that so many people would be so interested in it. But Twitter and Facebook in Germany went wild with it for some days – along with coverage on many major tech websites.

This is probably why data journalism works: it makes visible an abstract notion everybody knows about – that every position of yours, and every connection your mobile phone makes, is (or could be) logged. Every call, text message and data connection.

The background

Around February 1st, ZEIT Online asked me if I had an idea what to do with the dataset of Malte Spitz (read the background story about the legal action of Spitz to get the data here). Continue reading

Data for journalists: understanding XML and RSS

If you are working with data chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.

There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.

What is XML?

XML is a language which is used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.

If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you. Continue reading

The Charlie Sheen Twitter intern hoax – how it could be avoided

Hoax email Charlie Sheen

image from JonnyCampbell

Various parts of the media were hoaxed this week by Belfast student Jonny Campbell’s claim to have won a Twitter internship with Charlie Sheen. The hoax was well planned, and to be fair to the journalists, they did chase up documentation to confirm it. Where they made mistakes provides a good lesson in online verification.

This post is a duplicate version – see the original in full here.

Blocking content sites by ‘self-regulation’ – a recipe for easy censorship

At the start of this month I said that journalists were failing to “protect the public sphere”. Well, here’s just one example of this in action that we need to be watching.

Ed Vaizey, Minister for Culture, Communications and Creative Industries, has confirmed to the Open Rights Group “that discussion are ongoing between rights-holders and Internet Service Providers about ‘self-regulatory’ site-blocking measures.”

For journalists any move in this direction should be particularly concerning, as it provides a non-legal avenue (i.e. without due process) for anyone to suppress information they don’t like.

The point is not blocking sites, but the ease with which it might be done. If distribution van drivers ‘self-regulated’ to stop delivering newspapers whenever anyone complained, publishers and journalists would have a problem. An avenue to appeal doesn’t solve it, because by then the editorial moment will likely have passed – not to mention the extra costs it incurs for content producers.

Here are some precedents from elsewhere:

If you want to write to your MP, you can do so here.

Communities of practice: teaching students to learn in networks

One of the problems in teaching online journalism is that what you teach today may be out of date by the time the student graduates.

This is not just a technological problem (current services stop running; new ones emerge that you haven’t taught; new versions of languages and software are released) but also a problem of medium: genres such as audio slideshows, mapping, mashups, infographics and liveblogging have yet to settle down into an established ‘formula’.

In short, I don’t believe it’s wise to simply ‘teach online journalism’. You have to combine basic principles as they are now with an understanding of how to continue to learn the medium as it develops.

This year I set MA Online Journalism students at Birmingham City University an assignment which attempts to do this.

It’s called ‘Communities of Practice’ (the brief is here). The results are in, and they are very encouraging. Here’s what emerged:

Continue reading