Tag Archives: leads

Words as data: how data journalists tell stories about documents and text

Documents and other collections of text can be goldmines for data journalism — if you know how to approach them as data. Here are some techniques and inspiration for your next data project.

From stories about political speech and song lyrics, to street names and social media chatter, data journalists now have a wide range of examples of text-as-data to draw inspiration and guidance from, while tools such as Pinpoint and NotebookLM are making text analysis easier than ever.

I compiled a list of over 200 pieces of data journalism where text or documents were used as sources. Quantification techniques ranged from counting the frequency of a single word and using Google’s ngram viewer, to machine learning and topic modelling.

Looking at those articles, it’s clear that once text has been quantified, journalists tell the same stories about it as about any other kind of data, using the seven most common angles.

But how those angles are used — and how often — is where it gets interesting…

7 common angles for data stories: text and documents 
Scale: how often words/phrases are used
Change: how language has changed
Ranking: the most/least common words/phrases
Variation: e.g. in relation to gender, ethnicity, ideology etc.
Exploration: journeys through multiple angles; interactives
Relationships: correlations, similarities and connections
Meta: ‘how we quantified text’
Leads: clusters, patterns or themes for further digging
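
As a rough illustration of the first three angles (scale, ranking and change), here is a minimal Python sketch of the simplest technique mentioned above: counting word frequencies. The sample texts, the years and the `word_counts` helper are all invented for this example; real projects would need proper tokenisation, stop-word handling and far more text.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample texts, invented purely for illustration.
speeches = {
    2010: "We will cut taxes. Taxes are too high and jobs are scarce.",
    2020: "Jobs, jobs, jobs: climate jobs and fair taxes.",
}

counts = {year: word_counts(text) for year, text in speeches.items()}

# Scale: how often a single word is used in one text.
scale = counts[2010]["taxes"]

# Ranking: the most common word in one text.
ranking = counts[2020].most_common(1)

# Change: how use of a word differs between the two texts.
change = counts[2020]["jobs"] - counts[2010]["jobs"]

print(scale, ranking, change)
```

Raw counts like these are only a starting point: comparing texts of different lengths usually means converting counts to rates (for example, uses per 1,000 words) before drawing any conclusion about change or variation.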

Could moderators collect potential leads from comments?

Guardian community moderator Todd Nash* makes an interesting suggestion on his blog about the difficulties journalists face in wading through comments on their stories:

“there is potential for news stories to come out of user activity on newspaper websites. Yet, as far as I know, it is not a particularly well-utilised area. Time is clearly an issue here. How many journalists have time to scroll through all of their comments to search for something that could well resemble a needle in a haystack? It was commented that, ironically, freelancers may make better use of this resource as their need for that next story is greater than their staff member counterparts.

“The moderation team at guardian.co.uk now has a Twitter feed @GuardianVoices which highlights good individual comments and interesting debate. Could they be used as a tool to collect potential leads? After all, moderators will already be reading the majority of content of the publication they work for. However, it would require a rather different mindset to look out for story leads compared to the more usual role of finding and removing offensive content.”

It’s an idea worth considering – although, as Todd himself concludes:

“Increased interactivity with users builds trust, which in turn produces a higher class of debate and, with it, more opportunities for follow-up articles. Perhaps it is now time for the journalists to take inspiration from their communities as well.”

That aside, could this work? Could moderators work to identify leads?

*Disclosure: he’s also a former student of mine