“Imagine if you could search for any calculations and then just use them directly without ever having to work it out yourself from scratch.”
Over at Help Me Investigate Health I’ve just published a bunch of 20 places to keep up to date with clinical commissioning. It’s an example of something I’ve written about previously – setting up an online network infrastructure as a journalist. And below, I explain the process behind it:
Following CCGs across local newspapers and blogs
If you’re going to start scrutinising a field, it’s very useful to be kept up to date with developments in that field:
- Concerns raised in one local newspaper may be checked elsewhere;
- Specialist magazines may provide guides to jargon or processes that help save you a lot of time;
- Politicians might raise concerns and get answers;
- And expert bloggers can provide leads and questions that you might want to follow up.
Rather than checking a list of websites on the off chance that one has been updated, a much more efficient way to keep up to date on what’s happening is to use a free RSS reader.
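To give a feel for what an RSS reader does behind the scenes, here is a minimal sketch in Python. It assumes a plain RSS 2.0 feed; the feed text, titles and example.com links are made up for illustration, and a real reader would fetch the feed from its web address rather than from a string.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed standing in for a real one.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example health news</title>
    <item>
      <title>CCG board papers published</title>
      <link>http://example.com/ccg-board-papers</link>
    </item>
    <item>
      <title>Commissioning budget questioned</title>
      <link>http://example.com/budget-questioned</link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in list_items(SAMPLE_FEED):
    print(title, "->", link)
```

Run against a dozen feeds on a schedule, the same loop gives you one list of new headlines instead of a dozen sites to check by hand.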
There have been quite a few tools springing up over the past few months that I’ve not had time to blog about, so here’s a roundup post on all of them – a bumper Something For The Weekend (let me know how you find these).
1. Junar – for scraping websites and sharing data
Junar presents a much easier way to scrape data from online tables with its ‘Collect Data’ tool – and the team behind it tell me they have plans to build functionality allowing users to scrape linked pages, as well as the ability to scrape PDFs.
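If you are curious what scraping a table involves under the bonnet, the sketch below pulls the rows out of an HTML table using only Python's standard library. This is not how Junar works internally (I don't know that); the `TableScraper` class and the sample table are hypothetical and purely for illustration.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # the row currently being read, if any
        self._cell = None   # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# A made-up table standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Area</th><th>Budget</th></tr>
  <tr><td>North</td><td>120</td></tr>
  <tr><td>South</td><td>95</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.rows)
```

Each row comes back as a list of strings, ready to be written out as a spreadsheet.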
Google have launched a ‘Search by Image’ service which allows you to find images by uploading, dragging over, or pasting the URL of an existing image.
The service should be particularly useful to journalists seeking to verify or debunk images they’re not sure about.
(For examples where it may have been useful, look no further than this week’s Gay Syrian Blogger story, as well as the ‘dead’ Osama Bin Laden images that so many news outlets fell for.)
TinEye, a website and Firefox plugin, does the same thing – but it will be interesting to see if Google’s service is more or less powerful (let me know how you get on with it). Find it here. Video here.
Here’s another data analysis tool which is worth keeping an eye on. Extractiv “lets you transform unstructured web content into highly-structured semantic data.” Eyes glazing over? Okay, over to ReadWriteWeb:
“To test Extractive, I gave the company a collection of more than 500 web domains for the top geolocation blogs online and asked its technology to sort for all appearances of the word “ESRI.” (The name of the leading vendor in the geolocation market.)
“The resulting output included structured cells describing some person, place or thing, some type of relationship it had with the word ESRI and the URL where the words appeared together. It was thus sortable and ready for my analysis.
“The task was partially completed before being rate limited due to my submitting so many links from the same domain. More than 125,000 pages were analyzed, 762 documents were found that included my keyword ESRI and about 400 relations were discovered (including duplicates). What kinds of patterns of relations will I discover by sorting all this data in a spreadsheet or otherwise? I can’t wait to find out.”
What that means in even plainer language is that Extractiv will crawl thousands of webpages to identify relationships and attributes for a particular subject.
This has obvious applications for investigative journalists: give the software a name (of a person or company, for example) and a set of base domains (such as news websites, specialist publications and blogs, industry sites, etc.) and set it going. At the end you’ll have a broad picture of what other organisations and people have been connected with that person or company. Connections you can ask it to identify include relationships, ownership, former names, telephone numbers, companies worked for, worked with, and job positions.
It won’t answer your questions, but it will suggest some avenues of enquiry, and potential sources of information. And all within an hour.
Time and cost
ReadWriteWeb reports that the process above took around an hour “and would have cost me less than $1, after a $99 monthly subscription fee. The next level of subscription would have been performed faster and with more simultaneous processes running at a base rate of $250 per month.”
As they say, the tool represents “commodity level, DIY analysis of bulk data produced by user generated or other content, sortable for pattern detection and soon, Extractiv says, sentiment analysis.”
Which is nice.
When I first saw Freebase Gridworks I was a very happy man. Here was a tool that tackled one of the biggest problems in data journalism: cleaning dirty data (and data is invariably dirty). The tool made it easy to identify variations of a single term, and clean them up, to link one set of data to another – and much more besides.
Gridworks has since been renamed Google Refine, and it’s notable that Google are explicitly positioning it in their video (above) as a “data journalism” tool.
You can download Google Refine here.
Further videos below. The first explains how to take a list on a webpage and convert it into a cleaned-up dataset – a useful alternative to scraping:
The second video explains how to link your data to data from elsewhere, aka “reconciliation” – e.g. extracting latitude and longitude or language.
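Going back to the clean-up side of things: the idea of spotting variations of a single term can be sketched in a few lines of Python. This uses a simple 'fingerprint' approach (lower-case the text, strip punctuation, sort the words) which is one common way of doing it; I am not claiming it is exactly what the tool does, and the function names and sample values are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalise a value so that trivial variations collapse to one key."""
    cleaned = value.strip().lower()
    # Drop punctuation such as commas and full stops.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so that word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Made-up entries of the kind you might find in a dirty spreadsheet column.
names = ["NHS Birmingham", "nhs birmingham ", "Birmingham, NHS", "NHS Leeds"]
print(cluster(names))
```

The three Birmingham spellings end up in one group, which you can then replace with a single agreed version.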
This is a draft from a book chapter on data journalism (here are parts one, two and three, which looks at the charts side of visualisation). I’d really appreciate any additions or comments you can make – particularly around tips and tools.
So if you want to visualise some data or text, how do you do it? Thankfully there are now dozens of free and cheap pieces of software that you can use to quickly turn your tables into charts, graphs and clouds.
The best-known tool for creating word clouds is Wordle (wordle.net). Simply paste a block of text into the site, or the address of an RSS feed, and the site will generate a word cloud whose fonts and colours you can change to your preferences. Similar tools include Tagxedo (tagxedo.com) and Wordlings (http://wordlin.gs), both of which allow you to put your word cloud into a particular shape.
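Underneath, a word cloud is just a count of how often each word appears, with the biggest counts drawn in the biggest type. Here is a rough sketch of that counting step in Python. The stop-word list and sample text are my own made-up examples, and I am not suggesting this is how any of the tools above do it.

```python
import re
from collections import Counter

# A short, hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def word_frequencies(text, top=5):
    """Return the `top` most frequent words in `text`, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)


sample = "Data is dirty. Cleaning data takes time, and the data is worth the time."
print(word_frequencies(sample, top=2))
```

A cloud generator would then scale each word's font size according to its count.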
ManyEyes (manyeyes.alphaworks.ibm.com/manyeyes/) also allows you to create word clouds and tag clouds – as well as word trees and phrase nets that allow you to see common phrases. But it is perhaps most useful in allowing you to easily create scattergrams, bar charts, bubble charts and other forms. The site also contains a raft of existing data that you can play with to get a feel for the site. Similar tools that allow access to other data include Factual (factual.com), Swivel (swivel.com)[see comments], Socrata (socrata.com) and Verifiable.com (verifiable.com). And Google Fusion Tables (tables.googlelabs.com) is particularly useful if you want to collaborate on tables of data, as well as offering visualisation options.
More general visualisation tools include widgenie (widgenie.com), iCharts (icharts.net), ChartTool (onlinecharttool.com) and ChartGo (www.chartgo.com). FusionCharts is a piece of visualisation software with a Google Gadget service that publishers may find useful. You can find instructions on how to use it at www.fusioncharts.com/GG/Docs
If you want more control over your visualisation – or want it to update dynamically when the source information is updated – Google Chart Tools (code.google.com/apis/charttools) is worth exploring. This requires some technical knowledge, but there is a lot of guidance and help on the site to get you started quickly.
Tableau Public is a piece of free software you can download (tableausoftware.com/public) with some powerful visualisation options. You will also find visualisation options on spreadsheet applications such as Excel or the free Google Docs spreadsheet service. These are worth exploring as a way to quickly generate charts from your data on the fly.
Publishing your visualisation
There will come a point when you’ve visualised your data and need to publish it somehow. The simplest way to do this is to take an image (screengrab) of the chart or graph. This can be done with a web-based screencapture tool like Kwout (kwout.com), a free desktop application like Skitch (skitch.com) or Jing (jingproject.com), or by simply using the ‘Print Screen’ button on a PC keyboard (cmd+shift+3 on a Mac) and pasting the screengrab into a graphics package such as Photoshop.
The advantage of using a screengrab is that the image can be easily distributed on social networks, image sharing websites (such as Flickr), and blogs – driving traffic to the page on your site where it is explained.
If you are more technically minded, you can instead choose to embed your chart or graph. Many visualisation tools will give you a piece of code which you can copy and paste into the HTML of an article or blog post in the place you wish to display it (this will not work on most third party blog hosting services, such as WordPress.com). One particular advantage of this approach is that the visualisation can update itself if the source data is updated.
Showing your raw data
It is generally a good idea to give users access to your raw data alongside its visualisation. This not only allows them to check it against your visualisation but also to add insights you may not otherwise gain. It is relatively straightforward to publish a spreadsheet online using Google Docs (see the sidebar on publishing a spreadsheet).
SIDEBAR: How to: publish a spreadsheet online
Google Docs (docs.google.com) is a free website which allows you to create and share documents. You can share them via email, by publishing them as a webpage, or by embedding your document in another webpage, such as a blog post. This is how you share a spreadsheet:
- Open your spreadsheet in Google Docs. You can upload a spreadsheet into Google Docs if you’ve created it elsewhere – there is a size limit, however, so if you are told the file is too big try removing unnecessary sheets or columns.
- Look for the ‘Share’ button (currently in the top right corner) and click on it.
- A drop-down menu should appear. Click on ‘Publish as a web page’.
- A new window should appear asking which sheets you want to publish. Select the sheet you want to publish and click ‘Start publishing’ (you should also make sure ‘Automatically republish when changes are made’ is ticked if you want the public version of the spreadsheet to update with any data you add.)
- Now the bottom half of that window – ‘Get a link to the published data’ – should become active. In the bottom box should be a web address where you can now see the public version of your spreadsheet. If you want to share that, copy the address and test that it works in a web browser. You can now link to it from any webpage.
- Alternatively, you can embed your spreadsheet – or part of it – in another webpage. To do this click on the first drop-down menu in this area – it will currently say ‘Web page’ – and change it to ‘HTML to embed in a page’. Now the bottom box on this window should show some HTML that begins with
- If you want to embed just part of a spreadsheet, in the box that currently says ‘All cells’ type the range of cells you wish to show. For example, typing A1:G10 will select all the cells in your spreadsheet from A1 (the first row of column A) to G10 (the 10th row of column G). Once again, the HTML below will change so that it only displays that section of your spreadsheet.
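If the cell range notation in that last step is new to you, this small Python sketch shows how a range like A1:G10 breaks down into columns and rows. The function names are hypothetical and this is only an illustration of the notation, not anything Google Docs exposes.

```python
import re


def column_number(letters):
    """Convert spreadsheet column letters to a 1-based number (A=1, Z=26, AA=27)."""
    number = 0
    for letter in letters.upper():
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_range(cell_range):
    """Split a range such as 'A1:G10' into (first_col, first_row, last_col, last_row)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a cell range: {cell_range!r}")
    col1, row1, col2, row2 = match.groups()
    return column_number(col1), int(row1), column_number(col2), int(row2)


first_col, first_row, last_col, last_row = parse_range("A1:G10")
columns = last_col - first_col + 1
rows = last_row - first_row + 1
print(columns, "columns x", rows, "rows =", columns * rows, "cells")
```

So A1:G10 covers seven columns (A to G) and ten rows, or 70 cells in all.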
Once again, I’d welcome any comments on things I may have missed or tips you can add. Part 5, on mashups, is now available here.