If you are working with data chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.
There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.
What is XML?
XML is a language which is used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.
If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you.
(That said, this post doesn’t show you how to do any of that – it is mainly aimed at explaining how XML works so that you can begin to think about those possibilities.)
XML stands for “eXtensible Markup Language”. It’s the ‘markup’ bit which is key: XML ‘marks up’ information as being something in particular: relating to a particular date, for example; or a particular person; or referring to a particular location.
For example, a snippet of XML like this -
- tells you that the ‘Paris’ in this instance is a city, rather than a celebrity. And that it’s in France, not Texas.
That makes it easier for you to filter out information that isn’t relevant, or combine particular bits of information with data from elsewhere.
For example, if an XML file contains information on authors, you can filter out all but those by the person you’re interested in; if it contains publication dates, you can use that to plot associated content on a timeline.
Most usefully, if you have a set of data yourself such as a spreadsheet, you can pull related data from a relevant XML file. If your spreadsheet contains football teams and the XML provides locations, images, and history for each, then you can pull that in to create a fuller picture. If it contains addresses, there are services that will give you XML files with the constituency for those postcodes.
What is RSS?
RSS is a whole family of formats which are essentially based on XML – so they are structured in the same way, containing ‘markup’ that might tell you the author, publication date, location or other details about the information it relates to.
There is a lot of variation between different versions of RSS, but the main thing for the purposes of this post is that the various versions of RSS, and XML, share a structure which journalists can use if they know how to.
Which version isn’t particularly important: as long as you understand the principles, you can adapt what you do to suit the document or feed you’re working with.
Looking at XML and RSS
XML documents (for simplicity’s sake I’ll mostly just refer to ‘XML’ for the rest of this post, although I’m talking about both XML and RSS) contain two things that are of interest to us: content, and information about the content (‘markup’).
Information about the content is contained within tags in angle brackets (also known as chevrons): ‘<’ and ‘>’
For example: <name> or <pubDate> (publication date).
The tag is followed by the content itself, and a closing tag that has a forward slash, e.g. </name> or </pubDate>, so one line might look like this:
At this point it’s useful to have some XML or RSS in front of you. For a random example go to the RSS feed for the Scottish Government News.
To see the code right-click on that page and select View Source or similar – Firefox is worth using if another browser does not work; the Firebug extension also helps. (Note: if the feed is generated by Feedburner this won’t work: look for the ‘View Feed XML‘ button in the middle right area or add ?format=xml to the feed URL).
What you should see will include the following:
<item> <title>Manufactured Exports Q4 2010</title> <link>http://www.scotland.gov.uk/News/Releases/2011/04/06100351</link> <description>A National Statistics publication for Scotland.</description> <guid isPermaLink="true">http://www.scotland.gov.uk/News/Releases/2011/04/06100351</guid> <pubDate>Wed, 06 Apr 2011 00:00:00 GMT</pubDate> </item>
In the RSS feed itself this doesn’t start until line 14 (the first 13 lines are used to provide information about the feed as a whole, such as the version of RSS, title, copyright etc).
But from line 14 onwards this pattern repeats itself for a number of different ‘items’.
As you can see, each item has a title, a link, a description, a permalink, and a publication date. These are known as child elements (the item is the parent, or the ‘root element’).
More journalistic examples can be found at Mercedes GP’s XML file of the latest F1 Championship Standings (see the PS at the end of Tony Hirst’s post for an explanation of how this is structured), and MySociety’s Parliament Parser, which provides XML files on all parts of government, from MPs and peers to debates and constituencies, going back over a decade. Look at the Ministers XML file in Firefox and scroll down until you get to the first item tagged <ministerofficegroup>. Within each of those are details on ministerial positions. As the Parliament Parser page explains:
“Each one has a date range, the MP or Lord became a minister at some time on the start day, and stopped being one at some time on the end day. The matchid field is one sample MP or Lord office which that person also held. Alternatively, use the people.xml file to find out which person held the ministerial post.”
You’ll notice from that quote that some parts of the XML require cross-referencing to provide extra details. That’s where XML becomes very useful.
Using it in practice: working with XML in Yahoo! Pipes
Yahoo! Pipes provides a good introduction in working with data in XML or RSS. You’ll need to sign up at Pipes.Yahoo.com and click on ‘Create a Pipe‘.
You’ll now be editing a new project. On the left hand column are various ‘modules’ you can use. Click on ‘Sources‘ to expand it, and click and drag ‘Fetch Feed’ onto the graph paper-style canvas.
If you now click on the module so that it turns orange, you should be able (after a few moments) see that feed in the Debugger window at the bottom of the screen.
Click on the handle in the middle to pull it up and see more, and click on the arrows on the left to drill down to the ‘nested’ data within each item.
As you drill down you can see elements of data you can filter. In this case, we’ll use ‘region‘.
To filter the feed based on this we need the Filter module. On the left hand side click on ‘Operators‘ to expand that, and then drag the ‘Filter‘ module into the canvas.
Now drag a pipe from the circle at the bottom of the ‘Fetch Feed’ module to the top of the ‘Filter’ module.
Wait a moment for the ‘Filter’ module to work out what data the RSS feed contains. Then use the drop down menus so that it reads “Permit items that match all of the following”.
The next box determines which piece of data you will filter on. If you click on the drop-down here you should see all the pieces of data that are associated with each item.
We’re going to select ‘region’, and say that we only want to permit items where ‘region’ contains ‘North West’. If any of these don’t make any sense, look at the original RSS feed again to see what they contain.
Now drag a final pipe from the bottom of the ‘Filter’ module to the top of ‘Pipe output‘ at the bottom of the canvas. If you click on either you should be able to see in the Debugger that now only those items relating specifically to the North West are displayed.
If you wanted to you could now save this and click ‘Run Pipe‘ to see the results. Once you do you should notice options to ‘Get as RSS‘ – this would allow you to subscribe to this feed yourself or publish it on a website or Twitter account. There’s also ‘Get as JSON’ which is a whole other story – I’ll cover JSON in a future post.
Oh, and a sidenote: if you wanted to grab an XML file in Yahoo! Pipes rather than an RSS feed, you would use ‘Fetch Data’ instead of ‘Fetch Feed’.
Just the start
There’s much more you can do here. Some suggestions for next steps:
- Try using the Text Input module in Yahoo! Pipes, dragging a line from that to where you typed ‘North West’, for example
- Try playing with the importXML formula in Google Spreadsheets
- Try using matching data in a spreadsheet with data from an XML file using Google Refine.
Those are for future posts. For now I just want to demonstrate how XML works to add information-about-information which you can then use to search, filter, and combine data.
And it’s not just an esoteric language that is used by a geeky few as part of their newsgathering: journalists at Sky News, The Guardian and The Financial Times – to name just a few – all use this as a routine part of publishing, because it provides a way to dynamically update elements within a larger story without having to update the whole thing from scratch – for example by updating casualty numbers or new dates on a timeline.
And while I’m at it, if you have any examples of XML being used in journalism for either newsgathering or publishing, let me know.