Monthly Archives: July 2011

SFTW: How to scrape webpages and ask questions with Google Docs and =importXML

XML puzzle cube
Image by dullhunk on Flickr

Here’s another Something for the Weekend post. Last week I wrote a post on how to use the =importFeed formula in Google Docs spreadsheets to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns. Another formula which performs a similar function more powerfully is =importXML.

There are at least 2 distinct journalistic uses for =importXML:

  1. You have found information that is only available in XML format and need to put it into a standard spreadsheet to interrogate it or combine it with other data.
  2. You want to extract some information from a webpage – perhaps on a regular basis – and put that in a structured format (a spreadsheet) so you can more easily ask questions of it.

The first task is the easier of the two, so I’ll explain how to do that in this post. I’ll use a separate post to explain the latter.

Converting an XML feed into a table

If you have some information in XML format it helps if you have some understanding of how XML is structured. A backgrounder on how to understand XML is covered in this post explaining XML for journalists.

It also helps if you are using a browser which is good at displaying XML pages: Chrome, for example, not only staggers and indents different pieces of information, but also allows you to expand or collapse parts of that, and colours elements, values and attributes (which we’ll come on to below) differently.

Say, for example, you wanted a spreadsheet of UK council data, including latitude, longitude, CIPFA code, and so on – and you found the data, but it was in XML format at a page like this: http://openlylocal.com/councils/all.xml

To pull that into a neatly structured spreadsheet in Google Docs, type the following into the cell where you want the import to begin (try typing in cell A2, leaving the first row free for you to add column headers):

=ImportXML("http://openlylocal.com/councils/all.xml", "//council")

The formula (or, more accurately, function) needs two pieces of information, which are contained in the parentheses and separated by a comma: a web address (URL), and a query. Or, put another way:

=importXML("theURLinQuotationMarks", "theBitWithinTheURLthatYouWant")

The URL is relatively easy – it is the address of the XML file you are reading (it should end in .xml). The query needs some further explanation.

The query tells Google Docs which bit of the XML you want to pull out. It uses a language called XPath – but don’t worry, you will only need to note down a few queries for most purposes.

Here’s an example of part of that XML file shown in the Chrome browser:

XML from OpenlyLocal

The indentation and triangles indicate the way the data is structured. So, the <councils> tag contains at least one item called <council> (if you scrolled down, or clicked on the triangle to collapse <council> you would see there are a few hundred).

And each <council> contains an <address>, <authority-type>, and many other pieces of information.
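
To make that structure concrete, here is a stripped-down sketch of the shape of that file. The element names are the ones mentioned in this post; the values are left out, and each real <council> contains many more elements than are shown here.

<councils>
  <council>
    <name>…</name>
    <address>…</address>
    <authority-type>…</authority-type>
    …
  </council>
  <council>
    …
  </council>
</councils>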

If you wanted to grab every <council> from this XML file, then, you use the query “//council” as shown above. Think of the // as a replacement for the < in a tag – you are saying: ‘grab the contents of every item that begins <council>’.
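
For reference, these are the XPath patterns you are likely to need most often with =importXML (the element names come from the OpenlyLocal example – exactly what each one returns depends on how the file you are querying is structured):

//council – every <council> element in the file
//council/name – a <name> sitting directly inside each <council>
//council//name – a <name> anywhere inside each <council>, however deeply it is nested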

You’ll notice that in your spreadsheet where you have typed the formula above, it gathers the contents (called a value) of each tag within <council>, with each tag’s value going into its own column – giving you dozens of columns.

You can continue this logic to look for tags within tags. For example, if you wanted to grab the <name> value from within each <council> tag, you could use:

=ImportXML("http://openlylocal.com/councils/all.xml", "//council//name")

You would then only have one column, containing the names of all the councils – if that’s all you wanted. You could of course adapt the formula again in cell B2 to pull another piece of information. However, you may end up with a mismatch of data where that information is missing – so it’s always better to grab all the XML once, then clean it up on a copy.
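
For illustration, if you did adapt the formula in cell B2 to pull the <address> of each council – an element mentioned above – it might look like the line below, with the caveat just mentioned about councils where that information is missing:

=ImportXML("http://openlylocal.com/councils/all.xml", "//council//address")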

If the XML is more complex then you can ask more complex questions – which I’ll cover in the second part of this post. You can also put the URL and/or query in other cells to simplify matters, e.g.

=ImportXML(A1, B1)

Where cell A1 contains http://openlylocal.com/councils/all.xml and B1 contains //council (note the lack of quotation marks). You then only need to change the contents of A1 or B1 to change the results, rather than having to edit the formula directly.

If you’ve any other examples, ideas or corrections, let me know. Meanwhile, I’ve published an example spreadsheet demonstrating all the above techniques here.

 

Kit Review: Gymbl Pro iPhone Mount

The Gymbl – makes your iPhone not quite a good phone or a good camera

Jon Hickman reviews iPhone tripod Gymbl Pro.

Jonathan Ive didn’t design my iPhone with a pistol grip. Instead of a hard, brittle-feeling, bumpy plastic case, Jonathan Ive fashioned a fetish object wrapped in perfectly smooth, flat glass. Jonathan Ive did not design the Gymbl Pro, by Youbiq.

Would Jonathan Ive use a Gymbl Pro in pistol grip mode to shoot a video?

No, he wouldn’t. Jonathan Ive would use an iPhone 4 to grab short informal videos. If he was in the field and had a chance to grab an important interview with somebody, he’d not hesitate to use his iPhone 4 in the hand. Jony would know that informal handheld shooting would add a sense of urgency or intimacy to his media file – he’d say: “There’s no need for handheld slickness, we’re over that now”. Continue reading

The style challenge

Spot the odd one out. Image by Cliff Muller

Time was when a journalist could learn one or two writing styles and stick with them.

They might command enormous respect for being the best at what they did. But sometimes, when that journalist moved to another employer, their style became incongruous.

And they couldn’t change.

This is the style challenge, and it’s one that has become increasingly demanding for journalists in an online age. Because not only must they be able to adapt their style for different types of reporting; not only must they be able to adapt for different brands; not only must they be able to adapt their style within different brands across multiple media; but they must also be able to adapt their style within a single medium, across multiple platforms: Twitter, Facebook, blogs, Flickr, YouTube, or anywhere else that their audiences gather.

Immersion and language

Style is a fundamental skill in journalism. It is difficult to teach, because it relies on an individual immersing themselves in media, and doing so in a way that goes beyond each message to the medium itself.

This is why journalism tutors urge their students so strongly to read as many newspapers as they can; to watch the news and listen to it, obsessively.

Without immersion it is difficult to speak any language.

Now. Some people do immerse themselves and have a handle on current affairs. That’s useful, but not the point. Some do it and gain an understanding of institutions and audiences (that one is left-leaning; this one is conservative with a small c, etc.). This is also useful, but also not the point.

The point is about how each institution addresses each audience, and when.

Despite journalists and editors often having an intuitive understanding of this difference in print or broadcast, over the last decade they’ve often demonstrated an inability to apply the same principles when it comes to publishing online.

And so we’ve had shovelware: organisations republishing print articles online without any changes.

We’ve had opinion columns published as blogs because ‘blogs are all about opinion’.

And we’ve had journalists treating Twitter as just another newswire to throw out headlines.

This is like a person’s first attempt at a radio broadcast where they begin by addressing “Hey all you out there” as if they’re a Balearic DJ.

Good journalists should know better.

Style serves communication

Among many other things a good journalism or media degree should teach not just the practical skills of journalism but an intellectual understanding of communication, and by extension, style.

Because style is, at its base, about communication. It is about register: understanding what tone to adopt based on who you are talking to, what you are talking about, the relationship you seek to engender, and the history behind that.

As communication channels and tools proliferate, we probably need to pay more attention to that.

Journalists are being asked to adapt their skills from print to video; from formal articles to informal blog posts; from Facebook Page updates to tweets.

They are having to learn new styles of liveblogging, audio slideshows, mapping and apps; to operate within the formal restrictions of XML or SEO.

For freelance journalists, commissioning briefs increasingly ask for that flexibility even within the same piece of work, offering extra payments for an online version, a structured version, a podcast, and so on.

These requests are often quite basic – requiring a list of links for an online version, for example – but as content management systems become more sophisticated, those conditions will become more stringent: supplying an XML file with data on a product being reviewed, for example, or a version optimised for search.

What complicates things further is that, for many of these platforms, we are inventing the language as we speak it.

For those new to the platform, it can be intimidating. But for those who invest time in gaining experience, it is an enormous opportunity. Because those who master the style of a blog, or Facebook, or Twitter, or addressing a particular group on Flickr, or a YouTube community, put themselves in an incredible position, building networks that a small magazine publisher would die for.

That’s why style is so important – now more than ever, and in the future more than now.

SFTW: How to grab useful political data with the They Work For You API

They Work For You

It’s been over 2 years since I stopped doing the ‘Something for the Weekend’ series. I thought I would revive it with a tutorial on They Work For You and Google Refine…

If you want to add political context to a spreadsheet – say you need to know what political parties a list of constituencies voted for, or the MPs for those constituencies – the They Work For You API can save you hours of fiddling – if you know how to use it.

An API is – for the purposes of journalists – a way of asking questions of reams of data. For example, you can use an API to ask “What constituency is each of these postcodes in?” or “When did these politicians enter office?” or even “Can you show me an image of these people?”

The They Work For You API will give answers to a range of UK political questions on subjects including Lords, MLAs (Members of the Legislative Assembly in Northern Ireland), MPs, MSPs (Members of the Scottish Parliament), select committees, debates, written answers, statements and constituencies.

When you combine that API with Google Refine you can fill a whole spreadsheet with additional political data, allowing you to answer questions you might otherwise not be able to.
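
As a rough sketch of what that looks like in practice: in Google Refine you would use ‘Add column by fetching URLs’ on a column of constituency names, with an expression along these lines. The getMP function and its constituency, key and output parameters are my assumptions about the They Work For You API – check them against the API documentation – and YOUR_API_KEY is a placeholder for the key discussed below.

"http://www.theyworkforyou.com/api/getMP?constituency=" + escape(value, "url") + "&key=YOUR_API_KEY&output=xml"

Refine then fetches that URL for every row and puts whatever XML comes back into a new column, ready to be parsed into separate columns.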

I’ve written before on how to use Google Refine to pull data into a spreadsheet from the Google Maps API and the UK Postcodes API, but this post takes things a bit further because the They Work For You API requires something called a ‘key’. This is quite common with APIs so knowing how to use them is – well – key. If you need extra help, try those tutorials first. Continue reading

How to collaborate (or crowdsource) by combining Delicious and Google Docs


RSS girl by HeatherWeaver on Flickr

During some training in open data I was doing recently, I ended up explaining (it’s a long story) how to pull a feed from Delicious into a Google Docs spreadsheet. I promised I would put it down online, so: here it is.

In a Google Docs spreadsheet the formula =importFeed will pull information from an RSS feed and put it into that spreadsheet. Titles, links, datestamps and other parts of the feed will each be separated into their own columns.
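
As a minimal sketch – assuming your Delicious RSS feed address is sitting in cell A1 – the formula looks something like this (the optional third and fourth arguments, set here to include column headers and fetch up to 20 items, follow Google’s documented ImportFeed signature):

=ImportFeed(A1, "items", TRUE, 20)

You can also paste the feed URL directly into the formula in quotation marks, as with the =importXML examples above.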

When combined with Delicious, this can be a useful way to collect together pages that have been bookmarked by a group of people, or any other feed that you want to analyse.

Here’s how you do it: Continue reading

Learning about community strategies: 10 lessons

Back in February I blogged about the process of teaching journalism students to think about working with communities. The results have been positive: even where the strategy itself wasn’t successful, the individuals have learned from its execution, its research, or both. And so, for those who were part of this process – and anyone else who’s interested – I thought I would summarise 10 key themes that came through the resulting work.

1. A community strategy isn’t something you can execute effectively in one month

Perhaps the number one lesson that people drew from the experience was that they should have started early, and done little, often, rather than a lot all at once. There was a tendency to underestimate the needs of community management and a need for better time management.

Communities needed time to “grow organically”, wrote one; it wasn’t a top down approach. Members might also have felt they were being “manipulated” when weeks of inactivity were followed by a flood of posts, links and questions.

Continue reading

FAQ: How can broadcasters benefit from online communities?

Here’s another set of questions I’m answering in public in case anyone wants to ask the same:

How can broadcasters benefit from online communities?

Online communities contain many individuals who will be able to contribute different kinds of value to news production. Most obviously, expertise, opinion, and eyewitness testimony. In addition, they will be able to more effectively distribute parts of a story to ensure that it reaches the right experts, opinion-formers and eyewitnesses. The difference from an audience is that the members of a community tend to be specialised, and connected to each other.

If you rephrase the question as ‘How can broadcasters benefit from people?’ it may be clearer.

How does a broadcaster begin to develop an engaged online community, any tips?

Over time. Rather than asking about how you develop an online community ask yourself instead: how do you begin to develop relationships? Waiting until a major news event happens is a bad strategy: it’s like waiting until someone has won the lottery to decide that you’re suddenly their friend.

Journalists who do this well do a little bit every so often – following people in their field, replying to questions on social networks, contributing to forums and commenting on blogs, and publishing blog posts which are helpful to members of that community rather than simply being about ‘the story’ (for instance, ‘Why’ and ‘How’ questions behind the news).

In case you are aware of networks in the Middle East, do you think they are tapping into online communities and social media adequately?

I don’t know the networks well enough to comment – but I do think it’s hard for corporations to tap into communities; it works much better at an individual reporter level.

Can you mention any models whether it is news channels or entertainment television which have developed successful online communities, why do they work?

The most successful examples tend to be newspapers: I think Paul Lewis at The Guardian has done this extremely successfully, and I think Simon Rogers’ Data Blog has also developed a healthy community around data and visualisation. Both of these are probably due in part to the work of Meg Pickard there around community in general.

The BBC’s UGC unit is a good example from broadcasting – although that is less about developing a community than about providing platforms for others to contribute, and a way for journalists to quickly find expertise in those communities. More specifically, Robert Peston and Rory Cellan-Jones use their blogs and Twitter accounts well to connect with people in their fields.

Then of course there’s Andy Carvin at NPR, who is an exemplar of how to do it in radio. There’s so much written about what he does that I won’t repeat it here.

What are the reasons that certain broadcasters cannot connect successfully with online communities?

I expect a significant factor is regulation which requires objectivity from broadcasters but not from newspapers. If you can’t express an opinion then it is difficult to build relationships, and if you are more firmly regulated (which broadcasting is) then you take fewer risks.

Also, there are more intermediaries in broadcasting and fewer reporters who are public-facing, which makes the prospect of speaking directly to the former audience that much more intimidating for some journalists.