Tag Archives: xml

VIDEO: Understanding JSON and XML (when using APIs)

In two previous videos this week I introduced APIs for data journalists, and explained some of the jargon involved. In a short third video — first made for students on the MA in Data Journalism at Birmingham City University and shared as part of a series of video posts — I explain how to understand the data formats you’re likely to come across: JSON and XML.

One useful tool to install in your browser to help with this process is JSONView.
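
If you have never looked at the two formats side by side, this rough Python sketch may help – the council record is entirely made up for illustration:

import json
import xml.etree.ElementTree as ET

# The same made-up record expressed in both formats
as_json = '{"council": {"name": "Birmingham", "population": 1141816}}'
as_xml = "<council><name>Birmingham</name><population>1141816</population></council>"

# JSON parses into nested dictionaries and lists...
print(json.loads(as_json)["council"]["name"])  # Birmingham

# ...while XML parses into a tree of tagged elements
print(ET.fromstring(as_xml).find("name").text)  # Birmingham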

Links mentioned in the video:

Mapping tip: how to convert and filter KML into a list with Open Refine

Original image by GotCredit/Flickr

If you are working with map data that uses the shapes of regions or countries, chances are you’ll need to work with KML. In this guest post (first published on her blog) Carla Pedret explains how you can use the data cleaning tool Open Refine to ‘read’ KML files in order to convert them into other formats (for example to grab the names of places contained in the file).

KML (Keyhole Markup Language) is the default format used by Google’s mapping tool Fusion Tables (Google bought Keyhole, the company which created the format, in 2004), but it is also used by other mapping tools like CartoDB.

The open source data cleaning tool Open Refine can help you to open, process and convert KML files into other formats in order to, for example, match two datasets (VLOOKUP) or create a new map with the information of the KML file.

In this post, you will learn how to convert a KML file into XML and download it as a CSV file.

What is the difference between XML and KML?

XML – Extensible Markup Language – is a language designed to describe data; it is used, among other places, in RSS feeds.

XML uses tags like HTML, but there is a big difference between the two languages. XML is concerned purely with describing the structure of the information, whereas HTML deals with other elements too, including their meaning and arrangement (even though it is not supposed to focus on appearance), and the importing of other code and media.

KML – Keyhole Markup Language – documents are XML files specific for geographical annotations. KML files contain the parameters to add shapes to maps or three-dimensional Earth browsers like Google Earth.

The big advantage of KML files is that users can customize maps according to their data without knowing how to code.
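
To give a sense of what is inside one of these files, here is a heavily simplified sketch – a single made-up placemark, parsed with Python’s standard library – of how a KML file is structured:

import xml.etree.ElementTree as ET

# A drastically simplified KML file: one made-up placemark
kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Birmingham</name>
      <Polygon><outerBoundaryIs><LinearRing>
        <coordinates>-1.90,52.48 -1.89,52.49 -1.91,52.49 -1.90,52.48</coordinates>
      </LinearRing></outerBoundaryIs></Polygon>
    </Placemark>
  </Document>
</kml>"""

# Real KML files declare the namespace above, so searches must include it
ns = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.fromstring(kml)
for placemark in root.findall(".//kml:Placemark", ns):
    print(placemark.find("kml:name", ns).text)  # Birmingham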

Image of a KML map in Google Fusion Tables

Convert a KML file to XML

You can find KML files using Google Tables search (make sure you have ‘Fusion Tables’ selected on the left).

Type what you are searching for and add the word geometry or KML.

Open the fusion table and check that it has shapes by looking for a ‘map’ view (normally this has its own tab).

You should be able to download the KML when looking at that map view by selecting File > Download.

Once downloaded, to convert the file, upload your KML in Open Refine (download Open Refine here) and click Next.

In the blue box under your data, select XML files.

Now in the preview you can see the XML file with the structure of the information.

If you want to create a map with your own data and the shapes in the KML file, you need to match the KML with your data.

The example I have used contains the shapes of local authorities in the UK. I want to match the shapes in one dataset (the KML file) with information in another dataset on which party runs each council.

The element both datasets have in common (and therefore the element which will be used to combine them) is the name of the councils. But you need to check that those elements are the same: in other words, are the councils named in exactly the same way in both datasets, including the use of ampersands and other characters?

Have a look at the XML preview and try to find the tags that contain the information you need: in this case, authority names. In the example the tags containing the authority name are <name></name>.

Hover over that element so that you get a dotted box like the one shown below. Click on that rectangle and wait until the process has finished.

You should then see a column or columns as the picture shows.

On the right hand side of the page, change the name of your file and click on Create a new project.

Once created, you now only need to export it. Click on Export and select the format you prefer.

What originally was a KML file is now a filtered list with data ready to check and match against your other dataset.
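
If you are comfortable with a little code, the whole conversion can also be scripted. Here is a rough Python equivalent of the steps above – councils.kml is just a stand-in name for whatever file you downloaded:

import csv
import xml.etree.ElementTree as ET

ns = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.parse("councils.kml").getroot()  # your downloaded KML file

# Collect the name of every placemark in the file
names = [pm.find("kml:name", ns).text
         for pm in root.findall(".//kml:Placemark", ns)]

# Write a one-column CSV, ready to match against your other dataset
with open("council_names.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])
    writer.writerows([n] for n in names)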

Do you use Open Refine? Leave a comment with your tips and techniques or send them to me at @Carlapedret.

Continue reading

How to: convert XML or JSON into spreadsheets using Open Refine

Curly brackets pattern by Dan McCullough

One of the most useful applications of the data cleaning tool Open Refine (formerly Google Refine) is converting XML and JSON files into spreadsheets that you can interrogate in Excel.

Surprisingly, I’ve never blogged about it. Until now. Continue reading
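
For a flavour of what that conversion involves, here is a minimal Python sketch – the JSON feed is invented for illustration – of the basic idea: pick out the repeated record, then flatten each one into a spreadsheet row.

import csv
import json

# An invented JSON feed: one dictionary per record
feed = """[
  {"name": "Birmingham", "authority_type": "Metropolitan Borough"},
  {"name": "Manchester", "authority_type": "Metropolitan Borough"}
]"""

records = json.loads(feed)

# Each record becomes a row; its keys become the column headers
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)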

SFTW: How to scrape webpages and ask questions with Google Docs and =importXML

XML puzzle cube
Image by dullhunk on Flickr

Here’s another Something for the Weekend post. Last week I wrote a post on how to use the =importFeed formula in Google Docs spreadsheets to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns. Another formula which performs a similar function more powerfully is =importXML.

There are at least 2 distinct journalistic uses for =importXML:

  1. You have found information that is only available in XML format and need to put it into a standard spreadsheet to interrogate it or combine it with other data.
  2. You want to extract some information from a webpage – perhaps on a regular basis – and put that in a structured format (a spreadsheet) so you can more easily ask questions of it.

The first task is the easiest, so I’ll explain how to do that in this post. I’ll use a separate post to explain the latter.

Converting an XML feed into a table

If you have some information in XML format it helps if you have some understanding of how XML is structured. A backgrounder on how to understand XML is covered in this post explaining XML for journalists.

It also helps if you are using a browser which is good at displaying XML pages: Chrome, for example, not only staggers and indents different pieces of information, but also allows you to expand or collapse parts of that, and colours elements, values and attributes (which we’ll come on to below) differently.

Say, for example, you wanted a spreadsheet of UK council data, including latitude, longitude, CIPFA code, and so on – and you found the data, but it was in XML format at a page like this: http://openlylocal.com/councils/all.xml

To pull that into a neatly structured spreadsheet in Google Docs, type the following into the cell where you want the import to begin (try typing in cell A2, leaving the first row free for you to add column headers):

=ImportXML("http://openlylocal.com/councils/all.xml", "//council")

The formula (or, more accurately, function) needs two pieces of information, which are contained in the parentheses and separated by a comma: a web address (URL), and a query. Or, put another way:

=importXML("theURLinQuotationMarks", "theBitWithinTheURLthatYouWant")

The URL is relatively easy – it is the address of the XML file you are reading (it should end in .xml). The query needs some further explanation.

The query tells Google Docs which bit of the XML you want to pull out. It uses a language called XPath – but don’t worry, you will only need to note down a few queries for most purposes.

Here’s an example of part of that XML file shown in the Chrome browser:

XML from OpenlyLocal

The indentation and triangles indicate the way the data is structured. So, the <councils> tag contains at least one item called <council> (if you scrolled down, or clicked on the triangle to collapse <council> you would see there are a few hundred).

And each <council> contains an <address>, <authority-type>, and many other pieces of information.

If you wanted to grab every <council> from this XML file, then, you use the query “//council” as shown above. Think of the // as a replacement for the < in a tag – you are saying: ‘grab the contents of every item that begins <council>’.

You’ll notice that in your spreadsheet where you have typed the formula above, it gathers the contents (called a value) of each tag within <council>, with each tag’s value going into its own column – giving you dozens of columns.

You can continue this logic to look for tags within tags. For example, if you wanted to grab the <name> value from within each <council> tag, you could use:

=ImportXML("http://openlylocal.com/councils/all.xml", "//council//name")

You would then only have one column, containing the names of all the councils – if that’s all you wanted. You could of course adapt the formula again in cell B2 to pull another piece of information. However, you may end up with a mismatch of data where that information is missing – so it’s always better to grab all the XML once, then clean it up on a copy.

If the XML is more complex then you can ask more complex questions – which I’ll cover in the second part of this post. You can also put the URL and/or query in other cells to simplify matters, e.g.

=ImportXML(A1, B1)

Where cell A1 contains http://openlylocal.com/councils/all.xml and B1 contains //council (note the lack of quotation marks). You then only need to change the contents of A1 or B1 to change the results, rather than having to edit the formula directly.
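
If you outgrow spreadsheets, the same query can be run in Python using only the standard library. A rough sketch – assuming the OpenlyLocal URL still resolves – looks like this:

import urllib.request
import xml.etree.ElementTree as ET

# The same two ingredients as the formula: a URL and a query
url = "http://openlylocal.com/councils/all.xml"
root = ET.fromstring(urllib.request.urlopen(url).read())

# The equivalent of the "//council//name" query above
for council in root.findall(".//council"):
    name = council.find("name")
    if name is not None:
        print(name.text)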

If you’ve any other examples, ideas or corrections, let me know. Meanwhile, I’ve published an example spreadsheet demonstrating all the above techniques here.

The style challenge

Odd one out - image by Cliff Muller
Spot the odd one out. Image by Cliff Muller

Time was when a journalist could learn one or two writing styles and stick with them.

They might command enormous respect for being the best at what they did. But sometimes, when that journalist moved to another employer, their style became incongruous.

And they couldn’t change.

This is the style challenge, and it’s one that has become increasingly demanding for journalists in an online age. Because not only must they be able to adapt their style for different types of reporting; not only must they be able to adapt for different brands; not only must they be able to adapt their style within different brands across multiple media; but they must also be able to adapt their style within a single medium, across multiple platforms: Twitter, Facebook, blogs, Flickr, YouTube, or anywhere else that their audiences gather.

Immersion and language

Style is a fundamental skill in journalism. It is difficult to teach, because it relies on an individual immersing themselves in media, and doing so in a way that goes beyond each message to the medium itself.

This is why journalism tutors urge their students so strongly to read as many newspapers as they can; to watch the news and listen to it, obsessively.

Without immersion it is difficult to speak any language.

Now. Some people do immerse themselves and have a handle on current affairs. That’s useful, but not the point. Some do it and gain an understanding of institutions and audiences (that one is left-leaning; this one is conservative with a small c, etc.). This is also useful, but also not the point.

The point is about how each institution addresses each audience, and when.

Despite journalists and editors often having an intuitive understanding of this difference in print or broadcast, over the last decade they’ve often demonstrated an inability to apply the same principles when it comes to publishing online.

And so we’ve had shovelware: organisations republishing print articles online without any changes.

We’ve had opinion columns published as blogs because ‘blogs are all about opinion’.

And we’ve had journalists treating Twitter as just another newswire to throw out headlines.

This is like a person’s first attempt at a radio broadcast where they begin by addressing “Hey all you out there” as if they’re a Balearic DJ.

Good journalists should know better.

Style serves communication

Among many other things a good journalism or media degree should teach not just the practical skills of journalism but an intellectual understanding of communication, and by extension, style.

Because style is, at its base, about communication. It is about register: understanding what tone to adopt based on who you are talking to, what you are talking about, the relationship you seek to engender, and the history behind that.

As communication channels and tools proliferate, we probably need to pay more attention to that.

Journalists are being asked to adapt their skills from print to video; from formal articles to informal blog posts; from Facebook Page updates to tweets.

They are having to learn new styles of liveblogging, audio slideshows, mapping and apps; to operate within the formal restrictions of XML or SEO.

For freelance journalists, commissioning briefs increasingly ask for that flexibility even within the same piece of work, offering extra payments for an online version, a structured version, a podcast, and so on.

These requests are often quite basic – requiring a list of links for an online version, for example – but as content management systems become more sophisticated, those conditions will become more stringent: supplying an XML file with data on a product being reviewed, for example, or a version optimised for search.

What complicates things further is that, for many of these platforms, we are inventing the language as we speak it.

For those new to the platform, it can be intimidating. But for those who invest time in gaining experience, it is an enormous opportunity. Because those who master the style of a blog, or Facebook, or Twitter, or addressing a particular group on Flickr, or a YouTube community, put themselves in an incredible position, building networks that a small magazine publisher would die for.

That’s why style is so important – now more than ever, and in the future more than now.

Data for journalists: understanding XML and RSS

If you are working with data chances are that sooner or later you will come across XML – or if you don’t, then, well, you should do. Really.

There are some very useful resources in XML format – and in RSS, which is based on XML – from ongoing feeds and static reference files to XML that is provided in response to a question that you ask. All of that is for future posts – this post attempts to explain how XML is relevant to journalism, and how it is made up.

What is XML?

XML is a language used for describing information, which makes it particularly relevant to journalists – especially when it comes to interrogating large sets of data.

If you wanted to know how many doctors were privately educated, or what the most common score was in the Premiership last season, or which documents were authored by a particular civil servant, then XML may be useful to you. Continue reading
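
To make that concrete: once information is described in XML, each fact sits in its own tag, and a question like the first one above becomes a matter of counting. A toy sketch in Python, with entirely invented data:

import xml.etree.ElementTree as ET

# Entirely invented data describing three doctors
xml_data = """<doctors>
  <doctor><name>A. Smith</name><education>private</education></doctor>
  <doctor><name>B. Jones</name><education>state</education></doctor>
  <doctor><name>C. Patel</name><education>private</education></doctor>
</doctors>"""

root = ET.fromstring(xml_data)

# "How many doctors were privately educated?" becomes a count
private = [d for d in root.findall("doctor")
           if d.find("education").text == "private"]
print(len(private))  # 2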

Games, systems and context in journalism at News Rewired

I went to News Rewired on Thursday, along with dozens of other journalists and folk concerned in various ways with news production. Some threads that ran through the day for me were discussions of how we publish our data (and allow others to do the same), how we link our stories together with each other and the rest of the web, and how we can help our readers to explore context around our stories.

Continue reading

Could Mashlogic be the answer to infoglut in the Web 2.0 world?

Combating information overload in the Internet age can be a tricky thing. The reader is often overwhelmed with the plethora of Web sites and news portals, and the publisher has to come up with a way to retain loyal users who will stick to their brand even while they are taken from hyperlink to hyperlink through an endless loop of news stories on a singular topic of interest.

Mashlogic, a tool that allows users to personalize their Web searches and define information on their own terms, promises to change that. The site assures readers that it can bring relief to their “RSS indigestion” woes in the Internet age.

Consumer version

In addition to allowing the user to choose his or her most trusted sources of news on the Web, the consumer version of Mashlogic, which can be downloaded as a plugin for the Firefox or IE browser, permits readers to outline topics of interest in order to adapt Web-surfing to their needs.

“Mashlogic adds a layer of contextual information to casual viewing experience on a Web site,” says John Bryan, vice president of business development.

Users can go to the Mashlogic site and build their own “mashes.” Here, they can customize source feeds, which may include everything from brand names such as the Guardian or the New York Times, to aggregate mixes, which may incorporate celebrity news and sports teams they follow, and content from bloggers and tweeters. Everything from Wikipedia definitions to LinkedIn profiles of people mentioned in articles can be tracked based on a user’s interest. Mashlogic also allows readers to highlight and choose sources and order them based on their priorities. Little wonder, then, that TechCrunch is calling it a “Swiss Army Knife for hyperlinks.” Behind the scenes, the tool scans RSS and XML feeds from the chosen sites for “strings of words” in Web pages based on the user’s pre-selected choices.

2Stay Tool

Internet readers trying to distill information overload on the Web aren’t the only ones who can take advantage of Mashlogic. Companies and news sites that are interested in preserving their brand, retaining readers and generating page views and revenue can utilize the company’s more recent tool, aptly named “2Stay”.

Here, the publisher takes a few lines of JavaScript and embeds them on a page. When the tool looks for matching terms on a site that has this embedded script, a branded box alerting the reader to relevant articles from the site itself will pop up as the user drags his cursor over specific terms. It gives site owners a way to let users navigate news on their site without having to rely on search engines, which can often turn up irrelevant information from untrusted sources. The technology works on two levels – it looks at direct tags, which would redirect the reader to articles based solely on words or phrases, and also contextually scans tags around a term, yielding associated tags, and hence secondary stories. This not only prompts the user to stay on a site longer, but also directs traffic to more popular – and hence, more profitable – parts of a Web site.

“It keeps people on the site for longer and allows people to navigate around a site. It’s a way of drilling down archival content,” says Bryan. “What’s really cool about it from the publisher’s perspective is that we have the ability to drive people from a low cpm area to a high cpm area.”

When I ask him how this is different from the “most popular” or “most commented” articles that most sites showcase, Bryan reminds me that it’s not a contest: “We don’t see Mashlogic as being a replacement to any of the other tools that you have on your site.”

Nevertheless, he is quick to point out that a lot of such lists are usually buried at the end of an article on conventional Web sites, or that they often take a reader through a maze of related stories, without the option of going back to the original article. The Mashlogic tool, on the other hand, opens up relevant stories in different tabs, aiding the horizontal reading experience, literally.

“What we offer the user is a way of quickly finding the associated article without leaving the page.” The tool is also intuitive in the sense that it recognizes terms that would be of interest to the user, and the longer time one spends on a site, the deeper it starts to reference buried content.

One of the places this technology works best, according to Bryan, is in the case of celebrity news. As if to reinforce this point he shows me how you can follow stories tagged with Indianapolis football star Peyton Manning on the citizen sports site, Bleacher Report. Merely moving the cursor over the quarterback’s name prompts a callout, which gleans Manning stories from all around the site – a list that includes everything from his team’s latest victory to his place on the NFL power rankings.

But could this excess of Peyton Manning news, so characteristic of niche information and fragmented audiences in the online world, carry with it the very real danger of obscuring the more important news items? Would this entice readers to spend too much time on Manning and too little on the health care bill, for instance?

“I’d like to think they’d use it for both,” says Bryan. In the age of democratization of the Web, the user should indeed be able to choose what he reads and where he reads it. And Mashlogic allows him to do this well. If, in fact, a user were interested in healthcare, the technology would allow him to access the leading magazines, sites, blogs, forums and even tweets on the topic, to create a 360-degree view. “Mashlogic does that better than anybody else because we would scour all the sources that you said you trusted or wanted to reference,” Bryan says.

2Go Tool

The company’s third product, “2Go”, is for the ultimate brand fanatic. The brand can be anything from a preferred site to a favorite sports team or celebrity, or even a topic of interest. Readers would be required to download a button from their chosen sites, which would offer one-click access from anywhere on the Web.

Hence, 2Go is for the reader what 2Stay is for the publisher. “As a user, I have opted in to have the ability to jump back, never be more than one click away from my favorite site,” explains Bryan.

Sure enough, as we traverse the ESPN site for news, a Bleacher Report-branded callout pops up, with related stories on B/R, ready to take the B/R fan back to his preferred source with one click. Mashlogic is currently in negotiations with about ten companies to install this tool, and according to Bryan, it’s being pretty well received.

Thus, what the three technologies being offered collectively do is adapt a reader’s experience to his preferences while allowing publishers to retain their most loyal users on their sites. “Mashlogic does not affect the way a site works, in any shape or form, the site works just the way it works,” Bryan says, as he closes an annoying popup ad.

The company has developed a pretty savvy e-commerce strategy for revenue generation. Any references to books or music in articles can directly take the user to the Amazon or iTunes site to purchase a specific item. The technology is also cleverly using third party sites to play sample music for the user, before he chooses to buy it. The feature can reference video, audio and text URLs. Hence, an NPR callout can jump the reader straight to a podcast from their broadcasts. Bryan also envisions having the callouts sponsored by advertisers. What would be more apt than having a Clorox callout advising a reader about environmentally-friendly Green Works products as he reads about the H1N1 virus, he reasons.

Too much distraction, perhaps? In an Internet age where readers are already in danger of encountering endlessly tantalizing hyperlinks, one too many sidebars, and interactive rich-media advertising, do they need more? But, on the other hand, don’t you want to be alerted to that contextual piece on Sarah Palin, as you glimpse through an article about her latest gaffe on a news show?

“A lot of the content, which is still very relevant, tends to fall off the radar due to breaking news stories; it’s still pretty relevant, it’s just not current,” as Bryan points out. Mashlogic has the potential to combat the low attention span of the Internet age and bring that content to readers’ attention. In addition, it can provide them with the hundred and seventy-sixth article on Jon and Kate that they may have missed. What’s not to love about that?