The OJB guide to open news APIs – part 1: Guardian, NYT and Daylife

In the first of a series, Peter Clark, founder of Broadersheet, takes a look at three of the leading APIs for people looking to build news-based web projects and mashups.

About six months ago, a friend of mine released a new search engine called Duckduckgo. It was built on the much-hyped (and free) Yahoo BOSS search engine platform, was well received, and now serves hundreds of thousands of searches a day.

Yahoo recently announced BOSS was going to be a paid-for service – surprising a lot of developers. When you’ve built a popular (albeit non-profitable) service on a free platform, and that platform suddenly becomes rather expensive – that eats into your ramen budget.

So when various news agencies announced content delivery developer platforms, I was particularly interested in where they were headed.

There are various services – some free, some paid-for – that developers can use to extract content and valuable information from news agencies. My friend was developing a web application that took content from The Guardian, and automatically printed a bespoke newspaper each day about your favourite topics. He expressed displeasure about The Guardian restricting developers from doing this:

“You will not: Use Open Guardian Platform Content in any printed format”

We’re entering a new age of restrictions and jumping through hoops and loopholes to make awesome content platforms for users.

There are three top platforms for news content which I explore below. I’ll discuss what you can and can’t do technically.

In the future I’ll attempt to translate terms of service into understandable answers to questions like “Can I legally do X?”

NYTimes Developer Network

The primary NYTimes API is the Article Search API. It allows you to extract headlines, abstracts and multimedia from any news article published since 1981. Unlike with The Guardian, you can’t extract full content – only leading paragraphs or extracts.

The TimesTags API is an internal-to-NYTimes taxonomy of tags and their relationships. You can easily find related tags to an article or tag.

There is significant value in these taxonomies: if you wanted to aggregate any content that discussed the Boston Celtics but didn’t necessarily include the phrase “Boston Celtics” you’d need to use a tag taxonomy – and these are a real headache to compile from scratch.
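
To make the point concrete, here is a minimal sketch of tag-based aggregation. It does not call any real API: the `Article` type, the `RELATED_TAGS` table and the sample articles are all invented for illustration, standing in for whatever a publisher's taxonomy would return.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    headline: str
    body: str
    tags: set[str] = field(default_factory=set)


# Hypothetical taxonomy: each tag maps to the tags a publisher treats as related.
RELATED_TAGS = {
    "Boston Celtics": {"Kevin Garnett", "TD Garden"},
}


def expand_tag(tag: str) -> set[str]:
    """Return the tag itself plus its directly related tags."""
    return {tag} | RELATED_TAGS.get(tag, set())


def articles_about(tag: str, articles: list[Article]) -> list[Article]:
    """Select articles carrying the tag or any related tag."""
    wanted = expand_tag(tag)
    return [a for a in articles if a.tags & wanted]


# Invented sample data.
ARTICLES = [
    Article("Garnett injury clouds playoff hopes", "The forward sat out again.", {"Kevin Garnett"}),
    Article("Celtics win in overtime", "Boston Celtics edge the Bulls.", {"Boston Celtics"}),
    Article("Red Sox open at Fenway", "Opening day in Boston.", {"Boston Red Sox"}),
]

# Tag matching finds the Garnett story; a plain phrase search does not.
matches = articles_about("Boston Celtics", ARTICLES)
phrase_only = [a for a in ARTICLES if "Boston Celtics" in a.headline + " " + a.body]
```

The tag lookup picks up the article that never mentions the team by name, which is exactly the case a phrase search misses.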

The final, and most socially exciting, API is the TimesPeople API. This allows you to look up the tens of millions of daily users on the NYTimes.com website and extract information about their activity on the site: whether they’ve recommended articles or rated stories, plus profile information such as age and location.

The Guardian Open Platform

When The Guardian says “Open” they really mean it. You can extract full articles, their author and category. The caveat, naturally, is that in the future you’ll have to join The Guardian advertising network, or pay for the content privilege. It turns out that journalists do have to eat. Outrageous.

The Guardian has also spent significant time creating beautiful tag taxonomies for their content. Running a query for listings of tags about “Obama” returns 4: Obama Administration, Barack Obama, Obama Inauguration and Michelle Obama. You can then extract the articles, and further filter them via additional tags or dates.

If you wanted all articles written by Charlie Brooker about ponies dated before April 2007 this would be simple, thanks to The Guardian.
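
As a rough sketch of what such a filtered request might look like, the snippet below assembles a query URL from a contributor, a tag and a cut-off date. The endpoint, the parameter names (`api-key`, `filter`, `before`) and the tag identifiers are assumptions made up for this example, not The Guardian's documented interface; check the Open Platform documentation for the real ones.

```python
from datetime import date
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint, NOT the real Open Platform address.
BASE_URL = "https://api.example.invalid/search"


def build_query(api_key, contributor=None, tags=(), before=None):
    """Build a search URL from optional contributor, tag and date filters.

    Every parameter name used here is an illustrative assumption.
    """
    params = {"api-key": api_key}
    filters = list(tags)
    if contributor:
        # Hypothetical convention: contributors are addressed like tags.
        filters.append(f"profile/{contributor}")
    if filters:
        params["filter"] = ",".join(filters)
    if before:
        params["before"] = before.isoformat()
    return f"{BASE_URL}?{urlencode(params)}"


# Charlie Brooker on ponies, dated before April 2007.
url = build_query(
    api_key="YOUR-KEY",
    contributor="charliebrooker",
    tags=["animals/ponies"],
    before=date(2007, 4, 1),
)
query = parse_qs(urlsplit(url).query)
```

Combining tags and dates in one request is what keeps this kind of query simple: the filtering happens on the publisher's side rather than in your own code.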

An interesting caveat is that you must not store any content for longer than 24 hours. That isn’t as large an issue as it sounds if you consider how irrelevant news is after 24 hours, and that you can still pull the content again if a user requests it after it expires.
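
One way to live with a limit like this is a cache that discards anything older than 24 hours and fetches it again on demand. The sketch below assumes a `fetch` function you supply yourself (for example, a wrapper around an API call); the class and its names are hypothetical and not part of any Guardian library.

```python
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour storage limit described above


class ExpiringCache:
    """Keep fetched content for at most max_age seconds, then re-fetch."""

    def __init__(self, fetch, max_age=MAX_AGE_SECONDS, clock=time.time):
        self._fetch = fetch      # callable: article_id -> content
        self._max_age = max_age
        self._clock = clock      # injectable so tests can fake the time
        self._store = {}         # article_id -> (stored_at, content)

    def get(self, article_id):
        """Return cached content if still fresh, otherwise fetch it again."""
        now = self._clock()
        entry = self._store.get(article_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        content = self._fetch(article_id)
        self._store[article_id] = (now, content)
        return content

    def purge(self):
        """Delete every entry that has reached the age limit; return the count."""
        now = self._clock()
        expired = [k for k, (stored_at, _) in self._store.items()
                   if now - stored_at >= self._max_age]
        for key in expired:
            del self._store[key]
        return len(expired)
```

Calling `purge()` on a schedule matters as much as the freshness check in `get()`: an article nobody asks for again would otherwise sit in storage past the limit.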

Further content you can syndicate includes polls, competitions and the rich interactive flash content The Guardian has on its site.

Daylife

Daylife is the daddy of content developer platforms. A great example of what you can achieve via Daylife is Zemanta’s related-content blogging utility: by using the Daylife image library, it can pull in high quality, free-to-use images relevant to what you’re writing about.

Daylife is heavily used by web agencies for media organizations to create sections of content – a one page overview of Barack Obama, referencing content from your news agency, from Wikipedia, Youtube, and related content from other news agencies.

The massive benefit of these content platforms over traditional “scraping” or RSS feeds is that the data is wrapped in information that’ll make it trivial to create a hierarchy of content, find related content and build up “value added content” that readers will demand if you’re trying to persuade them not to go to the original news provider website.

The massive downside is that you’re at the mercy of the news publishers and their developer frameworks. There is no such thing as a free lunch, especially if you’re syndicating content from a news publisher whose revenues have decreased more than 10% in the past year. You’re going to be expected to drive them traffic and subsidize their printing presses.

What are your favourite mashups of the content platforms you’ve seen so far?

14 thoughts on “The OJB guide to open news APIs – part 1: Guardian, NYT and Daylife”

  1. Andraz Tori

    Hi Paul,

    very good post on all these APIs. When looking at them this is a great first step. But from the standpoint of a developer [who has to choose what to use] the question is: why can’t these guys get along?

    Why does every single one of them have to have their own taxonomy and not even provide _any_ mappings to anyone else? Why should I code against their de-facto closed data?

    Why do I have to learn three different types of APIs instead of having just one library and query language [think SQL for all the data you mention]?

    There are really big taxonomies and lists of entities that are already out there and which all three mentioned companies (NYT, Guardian, Daylife) could use as a “standard vocabulary” – Wikipedia, Freebase, DBPedia [doesn’t matter which one, since developers have access to mappings between them].

    When they realize that their data becomes so much more valuable when it is linked to existing big vocabularies, then wonderful things become possible. Right now there is not actually much built by developers on top of those APIs [apart from the demo projects], exactly because they are ‘data islands’.

    [and this is the first question we ask when someone approaches Zemanta to integrate their content into our suggestion pool: Do you have your content correlated with any of the big vocabularies?]

    I hope the guys start moving into more connected-data direction.

    bye
    Andraz Tori, Zemanta

  2. Peter C

    Andraz, an awesome question. We too (at Broadersheet) would appreciate correlation in all these APIs. It’s ridiculous there is so little (if any) collaboration.

    Are you cats based in London? We’re in Cambridge and would love to hear your thoughts about this over a coffee. (Or at Jee Camp)

  3. Pingback: Plataformas abertas em discussão « Charles Cadé Blog / Comunicação, tecnologia e criatividade

  4. Andraz Tori

    Peter,

    while we are a London-based company, the whole development team and I are in Slovenia (feel free to visit, it’s a great place! 🙂)

    Peter, I am also interested in what you are working on at Broadersheet.

    Generally we try to persuade people to join the Linking Open Data project and link their data with existing data sets. Also, we’re not strong believers in the ‘grand semantic web’ vision, but using URIs as global identifiers and some semi-global vocabulary makes things much easier to integrate.

    Btw: we also offer an API, do you know it?

    bye
    Andraz Tori, Zemanta

  5. Andraz Tori

    Hi Matt,

    yeah, I know that both have left the door open, which is a very good thing! I really hope something great comes out of it eventually! Probably I allowed too much of my momentary frustration to get out in the comment.

    We worked a lot on large scale multi-domain cross-publisher data integration and we know it is not easy. But some heuristics and machine learning can get you very far quickly. We did an initial version of Guardian tags <-> Wikipedia mappings internally…

    About help… I think you already know what our engine can do regarding disambiguation to LOD, finding entities in unstructured text, etc., so we’d be glad to help!

    Andraz Tori, Zemanta

  6. Arjun Ram

    Andraz,

    These are early days for news APIs, and it is actually good that these guys have their own APIs because they are still innovating. This is the time to explore and innovate rather than conforming to a standard API. A classic example is the number of words in an article: I’m not sure which one offers it, but one of them does and the remainder don’t.

    But your bigger point is well taken! They could/should offer a mapping to the linked data space. It will lead to the dream of the semantic web coming true.

    Think of this scenario: find tickets to a metro in India near the ocean that hasn’t had a terrorist attack in 12 months

    Kayak API + Weather API + News API + geonames API

    The News API should very clearly knock Mumbai out of the picture because the terrorism quotient would be high for Mumbai. You think Kayak will not pay a portion of its sales to the News API provider?

    So this is the time to innovate, so let’s not stifle innovation. In fact I would suggest that the media do a better job of standardizing data on their pages, like contributors and co-authors. Way too many of them have no standard way to even do this.

    Back in the day my grandpa used to be able to follow his three select journalists/columnists. Even though technology has progressed so much, it is not trivial for a non-techie person to do the same on their web sites. Something is fundamentally wrong.

    Matt McAlister,
    I would request that you folks open up your taxonomy. I am sure there are some jewels in there that could contribute to the bigger news taxonomy.

    BTW we are exploring all of these ideas in our startup based out of this small town called bangalore 😉

    Thanks for providing a platform for continuing this discussion. While it is painful to see icons of news media go through this turbulence, exciting times lie ahead!

    Cheers,
    Arjun

  7. Pingback: links for 2009-04-23 « Ex Orbite

  8. ken ellis

    Andraz, Peter, actually we do provide mappings to different data sets, although it’s not well advertised. Given one of our topics, either a URL or an ID, you can retrieve Freebase and Wikipedia endpoints. Also, given a Wikipedia entry or a Freebase ID, you can find the matching topic within our system. Right now this covers about 65,000 topics. We also have a module we’re releasing to some clients in the near future that uses Freebase to pull in some structured data and mark up the page with a simple RDFa standard.

    The calls I mentioned above are detailed at the link associated with this post (http://labs.daylife.com/semanticweb.html), and don’t require authentication unlike the free or pay versions of our API.

    Regards,
    Ken Ellis

  9. Pingback: The Guardian kicks off the local data landgrab | Online Journalism Blog

  10. Pingback: The Guardian kicks off the local data landgrab 

  11. Hendrik

    Hi,
    in Germany the discussion about whether big news portals should have open APIs has not even started yet. Instead they are still fighting with G*gle News, claiming that their content has been stolen. The advantage of opening up the platforms through an API has not reached their minds.. unfortunately.

    KR hendrik

  12. Pingback: Plataformas abertas em discussão | Cadé
