About six months ago, a friend of mine released a new search engine called Duckduckgo. Duckduckgo was based on the much-hyped (free) Yahoo BOSS search engine platform. It was well received and now serves hundreds of thousands of searches a day.
Yahoo recently announced BOSS was going to be a paid-for service – surprising a lot of developers. When you’ve built a popular (albeit non-profitable) service on a free platform, and that platform suddenly becomes rather expensive – that eats into your ramen budget.
So when various news agencies announced content delivery developer platforms, I was particularly interested in where they were headed.
There are various services – some free, some paid-for – that developers can use to extract content and valuable information from news agencies. My friend was developing a web application that took content from The Guardian, and automatically printed a bespoke newspaper each day about your favourite topics. He expressed displeasure about The Guardian restricting developers from doing this:
“You will not: Use Open Guardian Platform Content in any printed format”
We’re entering a new age of restrictions and jumping through hoops and loopholes to make awesome content platforms for users.
There are three top platforms for news content which I explore below. I’ll discuss what you can and can’t do technically.
In the future I’ll attempt to translate each platform’s terms of service into understandable answers to questions like “Can I legally do X?”
NYTimes Developer Network
The primary NYTimes API is the Article Search API. It allows you to extract headlines, abstracts and multimedia from any news article published since 1981. You can’t, unlike The Guardian, extract full content – only leading paragraphs or extracts.
The TimesTags API exposes the taxonomy of tags that the NYTimes uses internally, along with the relationships between them. You can easily find the tags related to an article or to another tag.
There is significant value in these taxonomies: if you wanted to aggregate any content that discussed the Boston Celtics but didn’t necessarily include the phrase “Boston Celtics” you’d need to use a tag taxonomy – and these are a real headache to compile from scratch.
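To make the Boston Celtics example concrete, here is a minimal sketch in Python. The articles, the tag names and the `RELATED_TAGS` mapping are all made up for illustration; this is not the TimesTags data or its API. It only shows why a tag lookup catches an article that a phrase search misses.

```python
# Hypothetical data: none of these tags or headlines come from the NYTimes API.
RELATED_TAGS = {
    "Boston Celtics": {"Boston Celtics", "Paul Pierce", "TD Garden"},
}

ARTICLES = [
    {"headline": "Boston Celtics win at home", "tags": {"Boston Celtics"}},
    {"headline": "Pierce scores 30 in overtime thriller", "tags": {"Paul Pierce"}},
    {"headline": "Red Sox open spring training", "tags": {"Boston Red Sox"}},
]


def phrase_search(articles, phrase):
    """Naive search: match only when the phrase appears in the headline."""
    needle = phrase.lower()
    return [a["headline"] for a in articles if needle in a["headline"].lower()]


def tag_search(articles, topic, related_tags):
    """Taxonomy search: match any article sharing a tag with the topic."""
    wanted = related_tags.get(topic, {topic})
    return [a["headline"] for a in articles if a["tags"] & wanted]


by_phrase = phrase_search(ARTICLES, "Boston Celtics")
by_tag = tag_search(ARTICLES, "Boston Celtics", RELATED_TAGS)
```

Here `by_phrase` finds only the first headline, while `by_tag` also picks up the Pierce story. The hard part in practice is the `RELATED_TAGS` mapping itself, which is exactly what the publisher has already compiled for you.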
The final, and most socially exciting, API is the TimesPeople API. This allows you to look up the tens of millions of daily users on the NYTimes.com website and extract information about their activity on the site: whether they’ve recommended articles, whether they’ve rated stories, and profile information such as age and location.
The Guardian Open Platform
When The Guardian says “Open” they really mean it. You can extract full articles, their author and category. The caveat, naturally, is that in the future you’ll have to join The Guardian advertising network, or pay for the content privilege. It turns out that journalists do have to eat. Outrageous.
The Guardian has also spent significant time creating beautiful tag taxonomies for their content. Running a query for listings of tags about “Obama” returns four tags: Obama Administration, Barack Obama, Obama Inauguration and Michelle Obama. You can then extract the articles for a tag, and further filter them via additional tags or dates.
If you wanted all articles written by Charlie Brooker about ponies dated before April 2007 this would be simple, thanks to The Guardian.
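As a rough illustration of that kind of query, the sketch below filters a list of article records by author, tag and date. It runs against in-memory data rather than the Guardian API, and the record fields (`title`, `author`, `tags`, `published`) and the sample articles are assumptions invented for this example.

```python
from datetime import date

# Hypothetical records standing in for API results; the field names are assumed.
ARTICLES = [
    {"title": "Article A", "author": "Charlie Brooker",
     "tags": {"ponies", "television"}, "published": date(2007, 3, 12)},
    {"title": "Article B", "author": "Charlie Brooker",
     "tags": {"ponies"}, "published": date(2007, 5, 2)},
    {"title": "Article C", "author": "Charlie Brooker",
     "tags": {"television"}, "published": date(2007, 2, 1)},
    {"title": "Article D", "author": "Another Writer",
     "tags": {"ponies", "travel"}, "published": date(2007, 1, 20)},
]


def filter_articles(articles, author=None, tag=None, before=None):
    """Keep the articles that pass every filter that was supplied."""
    results = []
    for article in articles:
        if author is not None and article["author"] != author:
            continue
        if tag is not None and tag not in article["tags"]:
            continue
        if before is not None and article["published"] >= before:
            continue
        results.append(article)
    return results


matches = filter_articles(
    ARTICLES, author="Charlie Brooker", tag="ponies", before=date(2007, 4, 1)
)
titles = [a["title"] for a in matches]
```

Only "Article A" survives all three filters. With a real platform you would hand the author, tag and date constraints to the API and let it do this work server-side.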
An interesting caveat is that you must not store any content for longer than 24 hours. That isn’t as large an issue as it sounds if you consider how irrelevant most news is after 24 hours, and that you can still pull the content again if a user requests it after your copy has expired.
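One way to live with a 24-hour limit is a cache that timestamps every entry and fetches the content again once the copy is too old. The sketch below is a minimal version of that idea. The `ExpiringCache` class and the `fetch` callable are hypothetical names made up here, and whether this approach satisfies the actual terms is a question for the terms themselves, not for this code.

```python
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour storage limit described above


class ExpiringCache:
    """Stores content with a timestamp and refetches it once it is too old."""

    def __init__(self, fetch, max_age=MAX_AGE_SECONDS, clock=time.time):
        self._fetch = fetch      # hypothetical callable: article_id -> content
        self._max_age = max_age
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # article_id -> (stored_at, content)

    def get(self, article_id):
        now = self._clock()
        entry = self._entries.get(article_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        # Missing or expired: drop any stale copy and pull the content again.
        self._entries.pop(article_id, None)
        content = self._fetch(article_id)
        self._entries[article_id] = (now, content)
        return content

    def purge_expired(self):
        """Delete every entry older than the limit, even if nobody asks for it."""
        now = self._clock()
        stale = [key for key, (stored_at, _) in self._entries.items()
                 if now - stored_at >= self._max_age]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Calling `get` alone only replaces stale entries when someone asks for them, so a scheduled call to `purge_expired` is what actually removes old content that nobody requests again.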
Further content you can syndicate includes polls, competitions and the rich interactive Flash content The Guardian has on its site.
Daylife
Daylife is the daddy of content developer platforms. A great example of what you can achieve via Daylife is Zemanta, the related-content blogging utility: by utilizing the Daylife image library, it can pull in high quality, free-to-use images relevant to what you’re writing about.
Daylife is heavily used by web agencies working for media organizations to create sections of content – for example, a one-page overview of Barack Obama that references content from your own news agency, from Wikipedia and YouTube, and related content from other news agencies.
The massive benefit of these content platforms over traditional “scraping” or RSS feeds is that the data is wrapped in information that’ll make it trivial to create a hierarchy of content, find related content and build up “value added content” that readers will demand if you’re trying to persuade them not to go to the original news provider website.
The massive downside is that you’re at the mercy of the news publishers and their developer frameworks. There is no such thing as a free lunch, especially if you’re syndicating content from a news publisher whose revenues have decreased more than 10% in the past year. You’re going to be expected to drive them traffic and subsidize their printing presses.
What are your favourite mashups of the content platforms you’ve seen so far?