Update: Chris Gaither from Google explained how to get removed from Google News while remaining in the main index here and here.
There’s a story in Australia that News Corp. is preparing to sue Google and Yahoo to stop both from linking to, and quoting, News Corp content. It comes as Rupert Murdoch promises to start charging for online content across his company’s news sites.
The suing story has prompted the usual hilarity, with comments such as “if murdoch sues google &amp; yahoo over news rather than use robots.txt file, it’ll be a short, embarrassing lawsuit”. But here’s why Murdoch might have a case (first posted here) …
Robots.txt isn’t a panacea
The usual response to newspapers’ complaints about Google is to say ‘just use robots.txt to keep them out.’ This was Google’s response in its two-fingered salute to the news industry.
However, most people don’t seem to realise that it’s hard to stay out of Google News and remain in the main Google index:
Please keep in mind that the robot we use for Google News, called Googlebot, is the same robot that we use for Google Web Search. This means that any settings you modify for Google News will also apply to Google Web Search. (From Google Support)
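That shared crawler is the crux: with no separate user agent for Google News, the only robots.txt directive a publisher could write to escape Google News also removed them from web search. A sketch of that all-or-nothing rule (the directive is standard robots.txt syntax; the comment states the consequence):

```text
# robots.txt
# Blocking Googlebot removes a site from Google Web Search
# AND Google News, since both are fed by the same crawler --
# there is no news-only user agent to target instead.
User-agent: Googlebot
Disallow: /
```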
There’s a difference between Google News and Google Search
Google Search is a way for a user to enter a term and for Google to show relevant pages. Google News these days looks like a fully fledged news aggregation service – check out its front page and tell me how much it differs from a publisher’s news home page.
Just because publishers are happy to appear in normal search results doesn’t mean they want their content used for free to create a rival news source or product. But there’s no way to use robots.txt – Google’s supposed answer – to draw this distinction.
Google is ignoring ACAP
Publishers have attempted to help Google out with their own protocol, the Automated Content Access Protocol (ACAP) – a way to build on robots.txt and allow better control over how their content is used.
Google won’t implement it, saying: “Our guiding principle is that whatever technical standards we introduce must work for the whole web (big publishers and small), not just for one subset or field”.
But Google already draws a distinction between big and small publishers. I publish a blog, but I’m not allowed in Google News, even though I’m in the main Google index.
Conclusion
I’m not saying that any publisher will actually want to stay out of Google. But robots.txt isn’t the answer to the problem of how publishers get paid for or control access to their content.
That distinction that you say keeps your blog out of Google News isn’t actually much of an issue. Once you know what to do it’s relatively easy to get into Google News. We got The Lichfield Blog on there and pretty much dominate any Google News search for Lichfield now.
I had the blog’s founder do a guest post on it for me: Google news registration is an easy win.
The range-of-contributors condition would keep me out – there’s only me at my blog. (Unlike here on OJB where there are lots of contributors).
Ah my mistake, I thought you were referring to OJB.
Pingback: Raj’s Concepts Reflection & Research Project « Learning Log for Internet Communications
Thanks for that link Philip – have submitted OJB to see what happens. Notice BTW in the technical requirements they require 3 digits in your URLs (and it can’t be the year), which neither yours nor OJB meets.
It’s a bizarre requirement and you’re right, we don’t have it and neither do the Guardian by the looks of it.
The Guardian have been getting away with it for years – I never understood how. In better news all round, the 3-digit requirement is no longer true IF you submit a news sitemap.
Here’s the link that explains the waiver for the 3-digit rule: http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323
And here’s the explanation of news sitemaps: http://www.google.com/support/news_pub/bin/answer.py?answer=74288
There’s a plugin to generate google news sitemaps for wordpress here, although I’ve never used it myself: http://wordpress.org/extend/plugins/google-news-sitemap-generator/
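For readers who haven’t seen one, a Google News sitemap is a standard XML sitemap with Google’s news extension namespace added. The fragment below is a sketch based on Google’s published format – the URL, publication name, date, and title are placeholders, and the exact required tags should be checked against Google’s current documentation:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <!-- Placeholder article URL -->
    <loc>http://example.com/articles/local-story.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2009-11-16</news:publication_date>
      <news:title>Hypothetical article title</news:title>
    </news:news>
  </url>
</urlset>
```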
So, just how do you get a site crawled by the Google News bots?
I’m probably missing something, but Google says it will remove sites people don’t want included, so why is there any talk of suing them or using robots?
http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=94003
Do they refuse to remove some sites?
>Thanks for that link Philip – have submitted OJB to see what happens. Notice BTW in the technical requirements they require 3 digits in your URLs (and it can’t be the year), which neither yours nor OJB meets.
As has been mentioned, if you use a news sitemap you can dodge that criterion.
I think that is perhaps easier for a clearly defined news site such as the Lichfield Blog.
Or it could be sour grapes :-). I’ve tried a couple of times without success.
There’s a plugin for the news sitemap (which I can’t make work), so I use a short PHP script here:
http://www.mattwardman.com/blog/news-sitemap.php
Matt
PS The only political blog I’m aware of that’s in Google News is Slugger.
Pingback: "Paid content" myth sprouts more pay wall ideas | Zombie Journalism
Pingback: Google’s Fast Flip - a cruel joke on the news industry
Pingback: Google Fast Flip – A Cruel Joke? « 18CARATBRASS.COM – WEIRD NEWS AND FUNNYS
Pingback: Google Fast Flip – A Cruel Joke? | AlanSpicer.com