On April 14, Evan Martin wrote:
> To answer your specific proposal:
> 1) http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag:
> <meta name="robots" content="noindex,follow" />
> which indicates it's explicitly disallowed from being crawled.
As far as I understand the robots meta tag, "noindex,follow" tells
robots that they are welcome to fetch the page, that they can find
links to other pages here (= follow), but they should never show
this page among the search hits (= noindex).
Words such as "crawl" and "index" are somewhat fuzzy here. Does
"index" mean fetch, or does it mean store in an index, to be
returned to users as a search hit? I found no clear answer. Of
course, the crawler/robot/spider is already fetching the page when
it sees the meta tag. And it must fetch the page again to see if
the meta tag has changed.
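To make my reading of the tag concrete, here is a minimal sketch in
Python of how a robot might interpret it. The names RobotsMetaParser
and robots_policy are made up for this example, and treating a
missing tag as "index,follow" is my assumption about the usual
default, not something I have verified against any particular
search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        content = attrs.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return what the page allows, assuming index+follow when nothing is said."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        # "index" here means: may be stored and shown as a search hit.
        "index": "noindex" not in parser.directives,
        # "follow" here means: links found on the page may be fetched.
        "follow": "nofollow" not in parser.directives,
    }


print(robots_policy('<meta name="robots" content="noindex,follow" />'))
# {'index': False, 'follow': True}
```

Note that the robot has already fetched the page by the time
robots_policy can run, which is the point made above: the tag
cannot forbid the fetch itself.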
The Pipermail software that is used for the wikitech-l archive
sets "noindex,follow" for the overview sorted by date, e.g.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/date.html
but for each individual posting it sets "index,nofollow", e.g.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034969.html
I believe that "noindex,follow" is used for many "sitemap" pages,
and this is my idea of how search robots should use RecentChanges.
Indeed, the front page of any newspaper website is also similar to
a sitemap. Its content changes so often that it becomes useless
to index it under any specific word found there. If people search
for "hurricane katrina", they don't want the front page of the
Washington Post, which will have changed by the time they arrive.
But they might be interested in the news article about this topic,
and the front page was the way to harvest the link to that
article.
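The "sitemap" use I have in mind could look roughly like the sketch
below, again in Python. It is only an illustration under my own
assumptions: PageScanner and process_page are hypothetical names,
the "index" is a plain dictionary, the crawl queue is a plain list,
and the example URLs are placeholders. A page marked
"noindex,follow" contributes its links to the queue but is never
stored as a search hit; a page marked "index,nofollow" is stored
but its links are ignored.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects robots meta directives and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for token in content.split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def process_page(url, html, index, queue):
    """Handle one already-fetched page according to its robots meta tag."""
    scanner = PageScanner()
    scanner.feed(html)
    if "noindex" not in scanner.directives:
        index[url] = html              # page may appear among the search hits
    if "nofollow" not in scanner.directives:
        queue.extend(scanner.links)    # links may be fetched later


# Hypothetical pages: a fast-changing overview and one article it links to.
overview = ('<meta name="robots" content="noindex,follow" />'
            '<a href="/wiki/Article_A">A</a><a href="/wiki/Article_B">B</a>')
article = '<meta name="robots" content="index,nofollow" /><a href="/other">x</a>'

index, queue = {}, []
process_page("/wiki/Special:Recentchanges", overview, index, queue)
process_page("/wiki/Article_A", article, index, queue)
print(sorted(index))   # ['/wiki/Article_A']
print(queue)           # ['/wiki/Article_A', '/wiki/Article_B']
```

The overview page is used only to harvest links, which is the role
a newspaper front page plays in the comparison above.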
The main difference, then, between the newspaper and Wikipedia is
that the newspaper uses its RecentChanges as its front page.
Another difference is that Wikipedia isn't covered by Google News.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se