Brion,
The correct delay instruction for robots.txt is "Crawl-delay:". I had a
typo in my previous message. So the appearance in a robots file would be:
User-agent: Slurp
Disallow: /whatever-paths-should-be-excluded
Crawl-delay: 20
# 20 second delay between hits
Will wait awhile so you have a chance to update the robots.txt, then
will schedule a site discovery crawl so we can get your content back
into the index.
Thanks,
Warren
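
For reference, a minimal sketch (not from the original thread) of how a
crawler could read that Crawl-delay value with Python's standard
urllib.robotparser; the robots.txt URL below is only a placeholder:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.wikipedia.org/robots.txt")  # placeholder URL
rp.read()

# crawl_delay() returns the Crawl-delay value for the given user-agent,
# or None if the directive is absent. Crawl-delay: 20 means at most one
# request every 20 seconds, i.e. 3 pages per minute; the 4-pages-per-minute
# default mentioned below corresponds to roughly a 15-second spacing.
delay = rp.crawl_delay("Slurp")

# can_fetch() applies the Disallow rules for the same user-agent.
allowed = rp.can_fetch("Slurp", "https://www.wikipedia.org/some-page")
print(delay, allowed)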
Brion Vibber wrote:
----- Forwarded message from Warren Brown <wbrown(a)inktomi.com> -----
From: "Warren Brown" <wbrown(a)inktomi.com>
Date: Mon, 18 Aug 2003 17:25:08 -0700
To: root(a)wikipedia.org
Subject: Inktomi web crawler
The wikipedia.org server is blocking Inktomi's "Slurp" web crawler by
returning 403 errors for all access attempts. Presumably, this block
was set up because we were crawling the site too aggressively at some
time in the past. We would like to include wikipedia.org content in our
search database, and would be happy to work with you to match whatever
crawling limits you need to set.
Slurp observes /robots.txt rules for user-agent "Slurp". The crawler
access rate is normally limited to 4 pages per minute from a web server;
we can set that rate lower if you require.
That should be fine; I've taken 'Slurp' out of our bot-blocker list.
Thanks for dropping us a note!
The Slurp access rate can also be controlled by a "crawldelay"
instruction in /robots.txt.
<files away for future reference>
It'd be lovely if all spiders supported this. 4 pages per minute is fine
for one or two big spiders, but if fifty search engines are _each_
spidering our entire site at 4 pages per minute, it may be a bit more than
we'd like. :)
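
Back-of-the-envelope, with the hypothetical figures from the previous
paragraph:

crawlers = 50                    # hypothetical number of search engines
pages_per_minute_each = 4        # per-crawler rate quoted above
total = crawlers * pages_per_minute_each
print(total, total / 60.0)       # 200 pages/minute, ~3.3 requests/second site-wide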
I assume the line would go something like:
User-agent: Slurp
crawldelay: 20
# 20 second delay between hits
?
-- brion vibber (brion @ pobox.com)