Brion,
The correct delay instruction for robots.txt is "Crawl-delay:". I had a
typo in my previous message. So the appearance in a robots file would be:
User-agent: Slurp
Disallow: /whatever-paths-should-be-excluded
Crawl-delay: 20
# 20 second delay between hits
Will wait awhile so you have a chance to update the robots.txt, then
will schedule a site discovery crawl so we can get your content back
into the index.
Thanks,
Warren
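
For reference, a minimal sketch (not from the original thread) of how a
crawler could read that Crawl-delay value with Python's standard
urllib.robotparser; the robots.txt URL below is only a placeholder:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.wikipedia.org/robots.txt")  # placeholder URL
rp.read()

# crawl_delay() returns the Crawl-delay value for the given user-agent,
# or None if the directive is absent. Crawl-delay: 20 means at most one
# request every 20 seconds, i.e. 3 pages per minute; the 4-pages-per-minute
# default mentioned below corresponds to roughly a 15-second spacing.
delay = rp.crawl_delay("Slurp")

# can_fetch() applies the Disallow rules for the same user-agent.
allowed = rp.can_fetch("Slurp", "https://www.wikipedia.org/some-page")
print(delay, allowed)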
Brion Vibber wrote:
----- Forwarded message from Warren Brown <wbrown(a)inktomi.com> -----
From: "Warren Brown" <wbrown(a)inktomi.com>
Date: Mon, 18 Aug 2003 17:25:08 -0700
To: root(a)wikipedia.org
Subject: Inktomi web crawler
The wikipedia.org server is blocking Inktomi's "Slurp" web crawler by
returning 403 errors for all access attempts. Presumably, this block
was set up because we were crawling the site too aggressively at some
time in the past. We would like to include wikipedia.org content in our
search database, and would be happy to work with you to match whatever
crawling limits you need to set.
Slurp observes /robots.txt rules for user-agent "Slurp". The crawler
access rate is normally limited to 4 pages per minute from a web server;
we can set that rate lower if you require.
That should be fine; I've taken 'Slurp' out of our bot-blocker list.
Thanks for dropping us a note!
The Slurp access rate can also be controlled by a "crawldelay"
instruction in /robots.txt.
<files away for future reference>
It'd be lovely if all spiders supported this. 4 pages per minute is fine
for one or two big spiders, but if fifty search engines are _each_
spidering our entire site at 4 pages per minute, it may be a bit more than
we'd like. :)
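
Back-of-the-envelope, with the hypothetical figures from the previous
paragraph:

crawlers = 50                    # hypothetical number of search engines
pages_per_minute_each = 4        # per-crawler rate quoted above
total = crawlers * pages_per_minute_each
print(total, total / 60.0)       # 200 pages/minute, ~3.3 requests/second site-wide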
I assume the line would go something like:
User-agent: Slurp
crawldelay: 20
# 20 second delay between hits
?
-- brion vibber (brion @ pobox.com)