On 10/26/07, Anthony <wikimail(a)inbox.org> wrote:
Have Google and Yahoo been informed of this policy?
No, since they're our number-one referrers.
BTW, that talks about articles, not images. And it contradicts
robots.txt, especially "## we're disabling this experimentally
11-09-2006\n#Crawl-delay: 1".
It seems to stem from something said on the Village Pump back in 2003.
I for one am going to go with robots.txt, not something someone said
on some Wikipedia page.
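If you do want to go by robots.txt, the standard library will read it for you. A minimal sketch (the rules below are a made-up stand-in, not Wikipedia's actual file) showing that a commented-out Crawl-delay, as in the excerpt quoted above, simply doesn't exist as far as a parser is concerned:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /w/
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Article pages are allowed; /w/ (the script path) is not.
print(rp.can_fetch("MyBot/1.0", "https://en.wikipedia.org/wiki/Main_Page"))  # True
print(rp.can_fetch("MyBot/1.0", "https://en.wikipedia.org/w/index.php"))     # False

# The Crawl-delay line is commented out, so no delay is in effect.
print(rp.crawl_delay("MyBot/1.0"))  # None
```

In practice you'd call rp.set_url(...) and rp.read() against the live file instead of parsing a string.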
I believe a more accurate story would be as follows:
1) Live mirrors of the site, however big or small, are discouraged
without prior agreement. You're supposed to use the dumps for this.
If you want to provide some kind of useful value-added "gateway" or
framing that, for instance, marks up the pages in some useful way,
*and* you very clearly acknowledge the source and give a link, *and*
you don't run ads or similar, *and* you don't use too much bandwidth,
that's probably fine (although it's best to ask first). If you don't
meet the preceding conditions, you may be asked to pay a fee for the
mirroring service, or face blocking.
2) Anything that uses enough server resources to slow down the site
will probably be blocked or killed if it's noticed. In the old days
this was a concern, but nowadays it's probably not.
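Being courteous about server resources mostly comes down to spacing out your requests. A minimal sketch of that idea; the function name and the one-second default are my own assumptions, not any official figure:

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL via fetch(url), sleeping between requests.

    delay_seconds is a guess at a polite pace, not a documented limit;
    fetch is any callable that takes a URL and returns a result.
    """
    results = []
    for i, url in enumerate(urls):
        if i:  # no sleep needed before the very first request
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With a real HTTP client plugged in as `fetch`, this keeps a crawl slow enough that it's unlikely to be noticed, let alone blocked.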
There was a page I once saw where someone had put up the statement
that bots should only request pages once every ten seconds or
something. When I looked in the histories, I saw that Brion had added
it in like 2003, along with a description of the hardware Wikipedia
was being run on: a single server with one Pentium CPU. Later someone
removed the part of that edit with the grossly-outdated server
description, but neglected to remove the by then ludicrous blanket
restriction on crawlers.
Anyway, it comes down to this: it's always courteous to ask, but if
you don't cause any actual damage, probably nobody will notice or care.
Don't take that as any official party line (I'm not a sysadmin), but
that seems to hold as far as I can tell.