Brion Vibber <brion <at> pobox.com> writes:
[parts elided for brevity]
If your site doesn't appear to violate the content license, and
you follow earlier suggestions about caching data (preferably
backed by your own local copy to begin with) and you limit your
hit rate and don't abuse our search engine or other tools, you
probably won't get blocked.
I have been scrupulously careful about the license terms, and even
provide a link (and suggestion text for it), on every page, to the
Wikimedia Foundation fund-raising page. (Incidentally, I repeat
that this is not a dead-exact mirror, in that each topic page also
includes the dmoz links, if any can be found, for that topic, thus
providing--so far as I can determine--a unique value-added page.)
I also have in place (besides a Crawl-delay: 5 statement in the
robots.txt file) a throttler that blocks any entity trying to
exceed that 5-second limitation for longer than a half minute or
so, so as to keep ill-behaved searchbots (or email harvesters) from
rapid attacks.
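
For anyone curious, a throttler of the kind described above could be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the code actually running on the site: the names
Throttler, ClientState and allow() are hypothetical, the 5-second and
30-second figures are taken from the description above, and a client
is identified by whatever string the caller passes in (an IP address,
say).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClientState:
    last_seen: float                          # time of the client's previous request
    violating_since: Optional[float] = None   # start of the current too-fast streak
    blocked: bool = False


class Throttler:
    """Blocks clients that keep requesting faster than min_interval
    for longer than grace seconds (hypothetical sketch)."""

    def __init__(self, min_interval: float = 5.0, grace: float = 30.0) -> None:
        self.min_interval = min_interval
        self.grace = grace
        self._clients: Dict[str, ClientState] = {}

    def allow(self, client: str, now: float) -> bool:
        state = self._clients.get(client)
        if state is None:
            # First request from this client: nothing to compare against.
            self._clients[client] = ClientState(last_seen=now)
            return True
        if state.blocked:
            return False

        gap = now - state.last_seen
        state.last_seen = now

        if gap >= self.min_interval:
            # The client waited long enough; forget any earlier streak.
            state.violating_since = None
            return True

        # Too fast: start or continue a violation streak.
        if state.violating_since is None:
            state.violating_since = now
        if now - state.violating_since > self.grace:
            state.blocked = True
            return False
        return True
```

A short burst is tolerated; only a client that stays under the
5-second spacing for more than half a minute gets cut off. In this
sketch a block is permanent; a real setup would presumably expire it.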
We run our web site, not yours, and if our system administrators
feel your site's eating more load than it should it may well get
cut off at some point.
I understand and completely agree. Had I known any of this before
starting up, I would have gone about things in a very different
manner. My problem--not, I agree, yours, but I am just *asking*
for a little help, not trying to demand anything--is not that I
don't want to or won't convert to a local system using database
dumps; I will, _but_ this is an ongoing project, and it will no
doubt take me some little time to learn enough to set up and
operate a database system and the associated software, matters of
which I am at present wholly ignorant.
A general recommendation for remote loading: if you're not
already doing this, always use a relevant user-agent string which
identifies your site well enough that the other site's admins can
find and contact you. (Including an e-mail address is strongly
recommended; an URL might be insufficient if your site is being
slashdotted or DoSed.)
Thank you for the idea; I have now added a "From: " header giving
my email.
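
As an illustration of the recommendation, a request carrying both an
identifying User-Agent string and a "From: " header might be built as
in the sketch below. It uses Python's standard urllib.request; the
function name build_request, the site name, the URLs and the email
address are all placeholders, not the real values.

```python
import urllib.request


def build_request(url: str, site_name: str, site_url: str,
                  contact_email: str) -> urllib.request.Request:
    """Return a Request that tells the remote admins who is fetching
    and how to reach them (hypothetical helper)."""
    # The User-Agent names the site and includes both a URL and an
    # email address, since the URL alone may be unreachable.
    user_agent = f"{site_name}/1.0 (+{site_url}; {contact_email})"
    headers = {
        "User-Agent": user_agent,
        "From": contact_email,
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder values only; nothing is sent until urlopen() is called.
req = build_request(
    "https://example.org/some/page",
    site_name="ExampleMirror",
    site_url="https://mirror.example.org/",
    contact_email="admin@example.org",
)
```

Passing req to urllib.request.urlopen() would then send both headers
with the fetch.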
So, again, I wonder: should I manage to quickly adapt or invent a
usable parser and take files "on demand" (as visitors hit a topic)
purely via XML, would I be allowed to operate, at least for the time
(which could be a while) that a crash course in database management
and some associated things takes me?
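
To make the "on demand" idea concrete while still following the
caching suggestion, the fetch could sit behind a small cache, so that
repeated visitor hits on one topic cause only one remote request per
interval. The sketch below is hypothetical: the class name
OnDemandCache, the one-day default max_age, and the fetch callable
(which stands in for whatever actually retrieves the XML) are all
assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class OnDemandCache:
    """Fetch a topic only when it is missing or older than max_age
    seconds (hypothetical sketch)."""

    def __init__(self, fetch: Callable[[str], str],
                 max_age: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # stand-in for the real remote XML fetch
        self._max_age = max_age
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, topic: str) -> str:
        now = self._clock()
        entry = self._store.get(topic)
        if entry is not None and now - entry[0] < self._max_age:
            # Fresh enough: serve the local copy, no remote hit.
            return entry[1]
        # Missing or stale: fetch once and remember when we did.
        body = self._fetch(topic)
        self._store[topic] = (now, body)
        return body
```

An in-memory dictionary is used here only to keep the example short;
files on disk would serve the same purpose and survive a restart.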
I have begun working on such a parser (I discovered that trying to
adapt the existing ones is likely more work than starting anew); I
have asked the sysadmin I first dealt with, via email, if I could
have a "grace period" of about a week to continue as I have been
for the past several months while I develop and test a parser and
switch to XML exports, after which I will begin trying to set up a
full dump-based back end, which, as I say, will be quite a learning
process. But I thought it best to also make that request here, so
as to not seem to be trying to go around the community in any way.
Should I be allowed the grace period (I suggested this coming
Friday, but time having gone by, I would ask instead for this
coming Sunday), would someone lift the current block?
I apologize for the time this thread is taking, but I have acted in
good faith, if, it appears, in ignorance. I want to get into full
compliance as soon as I can, but I am praying that I will not be
knocked wholly off the air till I can accomplish that.