Eric Walker wrote:
Brion Vibber <brion <at> pobox.com> writes:
We make available public database dumps of the page databases for all
our wikis for the express purpose of making it easy for people to reuse
massive amounts of our content on their own sites as well as to perform
private research, republishing in other formats, etc. Updates are
somewhat intermittent while we're moving database servers around, but
occur roughly every couple of weeks. The last dump was made on May 16.
They're available at http://dumps.wikimedia.org/
Yes, so I have discovered. But I wonder about the trade-off between my
taking files, especially if I can manage to use the XML feeds effectively,
"on demand" versus my taking many gigabytes of data every two to four
weeks--which would average to something like a gigabyte a day. Which--in
any sense of the word--"costs" more?
Downloading a dump from time to time takes a bit of bandwidth and
virtually no processing power for us. Our scarce resources are in the
database load and the wiki's web processing, which aren't touched by this.
We pump out something on the order of 100Mbps, so if you were to
download, say, the 900 megabyte English Wikipedia current-article dump
once a week that would be a pretty small dent.
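To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch in Python. It assumes decimal units (1 megabyte = 10^6 bytes, 1 Mbps = 10^6 bits per second) and treats the 100 Mbps figure as a sustained average; the function name and constants are illustrative only and do not come from any Wikimedia tool.

```python
# Rough arithmetic only: average bandwidth of a periodic download,
# compared with the ~100 Mbps total outbound traffic mentioned above.

SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY
TOTAL_OUTBOUND_MBPS = 100.0  # figure quoted in the message, assumed to be an average


def average_mbps(megabytes: float, period_seconds: float) -> float:
    """Average bit rate in Mbps when `megabytes` are fetched once per period."""
    return megabytes * 8 / period_seconds


# The 900 MB current-article dump fetched once a week.
weekly_dump_mbps = average_mbps(900, SECONDS_PER_WEEK)

# The "something like a gigabyte a day" estimate from the question.
daily_gigabyte_mbps = average_mbps(1000, SECONDS_PER_DAY)

for label, rate in [("900 MB per week", weekly_dump_mbps),
                    ("1 GB per day", daily_gigabyte_mbps)]:
    share = rate / TOTAL_OUTBOUND_MBPS
    print(f"{label}: {rate:.4f} Mbps on average, {share:.4%} of the total")
```

Under these assumptions both download patterns come out well below a tenth of a percent of the quoted outbound traffic, and neither touches the database or web front ends.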
We cannot guarantee that you will never be blocked; if your site becomes
problematic it may very well be, but if the site is well-behaved it
probably will not be.
"Well-behaved" is, I fear, in the eye of the beholder. I dearly want to be
well-behaved, and a good citizen in all ways, but I am still not sure I know
what the considerations are. Was I 403'ed because the actual volume of
transactions was too high, or simply because I was using "remote loading"?
Generally, if your volume's big enough to be noticed leeching page views
and searches off our main servers, you've got a fair chance of being
blocked. (I seem to remember hearing something about the search engine
being hit in this case, but I may be thinking of something else.)
Having your rewritten output include ad banners or remove the Wikipedia
attribution, copyright/licencing info, etc. is pretty much certain to get
you blocked. (I have never seen your web site personally, so I don't know
if these apply.) We are under no obligation to act as a back-end web
server for other people's web sites.
Yet again, just so, but . . . . I am already paying what is, for me as an
individual, anyway, a dearly high monthly fee for a "high-volume" account, yet
I get but 5 GB of storage, scarcely a fraction of even just the
English-language database.
I hate to say "not my problem", but honestly it's your responsibility,
not ours, to find and pay for the hosting of *your* web site.
I wonder that there is not in place some system for dealing with situations
like mine, which I can hardly think unique. I would be more than happy to
pay some plausible fee for accesses, either to XML or to HTML, rather than
try to re-invent the wheel.
[snip]
I realize that Wikipedia is "open source", which is usually held to be equal
to "free", but is there no place for mixing "free to public visitors" and a
modest fee for serving, shall we say, on the wholesale rather than the
retail side?
You may wish to propose this to the Wikimedia Foundation board of
directors; they can be reached by e-mail at: board at wikimedia.org
Meanwhile, my question for the instant moment is: if the block is taken off,
and I take the XML Export files on an on-demand basis, will I be OK?
If your site doesn't appear to violate the content license, and you
follow earlier suggestions about caching data (preferably backed by your
own local copy to begin with) and you limit your hit rate and don't
abuse our search engine or other tools, you probably won't get blocked.
But again, there are no guarantees. We run our web site, not yours, and
if our system administrators feel your site's eating more load than it
should, it may well get cut off at some point.
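As a rough illustration of the caching and hit-rate advice above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class name, the two-second minimum interval and the one-day cache lifetime are hypothetical, not limits stated by Wikimedia. The remote fetch is passed in as a function, so the sketch makes no claim about which URL or export interface you would actually call.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ThrottledCache:
    """Serve pages from a local cache and space out requests to the remote site.

    Hypothetical helper for illustration; the defaults are placeholders.
    """

    def __init__(self,
                 fetch: Callable[[str], str],
                 min_interval: float = 2.0,     # seconds between remote requests
                 max_age: float = 86400.0,      # seconds a cached copy stays fresh
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._max_age = max_age
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, Tuple[float, str]] = {}
        self._last_request: Optional[float] = None

    def get(self, title: str) -> str:
        now = self._clock()
        entry = self._cache.get(title)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # fresh local copy: no remote request at all

        # Cache miss or stale copy: wait until the minimum interval has passed.
        if self._last_request is not None:
            wait = self._min_interval - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)

        self._last_request = self._clock()
        text = self._fetch(title)
        self._cache[title] = (self._clock(), text)
        return text
```

A site would wrap its real download function in this, for example `pages = ThrottledCache(my_fetch)` followed by `pages.get("Some title")`, where `my_fetch` is whatever routine retrieves one page. The in-memory dictionary could be swapped for files on disk or a copy loaded from a dump so that the cache survives restarts.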
A general recommendation for remote loading: if you're not already doing
this, always use a relevant user-agent string which identifies your site
well enough that the other site's admins can find and contact you.
(Including an e-mail address is strongly recommended; a URL might be
insufficient if your site is being slashdotted or DoSed.)
If we know who you are and how to contact you, and we know that you've
made a good-faith effort, you're more likely to get the benefit of the
doubt. But if there's a necessity, such as a huge flash crowd sending
far too many requests to us due to poor throttling on your site, you
might well find yourself blocked without prior notice.
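The user-agent recommendation above might look something like the following in Python, using only the standard library. The site name, version, URLs and e-mail address are placeholders invented for this sketch, and the exact layout of the string is a common convention rather than anything required by the message.

```python
import urllib.request

# Hypothetical identification details; substitute your own site and a
# mailbox that someone actually reads.
USER_AGENT = "ExampleMirror/1.0 (+https://www.example.com/; webmaster@example.com)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a request that identifies the calling site."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; nothing is sent until the request is passed to urlopen().
request = build_request("https://example.org/some/page")
print(request.get_header("User-agent"))
```

Passing `request` to `urllib.request.urlopen()` would send it with that header. Keeping the contact e-mail in the string means the remote administrators can still reach you if your own site is unreachable.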
-- brion vibber (brion @ pobox.com)