Eric Walker wrote:
I have been fetching individual articles from the Wikipedia site as
visitors request them (once fetched, they are given some PHP-based
processing, and the rest of the page is built around them). Apparently that
is a no-no, owing, I am told, to the server load, especially from
search bots that may follow links out of the pages.
I was, after some months of operation, suddenly hit with a 403 block; on
inquiry, I discovered the facts above. I then asked whether using instead
the Special:Export XML access would be an acceptable way of fetching
articles individually on demand. The sysadmin wrote that he felt it would,
but that it would be best for me to post here to see whether others agree or disagree.
Hi Eric. Our resources are provided by donations from our users and
supporters to keep Wikipedia and our other projects available to human
readers and contributors.
Please remember that you're using someone else's servers, paid for with
someone else's money, to run your web site for you. As an uninvited
guest, you need to be mindful about how you use your host's resources.
Limit the number of connections you make and cache resources locally
once they've been retrieved. Understand and use HTTP caching headers
when available (such as the Last-Modified and If-Modified-Since
headers). In particular remember that a rush of connections to your
site, such as a flash crowd (slashdot!) or a search engine spidering
loops of links, can cause your site to pass a *huge* number of requests
on to ours.
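
To make the caching idea concrete, a conditional fetch in PHP might look
roughly like the sketch below (just an outline; the function name, cache
layout, and User-Agent string are placeholders you would choose yourself):

<?php
// Rough sketch of a conditional fetch with a local file cache.
// The function name, cache layout and User-Agent are placeholders.
function fetch_cached($url, $cacheFile) {
    $headers = array();
    if (file_exists($cacheFile)) {
        // Ask the server to send the page only if it changed since our copy.
        $headers[] = 'If-Modified-Since: '
            . gmdate('D, d M Y H:i:s', filemtime($cacheFile)) . ' GMT';
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleSite/1.0 (webmaster@example.com)');
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status == 304) {
        // Not modified: reuse the local copy instead of transferring it again.
        return file_get_contents($cacheFile);
    }
    if ($status == 200 && $body !== false) {
        file_put_contents($cacheFile, $body);
        return $body;
    }
    // On errors (403, 5xx, ...) fall back to whatever we have cached.
    return file_exists($cacheFile) ? file_get_contents($cacheFile) : false;
}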
We make available public database dumps of the page databases for all
our wikis for the express purpose of making it easy for people to reuse
massive amounts of our content on their own sites, as well as for
private research, republishing in other formats, and so on. Updates are
somewhat intermittent while we're moving database servers around, but
occur roughly every couple of weeks. The last dump was made on May 16.
They're available at
http://dumps.wikimedia.org/
I would strongly recommend that you make use of these database dumps if
possible, and avoid hitting our servers at all. If you _really_ need the
most up-to-date pages you can use the Special:Export interface to grab
source text, and render it within your own MediaWiki installation.
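
Pulling a single page through Special:Export can be as simple as the
sketch below (again just a sketch; the hostname, User-Agent, and error
handling are placeholders, and you would want to add the same local
caching as above):

<?php
// Rough sketch: fetch one article's wikitext via Special:Export.
// Hostname, User-Agent and error handling are placeholders.
function get_wikitext($title) {
    $url = 'http://en.wikipedia.org/wiki/Special:Export/' . rawurlencode($title);

    $ctx = stream_context_create(array('http' => array(
        'user_agent' => 'ExampleSite/1.0 (webmaster@example.com)',
    )));
    $xml = file_get_contents($url, false, $ctx);
    if ($xml === false) {
        return false;
    }

    // Special:Export wraps the article source in a <text> element.
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return false;
    }
    $nodes = $doc->getElementsByTagName('text');
    return $nodes->length ? $nodes->item(0)->textContent : false;
}

// Example: $wikitext = get_wikitext('Main Page');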
I realize that there is no easy way to convert the marked-up text to
HTML, but I am prepared to cobble up some PHP to essay the task--but,
before going to that nontrivial effort,
Our code's all open source and you should feel free to use it for this
purpose:
http://www.mediawiki.org/
I would like to be sure that I will not
again be blocked even if I am accessing individual articles via
Special:Export XML. (At present, I seem to be getting perhaps 20,000
visitors a day.)
We cannot guarantee that you will never be blocked; if your site becomes
problematic it may very well be, but if the site is well-behaved it
probably will not be.
Most of all, remember that if you use a complete database dump you avoid
any reliance on our site being up, reachable, and willing to serve you at
any given time. This will make your site more resilient against downtime,
network troubles, and slow servers, as well as the possibility that you
might get blocked.
-- brion vibber (brion @ pobox.com)