On 21/07/2009, at 6:48 PM, Daniel Schwen wrote:
wouldn't it be faster than to actually create a static HTML dump the traditional way?
The content is wiki-text. It has to be parsed to be turned into HTML. There isn't a more traditional way, because there is no other way.
Wouldn't it be possible to dump the parser cache instead of dumping XML and reparsing? All the parsing work is already done on the Wikimedia servers, so why do it again on a slow desktop system?
For a few reasons:
1/ There's no reason to expect that the contents of every page, revision, et cetera, would be in the parser cache.
2/ Deleted or otherwise private revision content may remain in the parser cache.
3/ There would be a lot of redundant content in the parser cache, owing to people browsing with the same options (see the sketch after this list).
4/ None of the useful article metadata is stored in the parser cache.
5/ The parser cache is stored in memcached, a hash-based store that cannot simply be "dumped", let alone dumped while selectively excluding all of the other things stored in memcached (including quite a bit of private data).
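
To make reason 3 concrete, here is a small, hypothetical sketch (not MediaWiki's actual key scheme) of how a parser cache keyed on both the page and a hash of each reader's rendering options ends up holding near-duplicate copies of the same rendered page, one per option set in use:

import hashlib

def parser_cache_key(page_id, options):
    # Hash the rendering options (language, thumbnail size, date format, ...)
    # so each distinct option set maps to its own cache slot.
    opt_hash = hashlib.md5(repr(sorted(options.items())).encode()).hexdigest()
    return "pcache:%d:%s" % (page_id, opt_hash)

# The same page ends up cached twice because two option sets are in use.
print(parser_cache_key(42, {"lang": "en", "thumbsize": 250}))
print(parser_cache_key(42, {"lang": "en", "thumbsize": 180}))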
It might, however, be sensible to generate parsed HTML for every page, save the resulting files in a directory, and then zip it up.
Oh, wait...
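
(For anyone who actually wants to build such a thing by hand, a minimal sketch of that workflow, assuming the public action=parse web API and the Python requests library; the endpoint and the title list are placeholders, and a real run would iterate over every page:)

import os
import zipfile
import requests

API = "https://en.wikipedia.org/w/api.php"
titles = ["Example", "Wiki"]  # placeholder; really every page title

os.makedirs("html_dump", exist_ok=True)
for title in titles:
    # Ask MediaWiki to parse the current revision of the page into HTML.
    r = requests.get(API, params={
        "action": "parse", "page": title, "prop": "text",
        "format": "json", "formatversion": "2",
    })
    html = r.json()["parse"]["text"]
    with open(os.path.join("html_dump", title + ".html"), "w", encoding="utf-8") as f:
        f.write(html)

# Zip up the directory of rendered pages.
with zipfile.ZipFile("html_dump.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in os.listdir("html_dump"):
        zf.write(os.path.join("html_dump", name))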
--
Andrew Garrett
agarrett@wikimedia.org
http://werdn.us/