Anthony wrote:
There are a ton of possible solutions. Do you have access to the dump
server and permission to implement any of them? That seems to be the
bottleneck.
Yes, I have access, but I don't have time. You don't need access to the
dump server to implement improvements; it's all open source:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
You can just submit a patch.
Personally, I think a good backward-compatible improvement would be to
only regenerate the parts of the bzip2 file which have changed. Bzip2
resets its compression every 900K or so of uncompressed text, plus the
specification treats the concatenation of two bzip2 files as
decompressing to the same as the bzip2 of the concatenation of the two
uncompressed files (I hope that made sense). So if the files are
ordered by title then by revision time there should be a whole lot of
chunks which don't need to be uncompressed/recompressed every dump,
and from what I've read compression is the current bottleneck.
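
The concatenation property described above can be checked with a short
Python sketch. It is only an illustration under stated assumptions, not
the dump scripts' actual code: the chunking, the build_dump helper and
the in-memory cache are all hypothetical. It also writes each chunk as a
separate bzip2 stream and concatenates the streams, which is simpler than
reusing individual blocks inside a single stream.

```python
import bz2
import hashlib


def build_dump(chunks, cache):
    """Concatenate one bzip2 stream per chunk, compressing only unseen chunks.

    `chunks` is a list of bytes objects (e.g. groups of pages in a stable
    order); `cache` maps a chunk's SHA-1 digest to its compressed stream.
    Returns the concatenated dump and the number of chunks compressed.
    """
    parts = []
    recompressed = 0
    for chunk in chunks:
        key = hashlib.sha1(chunk).hexdigest()
        if key not in cache:
            # Only chunks whose content changed since the last run land here.
            cache[key] = bz2.compress(chunk)
            recompressed += 1
        parts.append(cache[key])
    return b"".join(parts), recompressed


cache = {}
old_chunks = [b"<page>Apple</page>\n" * 500, b"<page>Banana</page>\n" * 500]
old_dump, old_count = build_dump(old_chunks, cache)

# Next run: the first chunk is unchanged, the second gained a revision.
new_chunks = [old_chunks[0], old_chunks[1] + b"<page>Banana v2</page>\n"]
new_dump, new_count = build_dump(new_chunks, cache)

# bz2.decompress reads concatenated streams as one continuous output.
assert bz2.decompress(new_dump) == b"".join(new_chunks)
print(old_count, new_count)
```

Standard bzip2 tools should read such a multi-stream file like any other
.bz2 file, which is what would keep the approach backward compatible.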
That's an interesting theory.
But, well, I don't have access to the dump server, or even to the
toolserver, so I couldn't implement it even if I did have the time.
Couldn't you just set up a test server at home, operating on a reduced
data set?
And last but not least: if the dumps don't work, then it is very
important to be able to dump some articles with their full histories in
other ways. I repeat my plea: do you know who imposed the limit so that
export only allows 100 revisions? Any way to hack that? Would it be
possible to make an exception to get the data for a research study?
There's an offset parameter which allows you to get specified revisions
or revision ranges. Read the relevant code in includes/SpecialExport.php
before use; it's a bit counterintuitive (buggy?).
How much are we allowed to use this without getting blocked?
Please don't walk that line. If you only stop when a sysadmin notices
that you're slowing down the servers, you've gone way too far. Stick to
a single thread.
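
As a rough illustration of "a single thread", here is a minimal Python
sketch that issues requests strictly one after another with a pause in
between. Everything specific in it is an assumption: the base URL is a
placeholder, the "pages" and "limit" parameter names and the 5-second
delay are guesses for illustration, and the fetch callable is left to
the caller. Check includes/SpecialExport.php for how offset really
behaves before relying on any of it.

```python
import time
from urllib.parse import urlencode

# Placeholder, not a real endpoint; substitute the wiki you are querying.
EXPORT_URL = "https://example.org/wiki/Special:Export"


def export_url(title, offset, limit=100):
    """Build one export request URL (parameter names other than offset are assumed)."""
    return EXPORT_URL + "?" + urlencode(
        {"pages": title, "offset": offset, "limit": limit}
    )


def fetch_sequentially(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, one at a time.

    There is no concurrency: each request finishes before the next starts,
    and the loop sleeps between consecutive requests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

A caller would pass its own fetch function (for example one built on
urllib.request.urlopen) and should pick a generous delay.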
-- Tim Starling