Di (rut) wrote:
Dear All,
especially Anthony and Platonides,
I'm not techy, so why hasn't it been possible to have a non-corrupt dump
(one that includes history) in a long time? A professor of mine asked
whether the problem could be man(person)-power, and whether it would be
interesting/useful to have the university provide a programmer to help the
dump happen.
See my blog posts discussing this matter:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
http://leuksman.com/log/2007/10/14/incremental-dumps/
http://leuksman.com/log/2007/10/29/wiki-dumps-in-dump-revision-diffs/
The general problem is that there's a lot of data and compressing it
takes an ungodly amount of time. When it takes forever to run, you're
more likely to hit some cute little error in the middle which causes the
process to fail.
Either we need to make the process more resistant to problems or we need
to speed it up a lot, or both.
Splitting up the dump into smaller pieces which can be checkpointed
(Tim's suggestion), or a recoverable version of the
grab-text-from-the-database subprocess (my suggestion) would allow a
dump run broken by a lost database connection to continue to completion.
(These are not mutually exclusive options.)
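For a rough sense of what the checkpointing looks like (just a sketch, not
our actual dump code; the checkpoint file, batch size, and fetch/write
callbacks here are made up for illustration), the idea is to record how far
the run got after each batch and pick up from there instead of starting over:

import json
import os

CHECKPOINT_FILE = "dump.checkpoint"   # hypothetical progress file
BATCH_SIZE = 1000                     # pages per checkpointed batch

def load_checkpoint():
    """Return the last page_id successfully dumped, or 0 on a fresh run."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_page_id"]
    return 0

def save_checkpoint(last_page_id):
    """Record progress so a broken run can resume rather than restart."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_page_id": last_page_id}, f)

def dump_pages(fetch_batch, write_batch):
    """Dump pages in batches, checkpointing after each one.

    fetch_batch(after_id, limit) -> list of page records ({"id": ..., "text": ...})
    write_batch(pages)           -> appends the batch to the output file
    """
    last_id = load_checkpoint()
    while True:
        try:
            pages = fetch_batch(last_id, BATCH_SIZE)
        except ConnectionError:
            # Lost the database connection: real code would reconnect and
            # back off, then retry from the same checkpoint.
            continue
        if not pages:
            break
        write_batch(pages)
        last_id = pages[-1]["id"]
        save_checkpoint(last_id)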
The cost of splitting the dump is complication for users -- more files
to fetch, more difficulty for automation, possibly changes to client
scripts required. But smaller files to work with in batches are also a
popular idea.
Replacing thousands of revisions bzipped or 7zipped together with a smarter
diff format would reduce the amount of slow general-purpose compression
needed to get a decent download size. That should also cut the run time,
making it more likely that a history dump will run to completion without
hitting an error.
This would involve changing the format, though, necessitating even more
changes to client software for compatibility.
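The gist, illustrated with a throwaway Python sketch (unified diffs here just
stand in for whatever real delta format the dump would use): store the first
revision in full and each later one as a small delta, so the slow
general-purpose compressor only has to chew through a fraction of the text:

import difflib

def revisions_to_deltas(revisions):
    """Turn a list of full revision texts into (base, deltas), where each
    delta is a unified diff against the previous revision. Successive
    revisions are usually near-identical, so the deltas are tiny."""
    if not revisions:
        return "", []
    base = revisions[0]
    deltas = []
    prev = base
    for text in revisions[1:]:
        diff = "".join(difflib.unified_diff(
            prev.splitlines(keepends=True),
            text.splitlines(keepends=True)))
        deltas.append(diff)
        prev = text
    return base, deltas

# Toy example: three near-identical revisions of one page.
revs = ["Hello world.\n" + "filler line\n" * 200,
        "Hello world!\n" + "filler line\n" * 200,
        "Hello brave new world!\n" + "filler line\n" * 200]
base, deltas = revisions_to_deltas(revs)
# The compressor now sees the base text plus two short diffs instead of
# three full copies of the page.
print(sum(len(r) for r in revs), len(base) + sum(len(d) for d in deltas))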
Alas, this hasn't yet gotten all the work it needs.
Currently we have a programming staff of two (me and Tim) jumping back
and forth between too many projects and our own relocations, and neither
of us has gotten to the finish line on this project yet. Neither has any
other interested party so far.
(Note that the foundation will be hiring a couple more programmers for
2008, as we get the San Francisco office set up.)
Also, I now have a file from 2006, but I still wonder whether there is
anywhere one can access old dumps; these could be very important
research-wise.
I have a fair number of *old* dumps sitting around at the office, but
I'm not sure if I have any medium-depth ones. We don't generally keep
old dumps up for download, but I could possibly provide an individual
one if needed for research purposes.
And last but not least: if the dumps don't work, then it is very important
to be able to dump some articles with their full histories by other means.
I make my plea again: do you know who put the limit in place so that export
only allows 100 revisions? Is there any way around that? Would it be
possible to make an exception to get the data for a research study?
That was originally done because buffering would cause a longer export to
fail. The export code has since been changed to skip buffering, so the limit
could possibly be lifted. I'll take a peek.
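In the meantime, a research script can usually slice the history itself:
request 100 revisions at a time from Special:Export, passing the timestamp of
the last revision seen as the offset for the next request. A rough sketch in
Python (the wiki URL and page title are placeholders, the export XML
namespace varies by MediaWiki version, and this assumes the standard
pages/offset/limit/action=submit parameters):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"   # example wiki
NS = "{http://www.mediawiki.org/xml/export-0.3/}"             # varies by version

def fetch_slice(title, offset, limit=100):
    """Fetch one slice of a page's revision history as export XML."""
    params = urllib.parse.urlencode({
        "pages": title, "offset": offset, "limit": limit, "action": "submit"})
    with urllib.request.urlopen(EXPORT_URL, data=params.encode()) as resp:
        return resp.read()

def full_history(title, limit=100):
    """Walk the history in slices, using the last timestamp seen as the
    next offset. Depending on the MediaWiki version the boundary revision
    may be repeated, so de-duplicate by revision id if that matters."""
    offset = "2000-01-01T00:00:00Z"   # earlier than any revision
    revisions = []
    while True:
        root = ET.fromstring(fetch_slice(title, offset, limit))
        batch = root.findall(f".//{NS}revision")
        if not batch:
            break
        revisions.extend(batch)
        offset = batch[-1].find(f"{NS}timestamp").text
        if len(batch) < limit:
            break
    return revisions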
-- brion vibber (brion @ wikimedia.org)