On Nov 18, 2007 8:52 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
Di (rut) wrote:
Dear All,
specially Anthony and Platonides,
I hope the heroic duo will give their blessing to this post.
My name is Anthony DiPierro, and I approve this message :).
I'm not techy - so why hasn't it been possible to produce a non-corrupt dump
(one that includes history) in such a long time? A professor of mine asked
whether the problem could be (hu)man-power, and whether it would be
interesting/useful for the university to help out with a programmer to make
the dump happen.
In my opinion, it would be a lot easier to generate a full dump if it were
split into multiple XML files for each wiki. Then the job could be
checkpointed at the file level. Checkpoint/resume is quite difficult with
the current single-file architecture.
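With a per-file split, a finished file itself can serve as the checkpoint
marker. A minimal sketch of that idea (all names here are hypothetical, not
the actual dump scripts):

```python
import os

def dump_all(wikis, outdir, dump_one):
    # dump_one(wiki, path) is assumed to write one wiki's XML dump to path.
    # A completed file doubles as the checkpoint marker, so a crashed run
    # can be resumed simply by re-running this loop.
    for wiki in wikis:
        path = os.path.join(outdir, wiki + ".xml")
        if os.path.exists(path):
            continue  # already dumped in a previous run
        tmp = path + ".tmp"
        dump_one(wiki, tmp)
        os.rename(tmp, path)  # atomic: the file appears only when complete
```

Writing to a temporary name and renaming at the end means a half-written
file from a crashed run is never mistaken for a finished one.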
Tolerant parsers on the client side would help a bit. A dump shouldn't be
considered "failed" just because it has a region of garbage and some
unclosed tags in the middle of the file.
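A tolerant reader along these lines is easy to sketch: instead of handing the
whole file to a strict XML parser, scan for complete <page> blocks and skip
whatever lies between them (a toy illustration, not production code):

```python
import re

def iter_pages(dump_text):
    # Yield each complete <page>...</page> block; regions of garbage or
    # truncated markup between blocks are skipped instead of aborting
    # the whole parse.
    for m in re.finditer(r"<page>.*?</page>", dump_text, re.DOTALL):
        yield m.group(0)
```

A real implementation would stream rather than hold the dump in memory, but
the principle is the same: salvage every intact page and report the rest.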
There are a ton of possible solutions. Do you have access to the dump
server and permission to implement any of them? That seems to be the
bottleneck.
Personally, I think a good backward-compatible improvement would be to
regenerate only the parts of the bzip2 file which have changed. Bzip2
resets its compression every 900K or so of uncompressed text, and the
specification treats the concatenation of two bzip2 files as
decompressing to the same output as the bzip2 of the concatenation of the
two uncompressed files (I hope that made sense). So if the files are
ordered by title and then by revision time, there should be a whole lot of
chunks which don't need to be uncompressed/recompressed on every dump,
and from what I've read, compression is the current bottleneck.
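The concatenation property described above is easy to verify with any
multi-stream-aware decompressor, e.g. Python's bz2 module:

```python
import bz2

# bzip2 streams are self-delimiting, so concatenating two compressed
# files yields a valid file that decompresses to the concatenation of
# the two inputs -- which is what makes chunk-level reuse possible.
part1 = bz2.compress(b"pages A through M ")
part2 = bz2.compress(b"pages N through Z")
combined = part1 + part2

# bz2.decompress handles multi-stream input transparently.
assert bz2.decompress(combined) == b"pages A through M pages N through Z"
```

An incremental dump could therefore keep unchanged 900K chunks compressed
and re-run bzip2 only on the chunks whose pages actually changed.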
But, well, I don't have access to the dump server, or even to the
toolserver, so I couldn't implement it even if I did have the time.
And last but
not least: if the dumps don't work, then it is very important
to be able to dump some articles with their full histories in other
fashions. I make my plea again - do you know who added the restriction so
that export only allows 100 revisions? Any way to hack around that? Would
it be possible to make an exception to get the data for a research study?
There's an offset parameter which allows you to get specified revisions or
revision ranges. Read the relevant code in includes/SpecialExport.php
before use; it's a bit counterintuitive (buggy?).
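For scripted use, the export parameters can go in the query string. A sketch
of building such a request - the parameter names (pages, history, offset,
limit) are assumptions from that era's Special:Export, so check them against
SpecialExport.php before relying on this:

```python
from urllib.parse import urlencode

def export_url(page, offset, limit=100,
               base="https://en.wikipedia.org/w/index.php"):
    # Build a Special:Export URL requesting up to `limit` history
    # revisions of `page` starting at `offset` (a revision timestamp).
    # Parameter names are assumptions; verify against SpecialExport.php.
    params = {
        "title": "Special:Export",
        "pages": page,
        "history": "1",
        "offset": offset,
        "limit": limit,
    }
    return base + "?" + urlencode(params)
```

Paging through a long history would then mean repeating the request with
offset set to the timestamp of the last revision received.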
How much are we allowed to use this without getting blocked?