On Nov 18, 2007 10:16 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
> Anthony wrote:
>> There are a ton of possible solutions. Do you have access to the dump
>> server and permission to implement any of them? That seems to be the
>> bottleneck.
> Yes I have access, but I don't have time. You don't need access to the
> dump server to implement improvements, it's all open source:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
> You can just submit a patch.
I wasn't aware that the source code to the backup server was open.
Now I guess I can't complain unless I have code :).
>> Personally, I think a good backward-compatible improvement would be to
>> regenerate only the parts of the bzip2 file which have changed. Bzip2
>> starts a fresh compression block every 900K or so of uncompressed
>> text, and the format treats the concatenation of two bzip2 streams as
>> decompressing to the concatenation of the two uncompressed files (I
>> hope that made sense). So if the files are ordered by title and then
>> by revision time, there should be a whole lot of chunks which don't
>> need to be decompressed and recompressed on every dump, and from what
>> I've read compression is the current bottleneck.
> That's an interesting theory.
>> But, well, I don't have access to the dump server, or even to the
>> toolserver, so I couldn't implement it even if I did have the time.
> Couldn't you just set up a test server at home, operating on a reduced
> data set?
Yes, I could. One thing stopping me has been that I didn't have much
of a clue how the dumps were actually being made. Now that I know
about the source code, maybe I can do a little better.
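For what it's worth, the stream-concatenation property quoted above is easy to check with Python's bz2 module (a quick sketch to illustrate the point, not anything from the dump code; multi-stream decompression needs Python 3.3+):

```python
import bz2

# The format-level property: the concatenation of two bzip2 streams
# decompresses to the concatenation of the two original inputs, so
# unchanged streams could in principle be copied into a new dump as-is.
a = bz2.compress(b"first chunk of article text\n")
b = bz2.compress(b"second chunk of article text\n")

# bz2.decompress accepts multi-stream input (Python 3.3+).
combined = bz2.decompress(a + b)
assert combined == b"first chunk of article text\nsecond chunk of article text\n"
```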
I already have most of the random access *reading* completed. It was a
simple hack to bzip2recover (which has a very small source file).
Don't credit me with the idea, though; I stole it from Thanassis
Tsiodras (http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html).
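The core of that bzip2recover hack is just scanning for block boundaries. A rough Python sketch of the idea (my own illustration, not the actual C code): each compressed block begins with the 48-bit magic 0x314159265359, and blocks are bit-aligned rather than byte-aligned, so the scan has to proceed bit by bit.

```python
# Sketch of how bzip2recover-style tools find block boundaries.
# The magic can also occur by chance inside compressed data, so real
# tools sanity-check each candidate block before trusting it.

BLOCK_MAGIC = 0x314159265359  # first 12 hex digits of pi
MASK48 = (1 << 48) - 1

def find_block_bit_offsets(path):
    """Yield the bit offset of each occurrence of the block magic."""
    window = 0
    bits_read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 16)
            if not chunk:
                return
            for byte in chunk:
                for shift in range(7, -1, -1):
                    window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
                    bits_read += 1
                    if bits_read >= 48 and window == BLOCK_MAGIC:
                        yield bits_read - 48
```

With those offsets indexed, a reader can seek near a block boundary and decompress just that block instead of the whole file.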
> Read the relevant code in includes/SpecialExport.php before use, it's
> a bit counterintuitive (buggy?).
>> How much are we allowed to use this without getting blocked?
> Please don't walk that line. If you only stop when a sysadmin notices
> that you're slowing down the servers, you've gone way too far. Stick
> to a single thread.
I can handle an awful lot in a single thread, using the API. I have
no idea if it'd hurt the server to do so, though.
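One way to stay on the safe side of that advice is the API's maxlag parameter, which asks the servers to reject requests whenever replication lag is high. A hypothetical sketch (the endpoint URL and titles here are my assumptions, not anything from this thread):

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint

def build_export_url(titles, maxlag=5):
    """Build an action=query&export URL that asks the servers to refuse
    the request whenever replication lag exceeds `maxlag` seconds; a
    polite single-threaded client sleeps and retries on that error."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "export": "1",
        "format": "xml",
        "maxlag": str(maxlag),
    }
    return API_URL + "?" + urllib.parse.urlencode(params)
```

A client built this way backs off automatically whenever the database servers are struggling, rather than waiting for a sysadmin to notice.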