On Nov 18, 2007 3:33 PM, Platonides <Platonides(a)gmail.com> wrote:
> Anthony wrote:
> > So if the files are ordered by title then by revision time, there
> > should be a whole lot of chunks which don't need to be
> > uncompressed/recompressed every dump, and from what I've read,
> > compression is the current bottleneck.
>
> The backup is based on having it sorted by id. Moreover, even changing
> that (i.e. rewriting most of the code), you'd need to insert in the
> middle whenever a page gets a new revision.
The dump is sorted by page_id, so it's fine. It would probably benefit
from rewriting the code, though, at least porting it to C.
You'd rewrite the entire file, just not recompress all of it. Partial
chunks (like the one at the end of a page_id range) would have to be
uncompressed and recompressed, but fortunately the bzip2 format
compresses data in independent streams, so small chunks are allowed.
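To illustrate the property being relied on here (this is just a sketch
using Python's bz2 module, not the proof of concept mentioned below, and
the chunk contents are made up): independently compressed bzip2 streams
can be concatenated into one valid file, so when a page gets new
revisions only the tail chunk needs recompressing and the stable prefix
is copied byte-for-byte.

```python
import bz2

# Each chunk of pages is compressed as an independent bzip2 stream.
stable = bz2.compress(b"<page>1</page><page>2</page>")
tail = bz2.compress(b"<page>3</page>")

dump = stable + tail  # concatenated streams form one valid bzip2 file

# To update page 3, recompress only the tail; reuse the stable prefix.
new_tail = bz2.compress(b"<page>3 rev2</page>")
new_dump = stable + new_tail

# A multistream-aware decompressor reads it back as one document.
assert bz2.decompress(new_dump) == b"<page>1</page><page>2</page><page>3 rev2</page>"
```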
If I have free time some weekend I'll throw together a proof of
concept. But for now I think the more pressing issue is allowing
resumption of broken dumps.
As for rsync, I don't see the point. The HTTP protocol already allows
random file access via Range requests.
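For concreteness, random access over HTTP means sending a Range header
and getting back a 206 Partial Content response containing only the
requested bytes. A minimal client-side sketch (the URL is a placeholder,
not a real dump location; no request is actually sent here):

```python
import urllib.request

# Ask for 1 KiB starting at the 1 MiB offset of a (hypothetical) dump file.
req = urllib.request.Request(
    "https://dumps.example.org/pages.xml.bz2",  # placeholder URL
    headers={"Range": "bytes=1048576-1049599"},
)

# A server that supports ranges would answer 206 Partial Content with
# just that byte span, so a client can fetch only the chunks it needs.
assert req.get_header("Range") == "bytes=1048576-1049599"
```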