On Nov 18, 2007 3:33 PM, Platonides <Platonides(a)gmail.com> wrote:
> Anthony wrote:
> > So if the files are ordered by title then by revision time, there
> > should be a whole lot of chunks which don't need to be
> > uncompressed/recompressed every dump, and from what I've read,
> > compression is the current bottleneck.
>
> The backup is based on having it sorted by id. Moreover, even changing
> that (i.e. rewriting most of the code), you'd need to insert in the
> middle whenever a page gets a new revision.
The dump is sorted by page_id, so it's fine. It would probably benefit
from rewriting the code, though, at least porting it to C.
You'd rewrite the entire file, just not recompress all of it. Partial
chunks (like the one at the end of a page_id range) would have to be
uncompressed and recompressed, but fortunately the bzip2 format
compresses data in independent streams, so small chunks are allowed.
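To illustrate the property being relied on here (this is just a sketch
using Python's bz2 module, not the proof of concept mentioned below, and
the chunk contents are made up): independently compressed bzip2 streams
can be concatenated into one valid file, so when a page gets new
revisions only the tail chunk needs recompressing and the stable prefix
is copied byte-for-byte.

```python
import bz2

# Each chunk of pages is compressed as an independent bzip2 stream.
stable = bz2.compress(b"<page>1</page><page>2</page>")
tail = bz2.compress(b"<page>3</page>")

dump = stable + tail  # concatenated streams form one valid bzip2 file

# To update page 3, recompress only the tail; reuse the stable prefix.
new_tail = bz2.compress(b"<page>3 rev2</page>")
new_dump = stable + new_tail

# A multistream-aware decompressor reads it back as one document.
assert bz2.decompress(new_dump) == b"<page>1</page><page>2</page><page>3 rev2</page>"
```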
If I have free time some weekend I'll throw together a proof of
concept. But for now I think the more pressing issue is allowing
resumption of broken dumps.
As for rsync, I don't see the point. The HTTP protocol already allows
random file access via Range requests.
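For concreteness, random access over HTTP means sending a Range header
and getting back a 206 Partial Content response containing only the
requested bytes. A minimal client-side sketch (the URL is a placeholder,
not a real dump location; no request is actually sent here):

```python
import urllib.request

# Ask for 1 KiB starting at the 1 MiB offset of a (hypothetical) dump file.
req = urllib.request.Request(
    "https://dumps.example.org/pages.xml.bz2",  # placeholder URL
    headers={"Range": "bytes=1048576-1049599"},
)

# A server that supports ranges would answer 206 Partial Content with
# just that byte span, so a client can fetch only the chunks it needs.
assert req.get_header("Range") == "bytes=1048576-1049599"
```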