Ashar Voultoiz wrote:
On Mon, 09 Feb 2004 09:19:04 +0000, Timwi wrote:
Evan Prodromou wrote:
Some (most?) version control systems use a reverse
diff storage
system to keep storage requirements down to a minimum, at the expense
of retrieval time for old versions.
Do we really have to do that? I always thought storage space is
incredibly cheap (at least that's what LiveJournal always say ;-) ). I'm
finding it somewhat irritating that you want to slow down Wikipedia even
more, just to save some disk space.
Timwi
Hello,
Maybe the data will be accessed faster if the database takes less space? I
still can't believe that the MediaWiki software stores the whole article
each time a modification is made. When the interwiki bot modifies something
like 10,000 pages it generates a lot of almost useless data :( Same for
edit wars that eventually get reverted.
It's true that it's cheap to store 30 GB, but it's expensive to move it
around. The advantages of history compression come not only from reduced
seek rate, but also from an improved ability to cache data in RAM.
The method of history compression I'm leaning towards is to concatenate
a number of adjacent revisions together (say 10) and then compress them
using an LZW-like algorithm. The compression algorithm will take
advantage of the similarities between adjacent revisions to produce a
very high compression ratio. Unlike diffs, this kind of compression
works very well with edit wars. In an edit war that repeatedly reverts
between two revisions, a diff must store the changes from every edit,
whereas, loosely speaking, an LZW algorithm need only store each revision
once.
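
Roughly what I have in mind, sketched in Python with zlib's deflate
standing in for the compressor (illustration only, not the eventual
implementation; the separator and the group size of ten are arbitrary):

import zlib

def compress_group(revisions):
    # Concatenate ~10 adjacent revisions and deflate the lot.
    # The compressor encodes repeated text very cheaply, so an edit
    # war costs little more than one copy of each distinct version.
    blob = "\x00".join(revisions).encode("utf-8")
    return zlib.compress(blob)

def decompress_group(data):
    return zlib.decompress(data).decode("utf-8").split("\x00")

# Toy edit war: two versions alternating, ten revisions in total
a = "".join("Paragraph %d: the quick brown fox jumps over the lazy dog.\n" % i
            for i in range(200))
b = a.replace("lazy", "sleeping")
history = [a, b] * 5

raw = sum(len(r) for r in history)
packed = len(compress_group(history))
print(raw, packed)  # packed is little more than one compressed copy of a and b

A diff-based store would keep a delta for every one of the ten edits here;
the compressor effectively pays for the two distinct versions only.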
My experiments in history compression can be found at:
http://meta.wikipedia.org/wiki/History_compression
Bzip2 produces much better compression than gzip, but the CPU time
required probably defeats the purpose. Gzip, on the other hand, seems to
be fast enough for on-the-fly generation and has acceptable compression
ratios.
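
For anyone who wants to repeat the comparison, something along these
lines is all it takes (Python; "history_sample.txt" is just a placeholder
for a file of concatenated revisions, and the numbers will of course vary
with the input):

import bz2
import gzip
import time

def measure(name, compress, data):
    start = time.time()
    out = compress(data)
    print("%s: ratio %.3f, %.2f s" % (name, len(out) / len(data),
                                      time.time() - start))

# Placeholder input: the concatenated history of one page
data = open("history_sample.txt", "rb").read()
measure("gzip", gzip.compress, data)
measure("bzip2", bz2.compress, data)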
A major advantage diff methods have over LZW methods is that they could
potentially be used to generate a human-readable diff for comparing
revisions. Note, however, that most diff algorithms compare
line-by-line, whereas our human-readable diffs compare word-by-word. I'm
not sure whether an algorithm can be found that is useful to humans and
at the same time fast, with compact output.
The current word-by-word algorithm is not particularly robust --
occasionally a large, well-distributed change is made and the thread
trying to generate the diff hits its 30-second time limit.
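
To illustrate why word-level diffs are heavier than line-level ones, here
is a minimal word-by-word diff in Python using difflib (a sketch of the
idea only, not the algorithm MediaWiki actually uses):

import difflib

def word_diff(old, new):
    # Tokenise by word, so there are far more elements to align than
    # in a line-by-line diff; the alignment cost grows accordingly.
    a, b = old.split(), new.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

print(word_diff("the quick brown fox", "the slow brown fox jumps"))
# [('replace', 'quick', 'slow'), ('insert', '', 'jumps')]

A word-level alignment simply has far more elements to compare, which is
why a large, well-distributed change is exactly the case that hurts.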
It may be that a diff method can be found which competes with gzip in
terms of CPU time and ratio. The tests have yet to be done.
-- Tim Starling