Ashar Voultoiz wrote:
On Mon, 09 Feb 2004 09:19:04 +0000, Timwi wrote:
Evan Prodromou wrote:
Some (most?) version control systems use a reverse
diff storage
system to keep storage requirements down to a minimum, at the expense
of retrieval time for old versions.
Do we really have to do that? I always thought storage space is
incredibly cheap (at least that's what LiveJournal always say ;-) ). I'm
finding it somewhat irritating that you want to slow down Wikipedia even
more, just to save some disk space.
Timwi
Hello,
Maybe the data will be accessed faster if the database takes less space? I
still can't believe that the MediaWiki software stores the whole article
each time a modification is made. When the interwiki bot modifies something
like 10,000 pages it generates a lot of almost useless data :( Same for
edit wars that eventually get reverted.
It's true that it's cheap to store 30 GB, but it's expensive to move it
around. The advantages of history compression come not only from reduced
seek rate, but also from an improved ability to cache data in RAM.
The method of history compression I'm leaning towards is to concatenate
a number of adjacent revisions together (say 10) and then compress them
using an LZW-like algorithm. The compression algorithm will take
advantage of the similarities between adjacent revisions to produce a
very high compression ratio. Unlike diffs, this kind of compression
works very well with edit wars. In an edit war that repeatedly reverts
between two revisions, a diff must store the changes from every edit,
whereas, loosely speaking, an LZW algorithm need only store each revision
once.
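
Roughly what I have in mind, sketched in Python with zlib's deflate
standing in for the compressor (illustration only, not the eventual
implementation; the separator and the group size of ten are arbitrary):

import zlib

def compress_group(revisions):
    # Concatenate ~10 adjacent revisions and deflate the lot.
    # The compressor encodes repeated text very cheaply, so an edit
    # war costs little more than one copy of each distinct version.
    blob = "\x00".join(revisions).encode("utf-8")
    return zlib.compress(blob)

def decompress_group(data):
    return zlib.decompress(data).decode("utf-8").split("\x00")

# Toy edit war: two versions alternating, ten revisions in total
a = "".join("Paragraph %d: the quick brown fox jumps over the lazy dog.\n" % i
            for i in range(200))
b = a.replace("lazy", "sleeping")
history = [a, b] * 5

raw = sum(len(r) for r in history)
packed = len(compress_group(history))
print(raw, packed)  # packed is little more than one compressed copy of a and b

A diff-based store would keep a delta for every one of the ten edits here;
the compressor effectively pays for the two distinct versions only.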
My experiments in history compression can be found at:
http://meta.wikipedia.org/wiki/History_compression
Bzip2 produces much better compression than gzip, but the CPU time
required probably defeats the purpose. Gzip, on the other hand, seems to
be fast enough for on-the-fly generation and has acceptable compression
ratios.
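
For anyone who wants to repeat the comparison, something along these
lines is all it takes (Python; "history_sample.txt" is just a placeholder
for a file of concatenated revisions, and the numbers will of course vary
with the input):

import bz2
import gzip
import time

def measure(name, compress, data):
    start = time.time()
    out = compress(data)
    print("%s: ratio %.3f, %.2f s" % (name, len(out) / len(data),
                                      time.time() - start))

# Placeholder input: the concatenated history of one page
data = open("history_sample.txt", "rb").read()
measure("gzip", gzip.compress, data)
measure("bzip2", bz2.compress, data)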
A major advantage diff methods have over LZW methods is that they could
potentially be used to generate a human-readable diff for comparing
revisions. Note, however, that most diff algorithms compare
line-by-line, whereas our human-readable diffs compare word-by-word. I'm
not sure whether an algorithm can be found that is useful to humans and
at the same time fast, with compact output.
The current word-by-word algorithm is not particularly robust --
occasionally a large, well-distributed change is made and the thread
trying to generate the diff hits its 30-second time limit.
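
To illustrate why word-level diffs are heavier than line-level ones, here
is a minimal word-by-word diff in Python using difflib (a sketch of the
idea only, not the algorithm MediaWiki actually uses):

import difflib

def word_diff(old, new):
    # Tokenise by word, so there are far more elements to align than
    # in a line-by-line diff; the alignment cost grows accordingly.
    a, b = old.split(), new.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

print(word_diff("the quick brown fox", "the slow brown fox jumps"))
# [('replace', 'quick', 'slow'), ('insert', '', 'jumps')]

A word-level alignment simply has far more elements to compare, which is
why a large, well-distributed change is exactly the case that hurts.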
It may be that a diff method can be found which competes with gzip in
terms of CPU time and ratio. The tests have yet to be done.
-- Tim Starling