Re: [Wikitech-l] History compression (was Re: Reverse diff version control)

12 Feb 2004

Tim Starling wrote:
...
  The method of history compression I'm leaning
towards is to concatenate 
 a number of adjacent revisions together (say 10) and then compress them 
 using an LZW-like algorithm. The compression algorithm will take 
 advantage of the similarities between adjacent revisions to produce a 
 very high rate of compression. Unlike diffs, this kind of compression 
 works very well with edit wars. In an edit war repeatedly reverting 
 between two revisions, a diff must store the changes from every edit, 
 whereas loosely speaking an LZW algorithm must only store each revision 
 once. 
Hm. The problem with the "diff" you are describing is easily solved by `trying
diffs not only with the previous version, but, if that diff proves to be large, also
trying to make diffs with a few older versions. The smallest diff could then be stored in
combination with a pointer.

We should consider how often the compression is performed, and how often we need to
decompress. e.g. for a disk backup operation commonly compression algorithms are used that
are very light in compression, but heavy on decompression: one does not normally use the
backup anyway, only in disaster recovery.

For us, however, I think that the average difference is viewed more than once, so the
decompression is called a little but not a lot more often than the compression. Like in
RCS, it is much more likely that recent versions are requested than ancient versions, so a
cascading difference would work. I would hesitate to compress multiple versions into one
blob, because it would make the database tables a lot more complicated (how to refer to a
single old version?) and both compression and decompression quite expensive (every edit
will mean a decompress of some recent versions, appending of new data, and
recompression).

If the database structure has one entry per old version, it would even be possible to have
a daemon perform the compression of the old tables asynchronously. New entries in the old
table would be stored uncompressed, but, whenever the server is quiet, a daemon will hop
around in the tables replacing entries in the old table with differences or compressed
versions, whichever is more appropriate for the particular case.

Rob

-- 
Rob W.W. Hooft  ||  rob(a)hooft.net  ||  http://www.hooft.net/people/rob/

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] History compression (was Re: Reverse diff version control)