On Wed, Jan 7, 2009 at 8:31 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> On Wed, Jan 7, 2009 at 4:43 PM, Robert Rohde <rarohde@gmail.com> wrote:
>> reduction in size (11.1 GB). Because it is still a text based format,
>> it stacks well with traditional file compressors (bz2: 89% reduction -
>> 1.24 GB; 7z: 91% reduction - 1.07 GB).
>
> Ruwiki dumps currently show:
>
> pages-meta-history.xml.7z 1.3 GB
>
> Not really all that much of a win post 7z-ing considering the current
> performance numbers you mentioned. (No doubt your code could be made
> faster... but at the same time 7z is not the state of the art in raw
> compression ratio.)
>
> Not that your format wouldn't have many uses... but it doesn't appear
> to offer significant gains for bulk transport. (In the future it would
> be helpful if you cited the current compressed size when comparing new
> compressed sizes.)
Yes, you are right about that. For bulk transport and storage it is
not a big improvement.
However, to work with ruwiki, for example, one generally needs to
decompress it to the full 170 GB. To work with enwiki's full revision
history, if such a dump is ever to exist again, would probably
decompress to ~2 TB. 7z and bz2 are poor formats for extracting only
portions of a dump, since few tools can pull out a piece without first
reinflating the whole file. Hence, one advantage I see in my format is
a dump that is still <10% of the full inflated size while still
allowing selected articles or selected revisions to be parsed out in a
straightforward manner.
-Robert Rohde