[Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Kevin Webb kpwebb at gmail.com
Tue Mar 16 21:35:23 UTC 2010


Yeah, same here. I'm totally fine with replacing bzip2 with 7zip as the
primary format for the dumps. It seems to solve the space and speed
problems together.

I just did a quick benchmark on actual dump data and got roughly 7x
faster decompression with 7zip than with bzip2, both running on a
single core.
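
For anyone who wants to try a similar single-core comparison, here is a
rough sketch in Python. It is not the benchmark I ran: it uses the
standard library's lzma module (the LZMA algorithm behind 7z, but in the
.xz container, since the standard library cannot read .7z archives), and
make_sample() generates synthetic XML-like text as a hypothetical
stand-in for real dump data. All function names are made up for
illustration, and the timings will vary by machine and input.

```python
import bz2
import lzma
import time


def make_sample(n_pages=2000):
    """Build repetitive XML-like text as a stand-in for real dump data."""
    page = (
        "<page><title>Example {i}</title>"
        "<revision><id>{i}</id><text>Some wiki text, revision {i}.</text></revision>"
        "</page>\n"
    )
    return "".join(page.format(i=i) for i in range(n_pages)).encode("utf-8")


def time_decompress(decompress, blob, repeats=3):
    """Return (best wall-clock seconds, decompressed bytes) over several runs."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best, out


def compare(data):
    """Compress with bz2 and LZMA, then time single-threaded decompression."""
    results = {}
    for name, mod in (("bz2", bz2), ("lzma", lzma)):
        blob = mod.compress(data)
        seconds, out = time_decompress(mod.decompress, blob)
        assert out == data  # the round trip must be lossless
        results[name] = {"compressed_bytes": len(blob), "seconds": seconds}
    return results


if __name__ == "__main__":
    sample = make_sample()
    for name, r in compare(sample).items():
        print(f"{name}: {r['compressed_bytes']} bytes, "
              f"{r['seconds']:.4f}s to decompress")
```

To get numbers that mean anything, replace make_sample() with a slice
read from a real dump file; the synthetic text is far more repetitive
than actual page history.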

kpw



On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik <levmuchnik at gmail.com> wrote:
>
> I am entirely for 7z. In fact, once released, I'll be able to test the XML
> integrity right away - I process the data on the fly, without  unpacking it
> first.
>
>
> On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:
>>
>> Kevin Webb wrote:
>> > I just managed to finish decompression. That took about 54 hours on an
>> > EC2 2.5x unit CPU. The final data size is 5469GB.
>> >
>> > As the process just finished I haven't been able to check the
>> > integrity of the XML, however, the bzip stream itself appears to be
>> > good.
>> >
>> > As was mentioned previously, it would be great if you could compress
>> > future archives using pbzip2 to allow for parallel decompression. As I
>> > understand it, the pbzip2 files are backward compatible with all
>> > existing bzip2 utilities.
>>
>> Looks like the trade-off is slightly larger files due to pbzip2's
>> algorithm for individual chunking. We'd have to change the
>> buildFilters function in http://tinyurl.com/yjun6n5 and install the new
>> binary. Ubuntu already has it in 8.04 LTS, making it easy.
>>
>> Any takers for the change?
>>
>> I'd also like to gauge everyone's opinion on moving away from the large
>> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
>> space doing it at a slightly larger cost during compression.
>> Decompression of 7z these days is wicked fast.
>>
>> Let me know.
>>
>> --tomasz
>>
>> _______________________________________________
>> Xmldatadumps-admin-l mailing list
>> Xmldatadumps-admin-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>
>


