[Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Kevin Webb kpwebb at gmail.com
Tue Mar 16 20:10:53 UTC 2010


I just managed to finish decompression. That took about 54 hours on an
EC2 2.5x unit CPU. The final data size is 5469GB.
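
For a rough sense of scale, the figures above work out as follows. This is
only back-of-the-envelope arithmetic, assuming 1GB = 1000MB and taking the
"280GB+" compressed size from the announcement as a lower bound; the variable
names are just for illustration.

```python
hours = 54                 # wall-clock decompression time reported above
uncompressed_gb = 5469     # final data size reported above
compressed_gb = 280        # "280GB+" in the announcement, so a lower bound

gb_per_hour = uncompressed_gb / hours
mb_per_second = uncompressed_gb * 1000 / (hours * 3600)
ratio = uncompressed_gb / compressed_gb  # an upper bound, since 280 is a floor

print(f"{gb_per_hour:.0f} GB/hour, {mb_per_second:.1f} MB/s, ratio ~{ratio:.1f}:1")
# prints: 101 GB/hour, 28.1 MB/s, ratio ~19.5:1
```

That is output throughput for a single bzip2 stream on one core, which is
the motivation for the parallel-decompression request further down.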

As the process just finished, I haven't been able to check the
integrity of the XML yet. However, the bzip2 stream itself appears to
be good.
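
For anyone who wants to check both at once without writing 5TB to disk, here
is a minimal sketch using only the Python standard library. It streams the
compressed file through a SAX parser, so a damaged bzip2 stream or malformed
XML raises an exception. The `check_dump` function and `PageCounter` class
are hypothetical names, and counting `<page>` elements is just an assumed
sanity metric; this checks well-formedness only, not validity against the
export schema.

```python
import bz2
import xml.sax


class PageCounter(xml.sax.ContentHandler):
    """Counts <page> elements while the parser checks well-formedness."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def startElement(self, name, attrs):
        if name == "page":
            self.pages += 1


def check_dump(path):
    """Stream-decompress a .bz2 dump and parse it as XML.

    Raises OSError or EOFError if the bzip2 data is corrupt or truncated,
    and xml.sax.SAXParseException if the XML is not well-formed.
    Returns the number of <page> elements seen.
    """
    handler = PageCounter()
    with bz2.open(path, "rb") as stream:
        # The parser pulls data from the stream in chunks, so memory use
        # stays small regardless of the size of the dump.
        xml.sax.parse(stream, handler)
    return handler.pages
```

Calling `check_dump("enwiki-20100130-pages-meta-history.xml.bz2")` would
still take roughly as long as a full decompression, since every byte has to
pass through the single-threaded bzip2 decoder.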

As was mentioned previously, it would be great if you could compress
future archives using pbzip2 to allow for parallel decompression. As I
understand it, pbzip2 files are backward compatible with all
existing bzip2 utilities.
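
The compatibility claim rests on the file layout: a parallel compressor of
this kind writes independently compressed blocks back to back as separate
bzip2 streams, and a standard decoder reads through them in order. The sketch
below only imitates that layout with the Python standard library; it is not
pbzip2 itself, it runs sequentially, and `compress_in_blocks` and the block
size are made up for the example.

```python
import bz2


def compress_in_blocks(data, block_size):
    """Compress each block as its own bzip2 stream and concatenate them.

    This mimics the multi-stream layout only; the work is done serially.
    """
    return b"".join(
        bz2.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    )


payload = b"<page>example revision text</page>\n" * 10000
multi_stream = compress_in_blocks(payload, block_size=100_000)

# The standard-library decoder reads through all concatenated streams.
assert bz2.decompress(multi_stream) == payload
```

A reader that stops after the first stream would silently return only part
of the data, so it is worth running a test like this against whatever
tooling consumes the dumps.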

Thanks again for all your work on this!
Kevin


On Tue, Mar 16, 2010 at 4:05 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:
> Tomasz Finc wrote:
>> New full history en wiki snapshot is hot off the presses!
>>
>> It's currently being checksummed which will take a while for 280GB+ of
>> compressed data but for those brave souls willing to test please grab it
>> from
>>
>> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>>
>> and give us feedback about its quality. This run took just over a month
>> and gained a huge speedup after Tim's work on re-compressing ES. If we
>> see no hiccups with this data snapshot, I'll start mirroring it to other
>> locations (internet archive, amazon public data sets, etc).
>>
>> For those not familiar, the last successful run that we've seen of this
>> data goes all the way back to 2008-10-03. That's over 1.5 years of
>> people waiting to get access to these data bits.
>>
>> I'm excited to say that we seem to have it :)
>
> So now that we've had it for a couple of days .. can I get a status
> report from someone about its quality?
>
> Even if you had no issues, please let us know so that we can start mirroring.
>
> --tomasz
>
> _______________________________________________
> Xmldatadumps-admin-l mailing list
> Xmldatadumps-admin-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>


