[Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Jamie Morken jmorken at shaw.ca
Wed Mar 17 02:11:44 UTC 2010


Hi,



I think we should keep at least one version of a recent bz2 enwiki 
pages-meta-history file, because there are already some programs that use 
the bz2 format directly, and I don't know of any program that uses the 
7z format natively.



Here are some offline wiki readers that use the bz2 format:

bzreader:  http://code.google.com/p/bzreader/

mzReader:  http://homepage.ntlworld.com/bharat.vadera/MzReader/

wikitaxi:   http://www.wikitaxi.org/

(note that none of these programs are currently setup for viewing the 
pages-meta-history revision data or discussion pages)



If there is no pages-meta-history in bz2 format available (currently 
280GB for enwiki), then the 7z file will have to be converted to bz2, 
unless it's possible to interface directly with the 7z file efficiently. 
Since the 7z file decompresses to 5469GB, as Kevin showed, I think it 
would be hard for most people to decompress it, but the 280GB bz2 file 
is still a reasonable size and can be used without decompressing. So I 
think keeping at least a single recent bz2 file would be the way to go. 
The dewiki keeps about 6 of their pages-meta-history bz2 files (around 
75GB each = 450GB storage) http://download.wikimedia.org/dewiki/ so I 
think enwiki should be able to keep at least one, especially after all 
this time of not having any of these files for enwiki.
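To illustrate the "can be used without decompressing" point: Python's standard bz2 module reads the compressed file as a stream, so a tool can walk the whole 280GB dump while only ever holding one line in memory. A minimal sketch (the filename is hypothetical):

```python
import bz2

def first_title(path):
    """Stream a bzip2-compressed XML dump line by line and return the
    first <title> element found, without decompressing the whole file."""
    # bz2.open decompresses lazily as the file object is read, so disk
    # usage stays at the compressed size (~280GB for enwiki) throughout.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<title>" in line:
                return line.strip()
    return None
```

The same pattern is presumably what readers like bzreader build their indexes with: seekable access to bz2 blocks rather than a full decompression pass.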



Also, I wonder if it is possible to convert from 7z to bz2 without 
having to make the 5469GB file first. If that can be done, then having 
only 7z files would be fine, as the bz2 file could be created with a 
"normal" PC (i.e. one without a 6TB+ hard drive). This would be a good 
solution, but I'm not sure if it can be done. If it could, though, we 
might as well get rid of all the large wikis' bz2 pages-meta-history 
files!
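It should in fact be possible with a plain pipe: p7zip can extract to stdout, so the decompressed XML only ever exists in transit between the two processes. A sketch, assuming the p7zip and bzip2 packages are installed and using hypothetical filenames:

```shell
# "7z e -so" writes the extracted data to stdout; bzip2 recompresses it
# on the fly, so the ~5469GB of XML never has to exist on disk.
7z e -so pages-meta-history.xml.7z | bzip2 -9 > pages-meta-history.xml.bz2
```

The reverse direction works the same way (`bunzip2 -c dump.bz2 | 7z a -si dump.7z`), and swapping bzip2 for pbzip2 would parallelize the compression half of the pipe.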



cheers,

Jamie



----- Original Message -----
From: Tomasz Finc <tfinc at wikimedia.org>
Date: Tuesday, March 16, 2010 12:45 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Kevin Webb <kpwebb at gmail.com>
Cc: Wikimedia developers <wikitech-l at lists.wikimedia.org>, xmldatadumps-admin-l at lists.wikimedia.org, Xmldatadumps-l at lists.wikimedia.org

> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on
> > an EC2 2.5x unit CPU. The final data size is 5469GB.
> > 
> > As the process just finished I haven't been able to check the
> > integrity of the XML; however, the bzip stream itself appears to be
> > good.
> > 
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As
> > I understand it, the pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
> 
> Looks like the trade-off is slightly larger files due to pbzip2's
> algorithm for individual chunking. We'd have to change the
> buildFilters function in http://tinyurl.com/yjun6n5 and install the
> new binary. Ubuntu already has it in 8.04 LTS, making it easy.
> 
> Any takers for the change?
> 
> I'd also like to gauge everyone's opinion on moving away from the
> large file sizes of bz2 and going exclusively 7z. We'd save a huge
> amount of space doing it, at a slightly larger cost during
> compression. Decompression of 7z these days is wicked fast.
> 
> let me know
> 
> --tomasz
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Xmldatadumps-admin-l mailing list
> Xmldatadumps-admin-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
> 