[Xmldatadumps-admin-l] FYI: comparison between enwiki-20100130-pages-meta-history.xml.7z and enwiki-20100312-pages-meta-history.xml.7z

Dmitry Chichkov dchichkov at gmail.com
Fri May 14 02:17:54 UTC 2010


Hi Tomasz,

I did some comparisons between -20100312- [31.9 GB] and -20100130- [15.8 GB]
archives.

0) -20100312- [31.9 GB] archive contains the declared number of revisions
313797035.

    -20100130- [15.8 GB] archive contains only 184777888 revisions.
Last pages/revisions were:
R 184777820 ETA 75.6 : 5137501 53106624 Cote-des-Neiges (Montreal Metro)
R 184777822 ETA 75.6 : 5137502 53106677 Duchcov
R 184777866 ETA 75.6 : 5137504 53106706 Cote-Des-Neiges (Montreal Metro)
R 184777867 ETA 75.6 : 5137506 53106711 Lynn Haney
R 184777888 ETA 75.6 : 5137507 9882553 Wikipedia:Administrators' noticeboard/Incidents
The xml stream seems to be broken at that point. SyntaxError: no element
found: line 36473988846, column 522


1) For many pages in the archive -20100312- [31.9 GB], revisions between
2005-01-14T and 2005-05-14 have an empty text field.
New archive -20100130- [15.8 GB] doesn't seem to have that problem. I
couldn't identify any revisions with missing text in the [15.8 GB] (aside
from blanked pages).

Some statistics on empty text revisions:
[31.9 GB] Revisions 313797035. Empty Revisions 1524837.
[15.8 GB] Revisions 184986173. Empty Revisions 370982
[31.9 GB] Revisions 185000000. Empty Revisions 1158890. (same position in
the archive)
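
For anyone who wants to reproduce counts like these, here is a minimal sketch of a streaming counter. It is not the script linked in the P.S.; it uses only the standard library, and the names `local` and `count_empty_revisions` are made up for illustration. It assumes the usual `<page>`/`<revision>`/`<text>` layout of the dump and treats a parse error at the end of the stream as a sign of truncation.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace: '{ns}revision' -> 'revision'."""
    return tag.rsplit('}', 1)[-1]


def count_empty_revisions(stream, limit=None):
    """Return (revisions, empty_revisions, truncated) for an XML dump stream."""
    total = empty = 0
    truncated = False
    root = None
    try:
        for event, elem in ET.iterparse(stream, events=('start', 'end')):
            if event == 'start':
                if root is None:
                    root = elem  # remember the root so finished pages can be dropped
                continue
            name = local(elem.tag)
            if name == 'revision':
                total += 1
                text = next((c.text for c in elem if local(c.tag) == 'text'), None)
                if not (text or '').strip():
                    empty += 1  # also counts legitimately blanked pages
                elem.clear()
                if limit is not None and total >= limit:
                    break
            elif name == 'page':
                root.clear()  # release the finished page to keep memory bounded
    except ET.ParseError:
        # A cut-off archive ends mid-document ("no element found").
        truncated = True
    return total, empty, truncated


if __name__ == '__main__':
    demo = (b'<mediawiki><page><revision><id>1</id><text>abc</text></revision>'
            b'<revision><id>2</id><text></text></revision></page></mediawiki>')
    print(count_empty_revisions(io.BytesIO(demo)))
```

Passing `limit=185000000` would correspond to the "same position in the archive" comparison above. Note that the counter cannot tell a blanked page from a revision whose text went missing.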

2) I've analyzed the first 500000 revisions (archive enumeration) and could
not find any revisions in the [31.9 GB] missing in the [15.8 GB] archive.
3) In the first 500000 revisions, texts seem to match exactly (except for
missing texts - see 1.).
4) In the first 500000 revisions, comments seem to match exactly.
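
The checks in 2) to 4) can be sketched roughly as follows. This is an illustrative reconstruction, not the code actually used: `iter_revisions` and `compare_prefix` are hypothetical names, and it assumes both archives enumerate revisions in the same order, so that the first N revisions of one can be looked up among the first N of the other. Texts are compared by SHA-1 digest to keep the index small, and a pair is skipped when either text is empty (the case covered in 1).

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace: '{ns}revision' -> 'revision'."""
    return tag.rsplit('}', 1)[-1]


def iter_revisions(stream):
    """Yield (revision_id, comment, text) in archive order."""
    for _event, elem in ET.iterparse(stream, events=('end',)):
        name = local(elem.tag)
        if name == 'revision':
            # Direct children only, so 'id' is the revision id, not the contributor id.
            fields = {local(c.tag): (c.text or '') for c in elem}
            yield fields.get('id', ''), fields.get('comment', ''), fields.get('text', '')
            elem.clear()
        elif name == 'page':
            elem.clear()


def digest(value):
    return hashlib.sha1(value.encode('utf-8')).hexdigest()


def compare_prefix(reference_stream, other_stream, limit):
    """Check the first `limit` revisions of the reference against the other dump."""
    index = {}
    for n, (rev_id, comment, text) in enumerate(iter_revisions(other_stream)):
        if n >= limit:
            break
        index[rev_id] = (digest(comment), digest(text), bool(text))

    report = {'missing': 0, 'text_mismatch': 0, 'comment_mismatch': 0}
    for n, (rev_id, comment, text) in enumerate(iter_revisions(reference_stream)):
        if n >= limit:
            break
        if rev_id not in index:
            report['missing'] += 1
            continue
        other_comment, other_text, other_has_text = index[rev_id]
        if digest(comment) != other_comment:
            report['comment_mismatch'] += 1
        # Only compare texts when both sides actually have one.
        if text and other_has_text and digest(text) != other_text:
            report['text_mismatch'] += 1
    return report
```

With `limit=500000` the index holds one small tuple per revision, so it should fit comfortably in memory; for a full-archive comparison a different approach (for example a merge over two sorted streams) would probably be needed.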

-- Regards, Dmitry


P.S.
After I patched pywikipedia.xmlparser to include .7z support and fixed
memory leaks, it seems to work fine with en.wiki archives. You can
actually parse 5TB of text in Python :)
It only takes ~36 hrs :) Here is a code snippet that prints revisions with
empty texts:
http://wrdese.googlecode.com/svn/trunk/b/verify-wiki-dump-print-empty.py
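
One possible way to get a .7z archive into a streaming parser without unpacking it to disk is to let the `7z` command-line tool decompress to stdout and read from the pipe. The sketch below is only an assumption about how this could be done and is not the pywikipedia patch mentioned above; it assumes a `7z` executable is on the PATH, and `sevenzip_command` and `open_7z_stream` are hypothetical helper names.

```python
import subprocess


def sevenzip_command(archive_path, exe='7z'):
    """Build the command line: 'e' extracts, '-so' writes the data to stdout."""
    return [exe, 'e', '-so', archive_path]


def open_7z_stream(archive_path, exe='7z'):
    """Start the decompressor and return the process; read XML from proc.stdout."""
    return subprocess.Popen(
        sevenzip_command(archive_path, exe),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
```

The returned `proc.stdout` is a binary file object, so it can be handed directly to `xml.etree.ElementTree.iterparse`; call `proc.wait()` afterwards and check the return code to see whether the decompressor itself reported an error.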