Two minor changes in the process for the data dumps I started earlier today:
* The intermediate "stub" XML dumps are now available for download instead of
vanishing into a temporary directory. These contain all the article and revision
metadata but not the revision text.
* The .7z version of the full-history dump is now built after the .bz2 completes
instead of both running at the same time; this should make the .bz2 versions of
the big wikis available for download sooner, since they won't have to wait on
the slower 7-Zip compressor. (It's still using the slow single-threaded bzip2
for the moment, though.)
The stub dumps are the same format as the full XML dumps, with the exception
that the <text> element is empty. The element carries an id attribute (not
listed in the XML schema file, so don't enforce schema validation in your
parser) indicating the internal storage node which contains that revision's
text.
This node number isn't really useful unless you're on our servers, as the raw
storage tables aren't accessible from outside. But if you want to do statistics
over the rest of the metadata fields, it's going to be a lot faster to work
with these lighter-weight files than with the versions that have full text
embedded.
These are compressed with gzip for speed; the stub dump for English Wikipedia
full history runs about 2 gigabytes compressed.
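As a rough sketch of how you might process one of these stub files, here's a
streaming parse that counts revisions without building the whole tree in
memory. The sample XML is a guess at the layout based on the description
above (empty <text> with an id attribute); element names, the sample id
values, and the count_revisions helper are all illustrative assumptions, not
taken from an actual dump.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the stub layout described above: the <text>
# element is empty but carries an id attribute naming the internal storage
# node. Structure and values are assumed for illustration only.
SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <id>1001</id>
      <timestamp>2005-06-01T00:00:00Z</timestamp>
      <text id="42" />
    </revision>
    <revision>
      <id>1002</id>
      <timestamp>2005-06-02T00:00:00Z</timestamp>
      <text id="43" />
    </revision>
  </page>
</mediawiki>"""

def count_revisions(stream):
    """Stream through a stub dump counting <revision> elements.

    Clearing each element as we finish it keeps memory flat, which
    matters for a multi-gigabyte file.
    """
    count = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            count += 1
            elem.clear()
    return count

print(count_revisions(io.StringIO(SAMPLE)))  # prints 2
```

For a real stub file you'd feed iterparse a decompressing stream, e.g.
gzip.open(path, "rb"), instead of the in-memory sample.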
-- brion vibber (brion @ pobox.com)