Evan Martin wrote:
> On 5/31/06, Brion Vibber <brion(a)pobox.com> wrote:
>> Between other things I've been working on a distributed bzip2
>> compression tool which could help speed up generation of data dumps.
>
> Alternatively, have you considered generating deltas? (Sorry if this
> has been brought up before...)
Many times, but it's not as simple as it sounds. The generic
delta-generation tools we've tried in the past just choke on our files; note
that the full-history dump of English Wikipedia -- the one we're most concerned
about having archival copies of available -- is over 350 gigabytes uncompressed.
(Clean XML-wrapped text with no scary internal compression or diffing, and a
well-known standard compression format on the outside, is a simple and
relatively future-proof format for third-party textual analysis, reuse, and
long-term archiving.)
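That outside-only compression is what lets a third party stream the dump without ever staging the full uncompressed XML on disk. As a rough illustration, here is a minimal Python sketch; the element names follow the MediaWiki export format, but the helper itself is hypothetical:

```python
import bz2
import xml.etree.ElementTree as ET

def iter_page_titles(path):
    """Stream page titles out of a .xml.bz2 dump, one at a time,
    without decompressing the whole file to disk."""
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f):
            # Real dumps carry an XML namespace; strip it before comparing.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "title":
                yield elem.text
            elif tag == "page":
                elem.clear()  # drop finished pages to keep memory flat
```

The same pattern extends to revision text: `iterparse` keeps only the current element subtree in memory, so the 350 GB figure never has to fit anywhere.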
Something application-specific might be possible.
> It seems to me there are two main consumption cases of the wikipedia data:
> - one-off copies ("most recent" doesn't really matter)
> - mirrors (will want to continually update)
>
> If you did a full snapshot once a month, and then daily/weekly deltas
> on top of that, you could maybe save yourself both processing time and
> external bandwidth.
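At page granularity, the quoted snapshot-plus-delta scheme might look like this hypothetical sketch, where a delta carries only pages whose text changed plus a deletion list (nothing like this ships with MediaWiki; the helpers are illustrative):

```python
def make_delta(old, new):
    """old/new map page title -> text; return (changed pages, deleted titles)."""
    changed = {t: txt for t, txt in new.items() if old.get(t) != txt}
    deleted = [t for t in old if t not in new]
    return changed, deleted

def apply_delta(old, delta):
    """Rebuild the newer snapshot from an older one plus a delta."""
    changed, deleted = delta
    merged = dict(old)
    merged.update(changed)
    for t in deleted:
        merged.pop(t, None)
    return merged
```

A mirror would fetch the monthly full snapshot once, then apply each daily or weekly delta in order; a one-off consumer would ignore the deltas entirely.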
Even if I only did full snapshots a quarter as often, I'd still want them to
take two days instead of ten. :)
-- brion vibber (brion @ pobox.com)