On 5/31/06, Brion Vibber <brion(a)pobox.com> wrote:
> Between other things I've been working on a
> distributed bzip2 compression tool
> which could help speed up generation of data dumps.
Alternatively, have you considered generating deltas? (Sorry if this
has been brought up before...)
It seems to me there are two main consumption cases for the Wikipedia data:
- one-off copies ("most recent" doesn't really matter)
- mirrors (will want to continually update)
If you did a full snapshot once a month, and then daily/weekly deltas
on top of that, you could likely save both processing time and
external bandwidth.
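
To illustrate the idea (not how the dump system actually works): a delta
only records the runs that changed between two snapshots, and a mirror
reconstructs the new snapshot from its old copy plus the delta. A minimal
line-oriented sketch in Python, using difflib on toy "dump" lines; a real
deployment would presumably use a binary delta tool like xdelta instead:

```python
import difflib

def make_delta(old_lines, new_lines):
    # Keep only the non-equal opcodes plus their replacement text,
    # so the delta stays small when most of the dump is unchanged.
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            delta.append((tag, i1, i2, new_lines[j1:j2]))
    return delta

def apply_delta(old_lines, delta):
    # Rebuild the new snapshot from the old snapshot plus the delta.
    out = []
    pos = 0
    for tag, i1, i2, repl in delta:
        out.extend(old_lines[pos:i1])  # copy the unchanged run
        out.extend(repl)               # splice in replaced/inserted lines
        pos = i2
    out.extend(old_lines[pos:])        # copy the unchanged tail
    return out

# Toy snapshots: one revision changed, one page added.
old = ["page1 rev1\n", "page2 rev1\n", "page3 rev1\n"]
new = ["page1 rev1\n", "page2 rev2\n", "page3 rev1\n", "page4 rev1\n"]
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
```

The payoff is that a mirror only downloads the (small) delta each day or
week, and the server only diffs against the last snapshot instead of
recompressing the whole dump.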