Brion Vibber wrote 2012-11-21 23:20:
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of data since when we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching the actual text, which is immutable (thus no worries about consistency in the second pass).
We definitely use this system for Wikimedia's data dumps!
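The two-pass scheme described above can be sketched in a few lines. This is a minimal illustration only, using an in-memory SQLite table and JSON-lines files as stand-ins for the real MediaWiki schema and XML stub/full dump formats; all function and column names here are invented for the sketch, not taken from the actual dump scripts.

```python
import json
import os
import sqlite3


def make_db():
    # Stand-in for the wiki database: each revision's text is immutable.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page TEXT, text TEXT);
        INSERT INTO revision VALUES (1, 'Main_Page', 'hello'), (2, 'Sandbox', 'world');
    """)
    return db


def pass_one(db, stub_path):
    # Pass 1: snapshot only the lightweight metadata (rev_id + page title)
    # in a single quick read, then the DB connection can be released.
    rows = db.execute("SELECT rev_id, page FROM revision ORDER BY rev_id").fetchall()
    with open(stub_path, "w") as f:
        for rev_id, page in rows:
            f.write(json.dumps({"rev_id": rev_id, "page": page}) + "\n")


def pass_two(stub_path, out_path):
    # Pass 2: fetch the immutable text for every stubbed revision.
    # Restartable: revisions already written to out_path are skipped,
    # so a crash mid-run loses no completed work.
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {json.loads(line)["rev_id"] for line in f}
    db = make_db()  # fresh connection; in reality, reconnect after any failure
    with open(out_path, "a") as out:
        with open(stub_path) as stubs:
            for line in stubs:
                stub = json.loads(line)
                if stub["rev_id"] in done:
                    continue
                (text,) = db.execute(
                    "SELECT text FROM revision WHERE rev_id = ?",
                    (stub["rev_id"],),
                ).fetchone()
                stub["text"] = text
                out.write(json.dumps(stub) + "\n")
```

Because the text is immutable, pass two needs no long-lived transaction: re-running it after a crash simply resumes where the output file left off.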
Oh, thanks, now I understand!
But the revisions are also immutable - isn't it simpler just to select the maximum revision ID at the beginning of the dump and discard any newer page and image revisions during dump generation?
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)