Brion Vibber wrote 2012-11-21 23:20:
While generating a full dump, we're holding the database connection open... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of data since when we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! You just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching the actual text, which is immutable (thus no worries about consistency in the second pass).
We definitely use this system for Wikimedia's data dumps!
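The two-pass scheme described above can be sketched in a few lines. This is a minimal illustration only, using an in-memory SQLite table and JSON-lines files as stand-ins for the real MediaWiki schema and XML stub/full dump formats; all function and column names here are invented for the sketch, not taken from the actual dump scripts.

```python
import json
import os
import sqlite3


def make_db():
    # Stand-in for the wiki database: each revision's text is immutable.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page TEXT, text TEXT);
        INSERT INTO revision VALUES (1, 'Main_Page', 'hello'), (2, 'Sandbox', 'world');
    """)
    return db


def pass_one(db, stub_path):
    # Pass 1: snapshot only the lightweight metadata (rev_id + page title)
    # in a single quick read, then the DB connection can be released.
    rows = db.execute("SELECT rev_id, page FROM revision ORDER BY rev_id").fetchall()
    with open(stub_path, "w") as f:
        for rev_id, page in rows:
            f.write(json.dumps({"rev_id": rev_id, "page": page}) + "\n")


def pass_two(stub_path, out_path):
    # Pass 2: fetch the immutable text for every stubbed revision.
    # Restartable: revisions already written to out_path are skipped,
    # so a crash mid-run loses no completed work.
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {json.loads(line)["rev_id"] for line in f}
    db = make_db()  # fresh connection; in reality, reconnect after any failure
    with open(out_path, "a") as out:
        with open(stub_path) as stubs:
            for line in stubs:
                stub = json.loads(line)
                if stub["rev_id"] in done:
                    continue
                (text,) = db.execute(
                    "SELECT text FROM revision WHERE rev_id = ?",
                    (stub["rev_id"],),
                ).fetchone()
                stub["text"] = text
                out.write(json.dumps(stub) + "\n")
```

Because the text is immutable, pass two needs no long-lived transaction: re-running it after a crash simply resumes where the output file left off.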
Oh, thanks, now I understand!
But the revisions are also immutable - isn't it simpler just to select the maximum revision ID at the beginning of the dump and discard any newer page and image revisions during dump generation?
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)