Dan Collins wrote:
> Apparently, for whatever reason, the master database server for enwiki
> got overloaded. This followed a few updates, which may have caused the
> problem (I don't think they're sure yet). What actually happened was that
> the database server had a large number of queries stuck in the
> 'statistics' state, leading to overload, leading to the wiki going down.
> Enwiki was set to read-only, and the Almighty Tim, Patron Saint of
> Master Databases, arrived on the scene to heroically run the
> master-database switch script. The S1 (enwiki) master database was
> changed from db14 to db16, and db14 was removed from the slave
> rotation. From what I understand, db14 will need a swift kick to the
> power button to make it all jolly and happy again.

Ah, excellent, you did my summary post for me. :)

Having lots of threads in the "statistics" state seems to be MySQL's way
of saying "I've fallen and I can't get up". It's unclear exactly what set
it off, but basically nothing works well until you restart it.

At 52 minutes from the start of the event, this took us a bit longer to
resolve than I'd like; we had to percolate through a couple of levels of
alert calls. (Sorry to wake you up early, Tim!)

A similar event in future should be fixable within a few minutes, thanks
to Tim's work on making the master-switch system more foolproof. We're
also fixing up our internal documentation so that all our site ops will
know how to run the database master switch script next time!

Only en.wikipedia.org was affected, apart from a couple of minutes when
we threw the whole site into read-only mode while figuring out what was
going on.

-- brion