The site was offline recently for about 20-30 minutes, with some additional
downtime of uploads only, while our upload fileserver amane was broken.
Quick summary of affairs before I run off to dinner:
* amane's mount of zwinger:/home had broken in some way, such that accesses
were hanging
** amane's syslog shows a large number of RPC failures for zwinger's NFS for the
last few hours
* user ssh logins to amane failed due to the broken /home
* lighttpd ran out of connections, with lots of stuck php processes, likely
because thumbnail rendering used files on /home
* amane's nfs server still worked, so the site ran internally
* root ssh login worked, and I was able to kill lighty and remount /home
* however, shortly after I tried restarting lighty, amane died more thoroughly:
I was unable to continue the ssh session (stuck) and new ssh sessions didn't get
past opening port 22
* at this point amane's nfs died too
* can't find anything in syslog relating to that
* from this point the whole site was broken
* there's a donation link on the error page, which points to a wiki page so it's
also broken
* tried to change the error page to link to the separate fundraising server, but
the update didn't quite take before we finished
* we had the colo reboot the machine
* they had to call us back for more info because the machine was not properly
labeled
* amane is not on the serial console server!
* after rebooting, things settled down after a few minutes
* site seems ok at the moment
Recommendations for the future:
* make sure all servers are clearly labeled
* important machines *must* be on the serial console when installed
* the site should still work if images are offline. check code that works with
image files to make it fail more gracefully
* check NFS mount settings, try to set them up in a more failure-friendly way
and of course
* try to get a backup image server online
* have a way to switch to it automatically
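
A rough sketch of what "fail more gracefully" and "switch automatically" could
look like for image reads. This is illustrative Python, not the site's actual
PHP code, and every name in it (fetch_image, read_with_timeout, the placeholder
bytes, the 2-second limit) is made up for the example. The idea: bound how long
a request waits on a file, try a backup location if the primary doesn't answer,
and serve a placeholder instead of tying up a web server connection.

```python
import threading


def _read_bytes(path):
    """Plain blocking read; this is the call that can hang on a dead NFS mount."""
    with open(path, "rb") as f:
        return f.read()


def read_with_timeout(path, timeout_s=2.0, read=_read_bytes):
    """Return the file's bytes, or None if the read fails or takes too long."""
    result = {}

    def worker():
        try:
            result["data"] = read(path)
        except OSError:
            pass  # missing file, I/O error, stale handle: treat as unavailable

    # Daemon thread so a read that never returns cannot block process exit.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)  # stop waiting after timeout_s; the stuck read stays in its thread
    return result.get("data")


def fetch_image(paths, placeholder=b"", timeout_s=2.0, read=_read_bytes):
    """Try each candidate location in order (e.g. primary, then backup server)."""
    for path in paths:
        data = read_with_timeout(path, timeout_s, read)
        if data is not None:
            return data
    return placeholder  # images are offline: degrade instead of hanging
```

Note the limits of this approach: the caller gets its answer back quickly, but a
read stuck on a hard-mounted NFS share still leaves a blocked thread behind, so
it only buys time. It complements sane mount settings and a real backup server,
it doesn't replace them.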
-- brion vibber (brion @ pobox.com)