[Labs-l] Outage of labs in progress (resolved)

Marc A. Pelletier marc at uberbox.org
Fri Nov 7 03:15:49 UTC 2014


On 11/06/2014 05:13 PM, Pine W wrote:
> It will be interesting to see the post-action report and recommendations
> for prevention, if possible.

There is, in the end, very little that can be done to prevent freak
failures of this sort; they are thankfully rare but basically impossible
to predict.

The disk shelves have a lot of redundancy, but the two channels can be
used either to multipath to a single server, or to wire up two distinct
servers; we chose the latter because servers - as a whole - have a lot
more moving parts and a much shorter MTBF.  This makes us more
vulnerable to the rarer failure of the communication path, and much less
vulnerable to the server /itself/ having a failure of some sort.
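For the curious, here is a rough back-of-envelope sketch of that
trade-off in Python.  The failure probabilities below are made up for
illustration (they are not measurements of our hardware), but the shape
of the comparison holds whenever cables fail much less often than
servers:

# Back-of-envelope comparison of the two shelf-wiring options.
# The yearly failure probabilities are illustrative assumptions only.

p_server = 0.05   # assumed chance a server fails in a year (many moving parts)
p_cable  = 0.001  # assumed chance a cable/path fails in a year (rare once burned in)

# Option 1: both channels multipathed to a single server.
# A lone cable failure is tolerated, but the one server is a
# single point of failure.
p_outage_multipath = p_server + (1 - p_server) * p_cable**2

# Option 2: one channel to each of two distinct servers (what we run).
# Losing a server is tolerated (the other server still reaches the
# shelf), but losing the active path's cable causes an outage until
# the fault is found and fixed.
p_outage_two_servers = p_cable + (1 - p_cable) * p_server**2

print(f"multipath, single server  : {p_outage_multipath:.4%}")
print(f"two servers, one path each: {p_outage_two_servers:.4%}")

With numbers anything like these, the two-server wiring loses less
often overall; this week we simply landed on the rare side of that bet.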

This time, we were just extremely unlucky.  Cabling rarely fails once
it has worked at all, and the chances that a cable would suddenly stop
working after a year of use are ridiculously low.  This is why it took
quite a bit of time to even /locate/ the fault: we tried pretty much
everything /else/ first, given how improbable a cable fault is.  The
actual fix took less than 15 minutes all told; the roughly three hours
prior were spent trying to find the fault everywhere else first.

I'm not sure there's anything we could have done differently, or that we
should do differently in the future.  We were able to diagnose the
problem at all because we had pretty much all the hardware in duplicate
at the DC, and had we not isolated the fault we could still have fired
up the backup server (once we had eliminated the shelves themselves as
being faulty).

The only thing we're missing right now is a spare disk enclosure; had a
shelf failed, we would have been stuck waiting for a replacement from
the vendor rather than being able to simply swap hardware on the spot.
That's an issue that I will raise at the next operations meeting.

-- Marc



