On Jun 7, 2004, at 9:08 AM, Ulrich Fuchs wrote:
Furthermore, our problem shouldn't be the hardware any more. When we are not able to get a simple second database server to work for half a year now, it's not the hardware. The server wasn't brought down by too much traffic. Either we buy the wrong hardware, or it's badly configured, or we have the wrong software because it doesn't scale. Then we should think about software, not about hardware. The bottleneck is not the hardware. Sorry to say this, but there was enough money to get this thing running. Probably we spent it the wrong way.
Maybe the money was spent in the right way, but contingency planning failed us. Unlike most of my work in the private sector doing failure-recovery planning, I haven't seen a clear set of plans, goals, triggers, and timelines for bringing up cold spares, for deciding how much data loss is acceptable, for mirroring data for fastest recovery, for adding new hardware, etc.
If suda (literally) caught on fire, was there an understood, written plan for recovery? What was considered the acceptable downtime? Was the scenario tested? What software/hardware systems exist to handle rack fires? How about explosions at data centers? 300% surges in traffic over 24 hours? What is the planned, formal command chain for decision making during a crisis? Have all the decisions been made already, so the command chain is not a problem? Has a much more expensive 15 minutes of total downtime per catastrophic event been budgeted for, rather than a much cheaper 16 hours per event?
Maybe I'm wrong, and this multi-hour outage is exactly the planned-for and expected result of recent events, and once we started having problems, the policies and procedures kicked in. I kind of doubt it, though.
-Bop