Austin Hair wrote:
However, we are reading a few bits off of zwinger's NFS (some block lists,
etc., some lock files) and sometimes writing (logs). Insofar as those are
currently used, they should either be migrated to a more survivable
situation or be able to fail gracefully. NFS should be set up, if it
isn't already, in a way that will fail cleanly after a short timeout.
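To make "fail gracefully" concrete, here is a minimal sketch in Python (not the language of the affected apps, just an illustration) of reading a block list from NFS with a fallback to a local cached copy. The function name and both path parameters are hypothetical, and it assumes a soft-mounted read that times out surfaces to the program as an ordinary I/O error.

```python
def read_block_list(nfs_path, cache_path):
    """Return block-list lines, preferring NFS and falling back to a local cache.

    Both paths are hypothetical; the caller decides where the files live.
    """
    try:
        with open(nfs_path, encoding="utf-8") as f:
            data = f.read()
    except OSError:
        # Assumption: a timed-out read on a soft mount shows up here as an
        # OSError (typically EIO). Fall back to the last good local copy.
        try:
            with open(cache_path, encoding="utf-8") as f:
                return f.read().splitlines()
        except OSError:
            # No cache either: degrade to an empty list rather than abort.
            return []

    # NFS read succeeded: refresh the local cache on a best-effort basis.
    try:
        with open(cache_path, "w", encoding="utf-8") as f:
            f.write(data)
    except OSError:
        pass
    return data.splitlines()
```

Whether an empty list is an acceptable degraded state depends on what the list is used for; for a block list it means temporarily blocking nothing.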
Linux mount option "soft" will cause an I/O error to be returned after
a "major timeout," the definition of which varies. "intr" in
combination with "hard" will allow the program to respond to signals,
which is in most cases preferable to having an uninterruptible process
sitting there until reboot.
We mount NFS with soft and timeo=14 (timeo is in tenths of a second, so
1.4 seconds). I imagine retrans is at its default value of 3, so if I
understand the manual correctly, with the timeout doubling after each
retry (1.4 + 2.8 + 5.6), that gives a major timeout of 9.8 seconds. That
would be consistent with what we saw in the crash -- most apps don't seem
to abort when they get one of these timeouts; they just treat it as an
ordinary read error and continue
their execution. It's not surprising that everything locked up,
including root logins.
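
The 9.8-second figure can be checked with a few lines of Python. This is only a sketch of my reading of the manual: it assumes the timeout starts at timeo tenths of a second, doubles after each timeout up to a 60-second cap, and that a major timeout is reported after retrans timeouts. The function name and the cap parameter are my own, and other kernels or transports may back off differently.

```python
def major_timeout(timeo_deciseconds, retrans, cap_seconds=60.0):
    """Total seconds waited before a soft mount reports an I/O error.

    Assumes exponential backoff: each wait is double the previous one,
    capped at cap_seconds, and `retrans` waits occur in total.
    """
    wait = timeo_deciseconds / 10.0  # timeo is given in tenths of a second
    total = 0.0
    for _ in range(retrans):
        total += min(wait, cap_seconds)
        wait *= 2
    return total


# timeo=14, retrans=3: 1.4 + 2.8 + 5.6
print(round(major_timeout(14, 3), 1))  # 9.8
```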
What about using a detachable filesystem like Coda, or a spare NFS
server with automatic failover?
-- Tim Starling