Tim Thorpe wrote:
What about sticking the entire cluster on private IPs and having a
load-balancing firewall appliance handle traffic flow, which would
randomly hit the update box?
There would be several points of failure with such a system.
To make a system really reliable, all single points of failure need to
be removed.
E.g.:
Building-specific hazards: power loss (UPSes sometimes fail), network
cables cut, fire, burglary, landlord repossession, hosting company
bankruptcy, malicious attack, human error, plane crash, etc.
Machine-specific hazards: any single machine failing and bringing
everything down, through either hardware failure or malicious attack.
Network-specific hazards: any single segment of the network failing and
bringing the system down, through hardware failure, human error or
malicious attack.
I believe a design philosophy where the system is immune from any single
element failing is both the most cost-effective and the most reliable.
Rather than invest heavily for reliability in mission-critical systems,
make no system mission critical. No system then needs to have
mission-critical investment. The overall system will then be cheaper and
more reliable.
To put it another way:
All systems will fail. The probability of a single reliable, costly unit
failing is still fairly high. The probability of many fairly reliable
cheap units, with no common point of failure, all breaking down
simultaneously is much lower than the probability of that one costly
unit failing.
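As a rough illustration of that comparison, here is a minimal Python
sketch. It assumes the units fail independently (which is what "no
common point of failure" is meant to buy us), and the failure
probabilities are made-up numbers chosen only for the example, not
measurements. The function name prob_all_fail is likewise hypothetical.

```python
def prob_all_fail(p_unit: float, n_units: int) -> float:
    """Probability that all n_units are down at the same time.

    Assumes each unit fails independently with probability p_unit,
    i.e. there is no common point of failure between them.
    """
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return p_unit ** n_units


# Hypothetical figures, for illustration only:
# one costly unit that is down 0.1% of the time ...
costly = prob_all_fail(0.001, 1)
# ... versus four cheap units that are each down 5% of the time.
cheap = prob_all_fail(0.05, 4)

print(f"one costly unit down:     {costly:.2e}")
print(f"all four cheap units down: {cheap:.2e}")
```

With these assumed figures the four cheap units are all down together
roughly 6 times in a million, against 1 time in a thousand for the
single costly unit. The result only holds as long as the failures
really are independent; a shared building, power feed or network
segment breaks that assumption.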
If no single machine is critical and machines are widely separated, we
would not even need to worry whether the machines are equipped with UPS
or redundant supplies.