The obstacle is the DB server. Keeping an offsite dump of the DB server
running 24/7 would effectively double the bandwidth used for each
transaction. I work for an ISP, and N+1 redundancy takes care of most
internal DC issues. The DC being used is Verio, which is a VERY large
hosting company; I don't see them going down any time soon. Some advanced
routers have a dial-up redundancy capability: the router can phone an
off-site router over a separate land line to tell it that there is a
network issue on one end and that all data needs to be re-routed to the
secondary stack.
You can never eliminate all single points of failure, but I also see that
some of us are losing our heads when it comes to solutions. Some of the
suggested solutions would cost not tens of thousands but hundreds of
thousands to implement.
Remember the golden rule of network engineering: KEEP IT SIMPLE, STUPID! ;)
-----Original Message-----
From: wikitech-l-bounces(a)Wikipedia.org
[mailto:wikitech-l-bounces@Wikipedia.org] On Behalf Of Nick Hill
Sent: Friday, January 02, 2004 3:34 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] New system ideas
Tim Thorpe wrote:
What about sticking the entire cluster on private IPs and having a
load-balancing firewall appliance handle traffic flow, which would
randomly hit the update box?
There would be several points of failure with such a system.
To make a system really reliable, all single points of failure need to
be removed.
E.g.: Building-specific hazards: power loss (UPSes sometimes fail), network
cables cut, fire, burglary, landlord repossession, hosting company
bankruptcy, malicious attack, human error, plane crash, etc.
Machine-specific hazards: any single machine failing and bringing
everything down, through either hardware failure or malicious attack.
Network hazards: any single segment of the network failing and bringing
the system down, through hardware failure, human error or malicious attack.
I believe a design philosophy where the system is immune from any single
element failing is both the most cost-effective and the most reliable.
Rather than invest heavily for reliability in mission-critical systems,
make no system mission critical. No system then needs to have
mission-critical investment. The overall system will then be cheaper and
more reliable.
To put it another way:
All systems will fail. The probability of a single reliable, costly unit
failing is still fairly high. The probability of many fairly reliable
cheap units with no common point of failure breaking down
simultaneously is much lower than the probability of a costly reliable
unit failing.
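As a rough illustration of that argument, here is a minimal Python sketch.
It assumes failures are independent (no common point of failure) and uses
made-up figures: a 1% chance that the costly unit is down at any given
moment and a 5% chance for each cheap unit. Neither number comes from real
hardware, and the function name is hypothetical.

```python
def all_fail_probability(p_unit: float, n_units: int) -> float:
    """Probability that n independent units are all down at the same time.

    Assumes every unit fails independently with the same probability,
    which only holds if the units share no common point of failure.
    """
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return p_unit ** n_units


# Assumed, illustrative figures only (not measured values).
P_COSTLY_UNIT = 0.01   # one expensive, "reliable" machine
P_CHEAP_UNIT = 0.05    # each cheap, "fairly reliable" machine

if __name__ == "__main__":
    print(f"single costly unit down: {P_COSTLY_UNIT:.6f}")
    for n in (1, 2, 3, 4):
        p = all_fail_probability(P_CHEAP_UNIT, n)
        print(f"{n} cheap unit(s) all down: {p:.6f}")
```

Under these assumptions a single cheap unit is worse than the costly one,
but two or more cheap units all being down at once is already less likely
than the single costly unit failing. The comparison breaks down as soon as
the cheap units share a hazard (same building, same power feed, same
network segment), which is why the separation matters.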
If no single machine is critical and machines are widely separated, we
would not even need to worry whether the machines are equipped with UPS
or redundant supplies.