RE: [Wikitech-l] New system ideas

3 Jan 2004

The obstacle is the DB server: keeping an offsite dump of the DB server
running 24/7 would effectively double the bandwidth used for each
transaction. I work for an ISP, and N+1 redundancy takes care of most
internal DC issues. The DC being used is Verio, which is a VERY large
hosting company; I don't see them going down any time soon. Some advanced
routers have a dial-up redundancy capability: the router can phone an
off-site router over a separate land line to inform it that there is a
network issue on one end and that all data needs to be re-routed to the
secondary stack.

You can never eliminate all single points of failure, but I also see that
some of us are losing our heads when it comes to solutions. Some suggested
solutions would cost not tens of thousands but hundreds of thousands to
implement.

Remember the golden rule of network engineering: KEEP IT SIMPLE, STUPID! ;)

-----Original Message-----
From: wikitech-l-bounces(a)Wikipedia.org
[mailto:wikitech-l-bounces@Wikipedia.org] On Behalf Of Nick Hill
Sent: Friday, January 02, 2004 3:34 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] New system ideas

Tim Thorpe wrote:
...
> What about sticking the entire cluster on private IPs and having a load
> balancing firewall appliance handle traffic flow which would randomly hit
> the update box?

There would be several points of failure with such a system.

To make a system really reliable, all single points of failure need to 
be removed.

E.g. building-specific hazards: power loss (UPSes sometimes fail), network 
cables cut, fire, burglary, landlord repossession, hosting company 
bankruptcy, malicious attack, human error, plane crash, etc.

Machine-specific hazards: any single machine in the system failing and 
bringing everything down, through either hardware failure or malicious 
attack.

Network hazards: any single segment of the network failing and bringing 
the system down, through hardware failure, human error or malicious attack.

I believe a design philosophy where the system is immune from any single 
element failing is both the most cost-effective and the most reliable. 
Rather than invest heavily for reliability in mission-critical systems, 
make no system mission critical. No system then needs to have 
mission-critical investment. The overall system will then be cheaper and 
more reliable.

To put it another way:
All systems will fail. The probability of a single reliable, costly unit 
failing is still fairly high. The probability of many fairly reliable, 
cheap units with no common point of failure all breaking down 
simultaneously is much lower than the probability of one costly, reliable 
unit failing.
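
As a rough illustration of that comparison, here is a minimal sketch in
Python. It assumes the units fail independently (no common point of
failure), and the failure probabilities are made-up numbers chosen only to
show the arithmetic, not measurements of any real hardware. The function
name is likewise hypothetical.

```python
def all_fail_probability(p_unit: float, n_units: int) -> float:
    """Probability that n independent units are all down at the same time.

    Assumes failures are independent, i.e. the units share no common
    point of failure. With that assumption the joint probability is
    simply the product of the individual probabilities.
    """
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return p_unit ** n_units


# Illustrative, assumed figures: one costly unit that is down 1% of the
# time, versus three cheap units that are each down 5% of the time.
costly_single = all_fail_probability(0.01, 1)
cheap_cluster = all_fail_probability(0.05, 3)

print(f"one costly unit down:      {costly_single:.6f}")
print(f"three cheap units all down: {cheap_cluster:.6f}")
```

Under these assumed figures the three cheap units are all down together far
less often than the single costly unit is down. The result depends entirely
on independence: any shared power feed, switch or building brings the
probabilities back together.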

If no single machine is critical and machines are widely separated, we 
would not even need to worry whether the machines are equipped with UPS 
or redundant supplies.

_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
