Re: [Wikitech-l] Five points we should be discussion about the mediawiki projects

16 Mar 2005

Alex J. Avriette wrote:
...
    My interest in this project -- mediawiki and wikipedia -- started to take a
 more serious note when the "great power loss crash" occurred. As a systems admin
 who has been in charge of high availability systems, it shocked me how long it
 took to recover. It further shocked me that data had been lost, when I had spent
 the entire day adding what I felt was useful content.
What data was lost? As far as I know, nothing at all should have been
lost in that incident except perhaps from the last few seconds prior to
the crash. If you looked at the site while it was read-only during the
playback of logs or shortly thereafter before we cleared the parser
cache you might have seen old versions of pages, but that would have all
been restored by the end of the day.

...
    I mention this because chaper and Jimbo and I have discussed the availability
 of the wikipedia, and also the fact that it has been run largely by developers.
 As somebody who has been both a developer and a systems administrator (clever
 readers will be able to find my resume online), I can tell you that this is
 frequently a very bad idea. That is not to say that developers should not have
 the keys to the kingdom, but frequently, a developer does not know that we need
 a bigger APC or that we might need APC PDU's, and so on and so forth.
Depends on what you mean by "developers". Many of the "developers" (such
as Kate and Jamesday) aren't actually the people touching the MediaWiki
code, but are system administrators and DBAs who spend most of their
time and effort on running the server farm, arranging the network,
ordering our new hardware, database admin, etc.

...
  * Power outage at the colo
   Kate says we pay for this. This makes it very hard to tolerate failure of that
 magnitude.  Since then, we still don't have ariel back up as the master database
 server. The solution is multiple colocation centers.
Well, additional data centers are on their way. :)

...
    Lastly, Oracle has a product called RAC, their Real Application Clusters. I
 think that (and no I haven't asked them), they may be willing to *give* us
 licenses in exchange for being able to use in marketing data "well the wikipedia,
 which receives x gazillion hits a day uses RAC" and a sound bite from Jimbo...

Oracle is unlikely to happen, even if they pay us to use it. There's a
conscious political decision to use FOSS software.

...
    And before I forget to mention it, Postgres is *more Free* than mysql. I
 understand that mediawiki has been coded with mysql in mind, but it might be
 possible to begin work on a database-agnostic version of the software that
 actually could plug into postgres and we could test things like
 cross-continental failover.
Experimental PostgreSQL support already exists, and will be improving as
time goes along.

...
    Another system reliability subject is the lack of disaster-recovery
 documentation.
 Lack of sufficient network diagrams. Lack of documentation required for us (me)
 to start attacking this from a SYSTEMS point of view. I understand how we work
 squid, apache, mysql, the slaves, and mediawiki.  Cool. But tell me where the
 switches are. What models they are. Which nodes are connected to which switches.
Some of this is on wp.wikidev.net. If it's not, talk to Kate etc. and
make sure it gets done.

-- brion vibber (brion @ pobox.com)
