Alex J. Avriette wrote:
My interest in this project -- mediawiki and wikipedia -- started to take a
more serious note when the "great power loss crash" occurred. As a systems admin
who has been in charge of high availability systems, it shocked me how long it
took to recover. It further shocked me that data had been lost, when I had spent
the entire day adding what I felt was useful content.
What data was lost? As far as I know, nothing at all should have been
lost in that incident except perhaps from the last few seconds prior to
the crash. If you looked at the site while it was read-only during the
playback of logs or shortly thereafter before we cleared the parser
cache you might have seen old versions of pages, but that would have all
been restored by the end of the day.
I mention this because chaper and Jimbo and I have discussed the availability
of the wikipedia, and also the fact that it has been run largely by developers.
As somebody who has been both a developer and a systems administrator (clever
readers will be able to find my resume online), I can tell you that this is
frequently a very bad idea. That is not to say that developers should not have
the keys to the kingdom, but frequently, a developer does not know that we need
a bigger APC or that we might need APC PDU's, and so on and so forth.
Depends on what you mean by "developers". Many of the "developers" (such
as Kate and Jamesday) aren't actually the people touching the MediaWiki
code, but are system administrators and DBAs who spend most of their
time and effort on running the server farm, arranging the network,
ordering our new hardware, database admin, etc.
* Power outage at the colo
Kate says we pay for this. This makes it very hard to tolerate failure of that
magnitude. Since then, we still don't have ariel back up as the master database
server. The solution is multiple collocation centers.
Well, additional data centers are on the way. :)
Lastly, Oracle has a product called RAC, their Real Application Clusters. I
think that (and no I haven't asked them), they may be willing to *give* us
licenses in exchange for being able to use in marketing data "well the wikipedia,
which receives x gazillion hits a day uses RAC" and a soundbite from Jimbo...
Oracle is unlikely to happen, even if they pay us to use it. There's a
conscious political decision to use FOSS software.
And before I forget to mention it, Postgres is *more Free* than mysql. I
understand that mediawiki has been coded with mysql in mind, but it might be
possible to begin work on a database-agnostic version of the software that
actually could plug into postgres and we could test things like
cross-continental failover.
Experimental PostgreSQL support already exists, and will be improving as
time goes along.
Another system reliability subject is the lack of disaster-recovery documentation.
Lack of sufficient network diagrams. Lack of documentation required for us (me)
to start attacking this from a SYSTEMS point of view. I understand how we work
squid, apache, mysql, the slaves, and mediawiki. Cool. But tell me where the
switches are. What models they are. Which nodes are connected to which switches.
Some of this is on wp.wikidev.net. If it's not, talk to Kate etc and make sure
it gets done.
-- brion vibber (brion @ pobox.com)