Re: [Wikitech-l] 2006 Apr 20 drastic slowdown postmortem

23 Apr 2006

Tim Starling wrote:
...
  William Allen Simpson wrote:
  Wouldn't want to bother anybody during an outage, as I'm sure that folks
 are busy.  The point of a postmortem is to figure out how to prevent the
 same from happening in the future.  
 Since I still don't have a clue what you're talking about, preventing it
 from happening in the future might be difficult. I'll ask again: what is
 "local time"? Which local time are you talking about?
  Since the thread hasn't had the words "local time" for some time, it was
hard to figure out your query.  Going back to my first message, there is
a "local time" in parentheses.  In context, it is clear that "last night
(local time)" relates to your data center.  That is, local night.

Had you looked at the graphs (or other logs) at the time, or anytime in
the following day (all I would hope), the data was obvious.  Since you
didn't, I've attached some of the dozen or so that I saved.

The time was 02:30+ UTC.

The load, CPU, and network dropped off the Squids, and the Apaches
(network incoming stayed the same, outgoing dropped), at the same time that
SQL load and CPU leaped (network showed mild incoming decrease and outgoing
increase, the inverse of the Apaches).  I told you exactly which servers.

No corresponding note in the admin log.  Perhaps somebody remembers doing
something unusual at the time?

...
  Don't give me this "strawman" crap. You've been here 2 weeks and you think
 you know the site better than I do? Unless you're willing to treat the
 existing sysadmin team with the respect it deserves, I'm not interested in
 dealing with you.
  Never said that I did.  That's why I've been asking questions.  The level
of site documentation is execrable.

However, it would be nicer for you to treat folks offering help with the
respect *they* deserve.  After all, I do happen to have 30+ years of
experience in the field, organized the state government funding for NSFnet
(the academic precursor to the Internet) 20 years ago, was an original
member of the North American Network Operators Group (NANOG), have written
a fair few Internet standards over the years, among other things.

http://www.google.com/search?q=%22William+Allen+Simpson%22

...
  The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached
 cluster for those 4 wikis is also located in yaseo. We discussed allowing
 remote apaches to serve read requests from a local slave database, proxying
 write requests back to the location of the master database. The problem is
 that cache writes and invalidations are required even on read requests.
  Yes, this is obvious and well-known.  They are just caches, improvements in
local efficiency.

...
  While distributed shared memory systems with cache coherency and
 asynchronous write operations have been implemented several times,
 especially in academic circles, I'm yet to find one which is suitable for
 production use in a web application such as MediaWiki. When you take into
 account that certain kinds of cache invalidation must be synchronised with
 database writes and squid cache purges, the problem of distribution, taken
 as a whole, would be a significant project.
  Amazingly, I happen to be sitting just 1 1/2 blocks from one of those
"academic circles", the Center for Information Technology Integration of
the University of Michigan in Ann Arbor, Michigan.

...
  Last year, we discussed the possibility of setting up a second datacentre
 within the US. But it was clear that centralisation, at least on a per-wiki
 level, gives the best performance for a given outlay, especially when
 development time and manageability are taken into account. Of course this
 performance comes at the expense of reliability. But Domas assured us that
 it is possible to obtain high availability with a single datacentre, as long
 as proper attention is paid to internal redundancy.
  Yes, faster, cheaper, better; pick two (as the old saying goes).

Not knowing "Domas" (or whether that's a name or a company), I'm not sure
of the basis for the assurance.  Had you checked with other sites, I'm
pretty sure you'd have heard that reliability from a single data center is
extremely unlikely.

...
  With the two recent power failures, it's clear that proper attention wasn't
 paid, but that's another story.
  No, that's the same old story.  It's practically guaranteed.

...
  Automatic failover to a read-only mirror would be much easier than true
 distribution, but I don't think we have the hardware to support such a high
 request rate, outside of pmtpa.
  Agreed.  So, it's probably time to think about fixing that problem.

...
  In the end it comes down to a trade-off between costs and availability.
 Given the non-critical nature of our service, and the nature of our funding,
 I think it's prudent to accept, say, a few hours of downtime once every few
 months, in exchange for much lower hardware, development and management
 costs. If PowerMedium can't provide this level of service despite being paid
 good money, I think we should find a facility that can.
  Agreed.

...
   I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart,
 presumably your clusters.  Good practice.  
 Don't be patronising.
  So, when I'm asking critical questions, I'm not giving you the respect you
deserve, but by giving you an "attaboy", I'm patronizing?

Sounds like somebody is lacking some social graces.

I'll just note in passing that the current documentation lists
   https://wikitech.leuksman.com/view/DNS
  * ns0.wikimedia.org - 207.142.131.207 (secondary IP on zwinger)
  * ns1.wikimedia.org - 207.142.131.208 (larousse)
  * ns2.wikimedia.org - 145.97.39.158 (secondary IP on pascal)

You know, that bad practice of having 2 on the same subnet, mentioned a
couple of messages back....  So, the note was supposed to be encouragement,
notwithstanding that the documentation is wrong.  The only reason I know
that it's been improved is by a bit of archeology with dig.
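
For anyone who wants to repeat the check against the documented
addresses, here is a minimal Python sketch.  It groups nameserver
addresses by network and reports any network holding more than one of
them.  The function name shared_subnets and the /24 prefix length are
assumptions made for illustration; the real subnet boundaries at each
site may differ.

```python
import ipaddress


def shared_subnets(servers, prefix=24):
    """Return {network: [names]} for networks holding 2+ nameservers.

    `prefix` is an assumed subnet size (/24), not a documented value.
    """
    groups = {}
    for name, addr in servers.items():
        # strict=False lets us pass a host address and get its network back
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups.setdefault(net, []).append(name)
    return {str(net): names for net, names in groups.items() if len(names) > 1}


# Addresses as listed on the documentation page quoted above.
documented = {
    "ns0.wikimedia.org": "207.142.131.207",
    "ns1.wikimedia.org": "207.142.131.208",
    "ns2.wikimedia.org": "145.97.39.158",
}

print(shared_subnets(documented))
```

With the documented addresses, ns0 and ns1 fall in the same /24 while
ns2 stands alone, which is the concern raised above.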

...
  If the external network was down for 20 minutes then it's PowerMedium's
 problem. They probably lost a router or something. I have better things to
 worry about.
  The external network losses correspond to huge peaks in the MySQL graphs.
So, I doubt you have better things to worry about -- that appears to be
congestive collapse caused by something happening within your servers.

Even the loss of a router or switch or link is of concern, especially
coupled with other problems such as the loss of power.  Not knowing your
SLA, it may be that a refund is due.

Anyway, I thought a postmortem was in order....  Professionals do that
kind of thing.
