Tim Starling wrote:
What is "local time"? Please state your
times in UTC. The page you link to
doesn't go back as far as April 20, and it doesn't appear to have any
archive links.
Sorry, had no idea that site didn't keep an archive.
In any case, there's not much point in complaining
about slow response times
a day after the fact. As I told you before, the best place to contribute to
this sort of thing is on #wikimedia-tech.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034991.html
Wouldn't want to bother anybody during an outage, as I'm sure that folks
are busy. The point of a postmortem is to figure out how to prevent the
same from happening in the future.
Besides, IRC isn't very conducive to planning; an email exchange is much
preferable.
There are no other clusters which fill the same role
as pmtpa. Go to this page:
http://meta.wikimedia.org/wiki/Profiling/20051208
For failover, every cluster needs its own (slaved) copy of the SQL database,
its own apache servers, and its own squids.
After all, ISP customers aren't calling support because they cannot edit;
they're calling because they aren't getting pages served.
and tell me how fast the site would be if every one of
those Database::query
or memcached::get calls required a couple of transatlantic RTTs. Using
centralised caches improves the hit rate, and keeping them within a few
kilometres of the apache servers makes the latency acceptable.
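To put rough numbers on that (all figures below are illustrative assumptions, not measured Wikimedia values):

```python
# Back-of-the-envelope: per-page cost of backend round trips when the
# caches/DB are local versus across the Atlantic.  Every number here is
# an assumption for illustration only.
LOCAL_RTT_MS = 0.5         # memcached/DB on the same LAN as the apaches
TRANSATLANTIC_RTT_MS = 90  # typical Tampa <-> Europe round trip
CALLS_PER_PAGE = 40        # assumed Database::query + memcached::get calls

local = CALLS_PER_PAGE * LOCAL_RTT_MS
remote = CALLS_PER_PAGE * TRANSATLANTIC_RTT_MS
print(f"local backend:  {local:.0f} ms of wait per page view")
print(f"remote backend: {remote:.0f} ms of wait per page view")
```

Even with these conservative assumptions, the remote case is two orders of magnitude slower, which is the whole argument for keeping caches near the apaches.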
Strawman. Are the Tampa apaches using some sort of memcache shared
between them? If so, how would the Seoul apaches share it?
Also, the DNS
stopped serving inverse addresses. Compare:
[...]
The 84.40.24.22 reverse record is served by only two DNS servers,
both located on the same
subnet (very bad practice):
Maybe you should complain to whoever owns those servers.
Since they appear to be serving your net, presumably you either own them
or are paying for them one way or another.
I've noted that ns0, ns1, and ns2 for wikimedia are located far apart,
presumably at your clusters. Good practice.
However, the
loss of DNS responses from that same subnet leads to the
conclusion that the subnet might have been under congestive collapse. That is,
the lag might not have been produced by Wikimedia itself, but by a problem
with the link to or within the facility.
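The failure mode I'm describing can be sketched as a quick check (the addresses below are placeholders from the RFC 5737 documentation range, not the real nameservers):

```python
import ipaddress

# Hypothetical resolver addresses; the actual ones are elided above.
# The point: if every authoritative server for a reverse zone sits in
# one /24, a single congested or failed subnet silences inverse DNS.
nameservers = ["192.0.2.10", "192.0.2.20"]

subnets = {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in nameservers}
if len(subnets) == 1:
    print("single point of failure: all nameservers share one subnet")
```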
I very much doubt it. Did you try testing for packet loss by pinging a
Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.)
Then, traceroutes from various looking glasses to see whether the problem
is path specific. (Showed a couple of those earlier.)
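The loss figure comes straight off the ping summary line; a trivial parser sketch (the sample output below is assumed, in the style of Linux iputils ping, not a capture from the actual outage):

```python
import re

def packet_loss(ping_summary: str) -> float:
    """Pull the loss percentage out of a ping(8) summary line."""
    m = re.search(r"([\d.]+)% packet loss", ping_summary)
    if m is None:
        raise ValueError("no packet-loss figure found")
    return float(m.group(1))

# Assumed sample summary line, not real outage data:
summary = "10 packets transmitted, 0 received, 100% packet loss, time 9012ms"
print(packet_loss(summary))  # -> 100.0
```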
Again, something caused all the squids and apaches to stop getting bytes
and packets in. I saved the Ganglia .gifs; would you prefer I send them
as attachments?
That Ganglia setup uses RRDtool, which isn't too bad. It would be nice to see
the interface byte and packet counts for the switches and upstream routers.
Those would have told more about the bottleneck, assuming it was a link
issue. It could have been something else, but that's hard to know without data.
In this case, the dip shows up on all clusters, even though it probably
only affected Tampa. That's because all measurement is from one place.
Whenever I've set up a POP, I like to have an NTP chimer, MRTG, and a
separate DNS instance all running (usually on the same box). That way,
even when the main site is down, the others are still running and
collecting data. I find that customers may not like the fact that the mail
servers are down, but as long as they can still fetch data from elsewhere,
they're less likely to be completely unhappy.
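For the MRTG piece, a minimal target for one switch port looks roughly like this (hostname, community string, and port are placeholders; the trailing `:::::2` field selects SNMPv2c in the standard Target line syntax):

```
# Minimal MRTG sketch: poll the octet counters on port 1 of a hypothetical
# access switch via SNMPv2c.  Hostname and community are placeholders.
WorkDir: /var/www/mrtg
Target[sw1-p1]: 1:public@sw1.pop.example.net:::::2
MaxBytes[sw1-p1]: 12500000      # 100 Mbit/s expressed in bytes per second
Title[sw1-p1]: sw1 port 1 traffic
PageTop[sw1-p1]: <h1>sw1 port 1 traffic</h1>
```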
You've got a bastion at several clusters, where would the documentation be
for what you're running at each?
I've looked at
https://wikitech.leuksman.com/view/All_servers, but it's
hopelessly sparse (and out of date).