Tim Starling wrote:
What is "local time"? Please state your
times in UTC. The page you link to
doesn't go back as far as April 20, and it doesn't appear to have any
archive links.
Sorry, had no idea that site didn't keep an archive.
In any case, there's not much point in complaining
about slow response times
a day after the fact. As I told you before, the best place to contribute to
this sort of thing is on #wikimedia-tech.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034991.html
Wouldn't want to bother anybody during an outage, as I'm sure that folks
are busy. The point of a postmortem is to figure out how to prevent the
same from happening in the future.
Besides, IRC isn't very conducive to planning; an email exchange is much
preferable.
There are no other clusters which fill the same role
as pmtpa. Go to this page:
http://meta.wikimedia.org/wiki/Profiling/20051208
For failover, every cluster needs its own (slaved) copy of the SQL database,
its own apache servers, and its own squids.
After all, ISP customers aren't calling support because they cannot edit;
they're calling because they aren't getting pages served.
and tell me how fast the site would be if every one of
those Database::query
or memcached::get calls required a couple of transatlantic RTTs. Using
centralised caches improves the hit rate, and keeping them within a few
kilometres of the apache servers makes the latency acceptable.
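To put rough numbers on that (all figures below are illustrative assumptions, not measured Wikimedia values):

```python
# Back-of-the-envelope: per-page cost of backend round trips when the
# caches/DB are local versus across the Atlantic.  Every number here is
# an assumption for illustration only.
LOCAL_RTT_MS = 0.5         # memcached/DB on the same LAN as the apaches
TRANSATLANTIC_RTT_MS = 90  # typical Tampa <-> Europe round trip
CALLS_PER_PAGE = 40        # assumed Database::query + memcached::get calls

local = CALLS_PER_PAGE * LOCAL_RTT_MS
remote = CALLS_PER_PAGE * TRANSATLANTIC_RTT_MS
print(f"local backend:  {local:.0f} ms of wait per page view")
print(f"remote backend: {remote:.0f} ms of wait per page view")
```

Even with these conservative assumptions, the remote case is two orders of magnitude slower, which is the whole argument for keeping caches near the apaches.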
Strawman. Are the Tampa apaches using some sort of memcache shared
between them? If so, how would the Seoul apaches share it?
Also, the DNS
stopped serving inverse addresses. Compare:
[...]
The 84.40.24.22 reverse record is served by only two DNS servers,
both located on the same
subnet (very bad practice):
Maybe you should complain to whoever owns those servers.
Since they appear to be serving your net, presumably you either own them
or are paying for them one way or another.
I've noted that ns0, ns1, and ns2 for wikimedia are located far apart,
presumably at your clusters. Good practice.
However, the
loss of DNS responses from that same subnet leads to the
conclusion that the subnet might have been under congestive collapse. That is,
the lag might not have been produced by Wikimedia itself, but by a problem
with the link to or within the facility.
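The failure mode I'm describing can be sketched as a quick check (the addresses below are placeholders from the RFC 5737 documentation range, not the real nameservers):

```python
import ipaddress

# Hypothetical resolver addresses; the actual ones are elided above.
# The point: if every authoritative server for a reverse zone sits in
# one /24, a single congested or failed subnet silences inverse DNS.
nameservers = ["192.0.2.10", "192.0.2.20"]

subnets = {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in nameservers}
if len(subnets) == 1:
    print("single point of failure: all nameservers share one subnet")
```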
I very much doubt it. Did you try testing for packet loss by pinging a
Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.)
Then, traceroutes from various looking glasses to see whether the problem
is path specific. (Showed a couple of those earlier.)
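The loss figure comes straight off the ping summary line; a trivial parser sketch (the sample output below is assumed, in the style of Linux iputils ping, not a capture from the actual outage):

```python
import re

def packet_loss(ping_summary: str) -> float:
    """Pull the loss percentage out of a ping(8) summary line."""
    m = re.search(r"([\d.]+)% packet loss", ping_summary)
    if m is None:
        raise ValueError("no packet-loss figure found")
    return float(m.group(1))

# Assumed sample summary line, not real outage data:
summary = "10 packets transmitted, 0 received, 100% packet loss, time 9012ms"
print(packet_loss(summary))  # -> 100.0
```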
Again, something caused all the squids and apaches to stop getting bytes
and packets in. I saved the Ganglia .gifs; would you prefer I send them
as attachments?
That Ganglia setup uses RRDtool, which isn't too bad. It would be nice to see
the interface byte and packet counts for the switches and upstream routers.
Those would have told more about the bottleneck, assuming it was a link
issue. It could have been something else, but that's hard to know without data.
In this case, the dip shows up on all clusters, even though it probably
only affected Tampa. That's because all measurement is from one place.
Whenever I've set up a POP, I like to have an NTP chimer, MRTG, and a
separate DNS instance all running (usually on the same box). That way,
even when the main site is down, the others are still running and
collecting data. I find that customers may not like the fact that the mail
servers are down, but as long as they can still fetch data from elsewhere,
they're less likely to be completely unhappy.
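For the MRTG piece, a minimal target for one switch port looks roughly like this (hostname, community string, and port are placeholders; the trailing `:::::2` field selects SNMPv2c in the standard Target line syntax):

```
# Minimal MRTG sketch: poll the octet counters on port 1 of a hypothetical
# access switch via SNMPv2c.  Hostname and community are placeholders.
WorkDir: /var/www/mrtg
Target[sw1-p1]: 1:public@sw1.pop.example.net:::::2
MaxBytes[sw1-p1]: 12500000      # 100 Mbit/s expressed in bytes per second
Title[sw1-p1]: sw1 port 1 traffic
PageTop[sw1-p1]: <h1>sw1 port 1 traffic</h1>
```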
You've got a bastion at several clusters, where would the documentation be
for what you're running at each?
I've looked at
https://wikitech.leuksman.com/view/All_servers, but it's
hopelessly sparse (and out of date).