William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks
are busy. The point of a postmortem is to figure out how to prevent the
same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it
from happening in the future might be difficult. I'll ask again: what is
"local time"? Which local time are you talking about?
Besides, IRC isn't very conducive to planning; email exchange is much
preferable.
There are no other clusters which fill the same role as pmtpa. Go to this
page:
http://meta.wikimedia.org/wiki/Profiling/20051208
For failover, every cluster needs its own copy of the SQL database (slaved),
and its own apache servers, and its own squid.
After all, ISP customers aren't calling support because they cannot edit;
they're calling because they aren't getting pages served.
and tell me how fast the site would be if every one of those Database::query
or memcached::get calls required a couple of transatlantic RTTs. Using
centralised caches improves the hit rate, and keeping them within a few
kilometres of the apache servers makes the latency acceptable.
Strawman. Are the Tampa apaches using some sort of memcache shared
between them? Then, how do the Seoul apaches share that?
Don't give me this "strawman" crap. You've been here 2 weeks and you think
you know the site better than I do? Unless you're willing to treat the
existing sysadmin team with the respect it deserves, I'm not interested in
dealing with you.
The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached
cluster for those 4 wikis is also located in yaseo. We discussed allowing
remote apaches to serve read requests from a local slave database, proxying
write requests back to the location of the master database. The problem is
that cache writes and invalidations are required even on read requests.
While distributed shared memory systems with cache coherency and
asynchronous write operations have been implemented several times,
especially in academic circles, I'm yet to find one which is suitable for
production use in a web application such as MediaWiki. When you take into
account that certain kinds of cache invalidation must be synchronised with
database writes and squid cache purges, the problem of distribution, taken
as a whole, would be a significant project.
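
To make the read-request problem concrete, here is a minimal sketch of the
"read from a local slave, proxy writes to the master" idea. It is not
MediaWiki code. The class names, the dict-based database and cache, the
placement of the shared cache next to the master, and the 100 ms round-trip
figure are all assumptions made for illustration.

```python
ASSUMED_RTT_MS = 100  # assumption: rough transatlantic round-trip time


class Site:
    """One datacentre; plain dicts stand in for the SQL database and memcached."""

    def __init__(self, name):
        self.name = name
        self.db = {}
        self.cache = {}


class RemoteApache:
    """Hypothetical apache that reads from a local slave and proxies writes."""

    def __init__(self, local, master):
        self.local = local
        self.master = master
        self.wan_calls = []  # operations that had to leave the local site

    def read(self, title):
        # Assumption: the shared cache sits next to the master database.
        self.wan_calls.append("cache_get")
        text = self.master.cache.get(title)
        if text is None:
            text = self.local.db.get(title)  # local slave, no WAN trip
            # Cache fill: a write, although the request was a read.
            self.wan_calls.append("cache_set")
            self.master.cache[title] = text
        return text

    def write(self, title, text):
        self.wan_calls.append("db_write")
        self.master.db[title] = text
        # The invalidation has to follow the database write.
        self.wan_calls.append("cache_delete")
        self.master.cache.pop(title, None)

    def wan_delay_ms(self):
        # One assumed round trip per operation that left the local site.
        return len(self.wan_calls) * ASSUMED_RTT_MS


master, remote = Site("master"), Site("remote")
master.db["Main_Page"] = remote.db["Main_Page"] = "hello"  # pretend replication ran
apache = RemoteApache(remote, master)
apache.read("Main_Page")
print(apache.wan_calls)       # ['cache_get', 'cache_set']
print(apache.wan_delay_ms())  # 200
```

Even with the database read served locally, a single uncached read in this
toy model makes two long-distance calls. Moving the cache to the remote site
removes those calls but means every invalidation has to reach more than one
cache, which is the coherency problem described above.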
Last year, we discussed the possibility of setting up a second datacentre
within the US. But it was clear that centralisation, at least on a per-wiki
level, gives the best performance for a given outlay, especially when
development time and manageability are taken into account. Of course this
performance comes at the expense of reliability. But Domas assured us that
it is possible to obtain high availability with a single datacentre, as long
as proper attention is paid to internal redundancy.
With the two recent power failures, it's clear that proper attention wasn't
paid, but that's another story.
Automatic failover to a read-only mirror would be much easier than true
distribution, but I don't think we have the hardware to support such a high
request rate, outside of pmtpa.
In the end it comes down to a trade-off between costs and availability.
Given the non-critical nature of our service, and the nature of our funding,
I think it's prudent to accept, say, a few hours of downtime once every few
months, in exchange for much lower hardware, development and management
costs. If PowerMedium can't provide this level of service despite being paid
good money, I think we should find a facility that can.
I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart,
presumably your clusters. Good practice.
Don't be patronising.
However, the loss of DNS responses from the same subnet leads to the
conclusion that the subnet might be under congestive collapse. That is, this
lag might not be produced by wikimedia itself, but by a problem with the
link to or within the facility.
I very much doubt it. Did you try testing for packet loss by pinging a
Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.)
Then, traceroutes from various looking glasses to see whether the problem
is path specific. (Showed a couple of those earlier.)
Again, something caused all the squid and apaches to stop getting bytes
and packets in. I saved the ganglia .gifs, would you prefer I sent them
as attachments?
If the external network was down for 20 minutes then it's PowerMedium's
problem. They probably lost a router or something. I have better things to
worry about.
You've got a bastion at several clusters, where would the documentation be
for what you're running at each?
I've looked at https://wikitech.leuksman.com/view/All_servers, but it's
hopelessly sparse (and out of date).
If it's not there then it probably doesn't exist.
-- Tim Starling