Wouldn't it be a good idea to put things such as emails and stats updates
into the job queue? (All stats updates could be under one job type, with
just a parameter to decide which stat to update.)
Then the slowness would be handled by the job runners, letting edits come
through quickly. Since we're not doing it in-transaction anyway, there
shouldn't be a big problem with it (we could probably do the same for
logging, although it's not as important).
Assuming the job runners properly free connections, they shouldn't have any
open connections except the one they are currently using to update the stats
(and in the case of emails, no db connections at all if we pass the data
through in parameters, or if we connect, grab it, then disconnect before
even starting the email).
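
As a rough sketch of the idea above (not MediaWiki's actual job queue API):
the names Job, save_edit and run_jobs, the "statsUpdate" and "sendEmail"
job types, and the in-memory queue are all made up for illustration.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical job record: a type name plus free-form parameters.
    job_type: str
    params: dict = field(default_factory=dict)


job_queue = queue.Queue()              # stand-in for a persistent job table
site_stats = {"edits": 0, "pages": 0}  # stand-in for the site_stats row
sent_mail = []                         # stand-in for an outgoing mail transport


def save_edit(page, user_email):
    """Request path: enqueue the slow work and return immediately."""
    # One job type for all stats updates; the parameter says which counter.
    job_queue.put(Job("statsUpdate", {"counter": "edits", "delta": 1}))
    # Everything the mail needs travels in the parameters, so the runner
    # needs no DB connection to send it.
    job_queue.put(Job("sendEmail", {"to": user_email,
                                    "subject": f"{page} was changed"}))


def run_jobs():
    """Job runner: drain the queue, one job at a time; return jobs handled."""
    handled = 0
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return handled
        if job.job_type == "statsUpdate":
            site_stats[job.params["counter"]] += job.params["delta"]
        elif job.job_type == "sendEmail":
            sent_mail.append((job.params["to"], job.params["subject"]))
        handled += 1
```

The point is only that save_edit() returns before any counter is touched or
any mail is sent; the runner picks both up later.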
This would probably help lower the cost of stats updates, and stop emails
from holding DB connections at all. It's probably a bit of treating the
symptoms, not the problem, but it would work for now.
- mattj
--------------------------------------------------
From: "Tim Starling" <tstarling(a)wikimedia.org>
Sent: Thursday, September 25, 2008 3:18 PM
To: <wikitech-l(a)lists.wikimedia.org>
Subject: Re: [Wikitech-l] Page saving slowness and some loading breakage today
Brion Vibber wrote:
Posted this summary on blog, going out to
en.planet.wikimedia.org...
http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of
things:
1. There’ve been ongoing database locking issues with the site
statistics updates — these would all block on each other, making page
saves very slow at times
2. … which held open database connections, causing the text storage
servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent
changes to the storage and load balancing code.
I did see something like this before. I didn't revert the ES changes
because they weren't the issue, and the fact that the ES master went down
first allowed the site to continue in read-only mode. You could have just
increased the max connections on the ES masters, for the same effect. The
connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time;
obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to
progress towards higher availability. I think a better way to fix the
site_stats contention would have been to insert an unconditional COMMIT in
SiteStatsUpdate::doUpdate().
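
To illustrate why committing right after the counter update helps, here is
a small sketch using Python's sqlite3 module. It is not the MediaWiki code:
do_update, other_writer_blocked and demo are hypothetical helpers, and
SQLite's whole-database write lock only stands in for the row lock a
MySQL server would hold on the site_stats row.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    # timeout=0: fail at once instead of waiting if another writer holds the lock
    return sqlite3.connect(path, timeout=0)


def do_update(conn, commit_now):
    """Bump the edit counter; optionally commit straight away."""
    conn.execute("UPDATE site_stats SET ss_total_edits = ss_total_edits + 1")
    if commit_now:
        conn.commit()  # releases the write lock before any slow work follows


def other_writer_blocked(path):
    """Return True if a second connection cannot update the counter right now."""
    other = open_db(path)
    try:
        do_update(other, commit_now=True)
        return False
    except sqlite3.OperationalError:  # raised when the database is locked
        return True
    finally:
        other.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "stats.db")
    conn = open_db(path)
    conn.execute("CREATE TABLE site_stats (ss_total_edits INTEGER)")
    conn.execute("INSERT INTO site_stats VALUES (0)")
    conn.commit()

    # Update left inside an open transaction: other writers are locked out.
    do_update(conn, commit_now=False)
    blocked_without_commit = other_writer_blocked(path)
    conn.commit()

    # Update followed by an immediate COMMIT: other writers get through.
    do_update(conn, commit_now=True)
    blocked_with_commit = other_writer_blocked(path)

    total = conn.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0]
    conn.close()
    return blocked_without_commit, blocked_with_commit, total
```

In this sketch the second writer is refused only while the first update is
still sitting in an uncommitted transaction, which is the contention the
unconditional COMMIT is meant to avoid.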
If the connection count on the ES master really is a problem (not just a
symptom of a much larger problem), then that can be mitigated by closing
the connections early. But I think the only reason we're seeing this come
out on the ES servers is because they have the lowest number of maximum
connections, so they fail first.
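
A toy model of "closing the connections early", with everything in it
assumed for illustration: the connection() helper, the MAX_CONNECTIONS
limit and the two handlers are hypothetical and only count how many
connections are open while slow work runs.

```python
from contextlib import contextmanager

MAX_CONNECTIONS = 2   # hypothetical per-server limit
_open_connections = 0


def open_count():
    """Number of connections currently held open."""
    return _open_connections


@contextmanager
def connection():
    """Fake connection that counts against the limit while it is open."""
    global _open_connections
    if _open_connections >= MAX_CONNECTIONS:
        raise RuntimeError("too many connections")
    _open_connections += 1
    try:
        yield {"text": "revision text"}  # stand-in for a fetched blob
    finally:
        _open_connections -= 1


def handle_request_holding(slow_work):
    # Connection stays open for the whole request, including the slow part.
    with connection() as conn:
        text = conn["text"]
        slow_work()
    return text


def handle_request_closing_early(slow_work):
    # Fetch what is needed, release the connection, then do the slow part.
    with connection() as conn:
        text = conn["text"]
    slow_work()
    return text
```

With the second handler, a request stuck in slow work no longer counts
against the connection limit, so the server with the lowest limit is less
likely to be the first to refuse connections.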
-- Tim Starling