Re: [Wikitech-l] Page saving slowness and some loading breakage today

25 Sep 2008

Wouldn't it be a good idea to put things such as emails and stats updates 
into the job queue? (All stats updates could go under one job type, with 
a parameter to select which stat to update.)

Then the slowness would be handled by the job runners, letting edits come 
through quickly. Since we're not doing it in-transaction anyway, there 
shouldn't be a big problem with it (we could probably do the same for 
logging, although it's not as important).
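
To make the idea concrete, here is a minimal sketch in Python (not MediaWiki's PHP). Everything in it is an assumption for illustration: the in-memory `job_queue`, the `statsUpdate` job type, and the `field`/`delta` parameters are hypothetical names, and a real job queue would live in the database rather than in memory.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the job table and the stats row.
job_queue = deque()
site_stats = {"edits": 0, "articles": 0}


def save_edit(text):
    """Save path: store the edit, then enqueue the stats work instead of doing it inline."""
    # ... the edit itself would be written here ...
    job_queue.append(
        {"type": "statsUpdate", "params": {"field": "edits", "delta": 1}}
    )


def run_jobs():
    """Job runner: a single job type covers all stats, the parameter picks the counter."""
    while job_queue:
        job = job_queue.popleft()
        if job["type"] == "statsUpdate":
            params = job["params"]
            site_stats[params["field"]] += params["delta"]
        else:
            raise ValueError("unknown job type: " + job["type"])
```

The save path only appends to the queue, so any contention on the stats row is paid by the runner and not by the person saving the page.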

Assuming the job runners properly free connections, they shouldn't have any 
open connections except the one they are currently using to update the stats 
(and in the case of emails, no db connections at all if we pass the data 
through in parameters, or if we connect, grab it, then disconnect before 
even starting the email).
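
A rough sketch of both variants for email jobs, again in Python with sqlite3 standing in for the real database. The `users` table, the parameter names, and the `send` callback are hypothetical; the point is only that no connection is open while the mail is being sent.

```python
import sqlite3


def fetch_recipient(db_path, user_id):
    """Connect, grab the address, and disconnect before any mail is sent."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
    finally:
        conn.close()  # released even if the query fails
    return row[0] if row else None


def run_email_job(params, send, db_path=None):
    """Email job: use the address from the parameters, else look it up first."""
    address = params.get("to")
    if address is None:
        address = fetch_recipient(db_path, params["user_id"])
    # No database connection is held from here on, however slow the send is.
    send(address, params["subject"], params["body"])
```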

This would probably help lower the cost of stats updates, and stop emails 
from holding DB connections at all. It probably treats the symptoms 
rather than the problem, but it would work for now.

- mattj

--------------------------------------------------
From: "Tim Starling" <tstarling(a)wikimedia.org>
Sent: Thursday, September 25, 2008 3:18 PM
To: <wikitech-l(a)lists.wikimedia.org>
Subject: Re: [Wikitech-l] Page saving slowness and some loading 
breakage today

...
  Brion Vibber wrote:

 Posted this summary on blog, going out to en.planet.wikimedia.org...
 http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/

 We’ve tracked down today’s problems to a combination of a couple of 
 things:

    1. There’ve been ongoing database locking issues with the site
 statistics updates — these would all block on each other, making page
 saves very slow at times
    2. … which held open database connections, causing the text storage
 servers to start locking out new connections …
    3. … which exacerbated problems with the failover behavior of recent
 changes to the storage and load balancing code.

 I did see something like this before, and the reason I didn't revert the
 ES changes is that they weren't the issue, and the fact that ES master
 went down first allowed the site to continue in read-only mode. You could
 have just increased the max connections on the ES masters, for the same
 effect. The connection count on the core master would have overflowed 
 instead.

 But I did think I had found the root cause of the problem at the time;
 obviously I hadn't.

 I think the ES load balancing changes were useful, and are a good way to
 progress towards higher availability. I think a better way to fix the
 site_stats contention would have been to insert an unconditional COMMIT in
 SiteStatsUpdate::doUpdate().
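
 As a sketch of what an unconditional COMMIT after the stats update could look like, here is a Python/sqlite3 illustration. It is not the actual SiteStatsUpdate::doUpdate() code; the function name and the single-column table are assumptions made for the example.

```python
import sqlite3


def do_stats_update(conn, delta_edits):
    """Apply the counter change, then commit at once so the row lock is released."""
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (delta_edits,),
    )
    # Unconditional commit: do not leave the update pending inside a
    # longer-running transaction that other page saves would block on.
    conn.commit()
```

 The commit ends the transaction immediately after the UPDATE, so concurrent saves wait only for that one statement and not for the rest of the request.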

 If the connection count on the ES master really is a problem (not just a
 symptom of a much larger problem), then that can be mitigated by closing
 the connections early. But I think the only reason we're seeing this come
 out on the ES servers is because they have the lowest number of maximum
 connections, so they fail first.

 -- Tim Starling

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l  
