[QA] Brief Labs outage

Chris McMahon cmcmahon at wikimedia.org
Fri Aug 1 13:59:10 UTC 2014


On Thu, Jul 31, 2014 at 10:19 PM, Ori Livneh <ori at wikimedia.org> wrote:

> (Apologies for cross-posting.)
>
> We've been noticing an issue with lock-ups on the beta cluster application
> servers for the past few days. It happens about once or twice a day.
>

I think it happens twice a day, once around early afternoon Pacific and
again overnight Pacific. I think the outages correspond with our
twice-daily browser test builds, which are the main source of load on beta
labs.

As I understand it, the symptom is that the Apache child processes stop
responding. When child processes stop responding, Apache spawns more
children until it reaches its maximum number of child processes, all of
which have stopped responding. At this point the system quickly becomes
sluggish and then just reports 503 errors. The API URL also fails to
return anything.
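
As a rough illustration of that saturation pattern, here is a toy Python
model. It is only a sketch under stated assumptions: the names
simulate_pool, max_children and hangs_per_tick are hypothetical, the hang
rate is constant, and nothing here reflects the actual Apache
configuration on beta labs.

```python
def simulate_pool(max_children, hangs_per_tick, ticks):
    """Toy model of a worker pool whose children hang and never recover.

    Returns a list of (tick, hung_children, http_status) tuples.
    All parameters are illustrative assumptions, not real Apache settings.
    """
    hung = 0
    history = []
    for tick in range(1, ticks + 1):
        # Some responsive children hang on this tick. The parent may spawn
        # replacements, but the total can never exceed max_children, so the
        # number of hung children is capped at that ceiling.
        hung = min(max_children, hung + hangs_per_tick)
        # Requests can only be served while at least one slot is not hung.
        free_slots = max_children - hung
        status = 200 if free_slots > 0 else 503
        history.append((tick, hung, status))
    return history


if __name__ == "__main__":
    for tick, hung, status in simulate_pool(max_children=4,
                                            hangs_per_tick=2,
                                            ticks=3):
        print(f"tick {tick}: {hung} hung children, status {status}")
```

With a ceiling of 4 children and 2 hanging per tick, the model still
answers on the first tick and is fully saturated from the second tick
onward, which matches the "sluggish, then 503" progression described
above.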

-Chris




>
> It just happened again on both application servers, and I'd really like to
> try and get to the bottom of things this time. I'll give up and force a
> restart if I haven't figured it out by 6:30 UTC, about an hour from now.
> Please accept my apology if this is disrupting your development or QA work,
> and ping me on IRC if you need Beta back up urgently.
>