[Labs-admin] tools: main page (and possibly other tools) just died, then (after about half an hour) restarted itself (?)

Bryan Davis bd808 at wikimedia.org
Mon Nov 21 01:08:55 UTC 2016


On Sun, Nov 20, 2016 at 12:32 AM, Alex Monk <amonk at wikimedia.org> wrote:
> [06:33:00] <icinga-wm> PROBLEM - tools homepage -admin tool- on
> tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not
> Available - 531 bytes in 0.021 second response time
> [06:34:03] <shinken-wm> PROBLEM - ToolLabs Home Page on toollabs is
> CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - string
> 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 531 bytes in 0.031
> second response time
>
> I started looking into this
> * Checked a couple of tools, other things e.g. GUC appear up (so didn't SMS
> any ops as I'm not sure the main page is that important)
> * Found it runs on the grid and tried `qmod -rj lighttpd-admin`
> * It appears up after this, but only briefly, then it's gone again
> * I try to figure out how to start it
> * Attempted 'webservice start', which looked OK, but 'webservice status'
> would always say 'Your webservice is not running'
> * ~07:13:24ish - it mysteriously appears online again
> * 07:16:52 - Matthew Bowker informs me that xTools was down too (no
> monitoring from shinken or icinga alerted IRC of this, but possibly
> connected) - he says the error from 'webservice restart' was
> https://www.irccloud.com/pastebin/w6AfLja7/
>
> I was looking at /data/project/.system/gridengine/spool/qmaster/messages
> while this was happening, I see quite a few 'host
> "tools-cron-01.tools.eqiad.wmflabs" is no admin host' errors in there though
> I have no reason to believe that's connected.

The error.log for tools.admin is full of nasty-looking stuff, starting
at 2016-11-20 06:30:51:

2016-11-20 06:30:51: (mod_fastcgi.c.1733) connect failed: Permission
denied on unix:/var/run/lighttpd/php.socket.admin-1
2016-11-20 06:30:51: (mod_fastcgi.c.2999) backend died; we'll disable
it for 1 seconds and send the request to another backend instead:
reconnects: 0 load: 1

Messages like these repeat, with the socket alternating between admin-1
and admin-0, often multiple times per second (likely once for every hit
to the app) until:

2016-11-20 06:48:13: (server.c.1558) server stopped by UID = 0 PID = 14694
2016-11-20 06:49:11: (server.c.1558) server stopped by UID = 0 PID = 29691
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py",
line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/admin'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line
138, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory:
'/var/crash/_usr_bin_webservice-runner.51051.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py",
line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/admin'
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py",
line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/admin'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line
138, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory:
'/var/crash/_usr_bin_webservice-runner.51051.crash'

Errors like this repeat until:

2016-11-20 07:13:00: (log.c.166) server started

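So ... yeah. Something went nuts with /var/run on the exec node maybe?
Both the FastCGI socket connects and the attempt to write
/var/run/lighttpd/admin failed with "Permission denied", so the
ownership or mode of /var/run/lighttpd would be the first thing I'd
check.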
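
If anyone wants to put numbers on the socket errors quoted above, here
is a rough sketch of how the log could be summarized. It is not
something I have run against the real file: the function name and the
regex are my own, and it assumes each log entry sits on a single line
(the copies pasted above were wrapped by my mail client).

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this mail (unwrapped):
# 2016-11-20 06:30:51: (mod_fastcgi.c.1733) connect failed: Permission denied on unix:/var/run/lighttpd/php.socket.admin-1
CONNECT_FAILED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"\(mod_fastcgi\.c\.\d+\) connect failed: Permission denied "
    r"on unix:(?P<socket>\S+)"
)


def summarize_connect_failures(lines):
    """Count 'connect failed: Permission denied' entries per socket.

    Returns (counts, first_ts, last_ts), where counts maps socket path
    to number of failures and the timestamps are the first and last
    matching entries in input order (None if nothing matched).
    """
    counts = Counter()
    first_ts = None
    last_ts = None
    for line in lines:
        match = CONNECT_FAILED.match(line)
        if match is None:
            continue  # ignore "backend died", tracebacks, etc.
        counts[match.group("socket")] += 1
        if first_ts is None:
            first_ts = match.group("ts")
        last_ts = match.group("ts")
    return counts, first_ts, last_ts
```

Feeding it an open file handle for the tool's error.log (wherever that
lives for the tool in question) should show whether both sockets failed
for the whole window or only part of it.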

Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


