[Labs-l] Queue down

Sun Oct 12 07:44:16 UTC 2014

Cron jobs to the queue may not have started.  You may want to check yours.

My cron jobs did not start up on the queue automatically.  Since the load
average was below 1.00 on all but two of the 15 queue machines, I have a
feeling many other people's jobs didn't start up either (ten are now
above 1.00).  I do have to say thank you... A daily job that normally takes
3 hours only took 40 minutes this time.  I like it when other people's jobs
aren't running :)

I also noticed I had dead jobs from July and August on there.  They
definitely weren't there before.  The one job I had running before the
troubles started was also dead (log files stopped being updated).   I
deleted the dead jobs, manually started one and the cron jobs came to
life.  So, I don't know whether removing the dead ones or manually starting
one got things going.
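
For anyone who wants to check their own jobs the same way (log files that
stopped being updated), here is a minimal sketch.  It is only an
illustration: the function name, the log paths and the one-hour threshold
are all hypothetical, and it only looks at file modification times.  It
does not talk to the grid engine at all.

```python
import os
import time


def stale_logs(paths, max_age_seconds, now=None):
    """Return the paths whose log file looks stale.

    A log counts as stale if it was last modified more than
    max_age_seconds ago, or if it cannot be read at all.
    (Hypothetical helper; the threshold is an assumption.)
    """
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            # Missing or unreadable log: treat it as stale too.
            stale.append(path)
            continue
        if age > max_age_seconds:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Placeholder paths; replace with your own job logs.
    logs = ["job1.out", "job2.out"]
    for path in stale_logs(logs, max_age_seconds=3600):
        print("possibly dead job, log not updated recently:", path)
```

A stale log only suggests that a job is dead; a job that legitimately
writes nothing for long stretches would also show up here.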

Bryan

On Sat, Oct 11, 2014 at 10:42 AM, Marc A. Pelletier <marc at uberbox.org>
wrote:

> On 10/11/2014 12:13 AM, Bryan White wrote:
> > The queue seems to be dead.  No cron jobs have started for at least 4
> > hours.  Anything I try, I receive:
>
> There were two corrupt entries in the job database, one of which
> outright /killed/ the gridengine master.  I was able to purge both
> entries, and things are back to normal.
>
> That said, I have no idea how those entries got corrupted to begin with;
> so I'll be keeping a close eye on things.
>
> -- Marc