[Labs-l] Cron job concurrency: consider adding `-once` to your cron tasks

Bryan Davis bd808 at wikimedia.org
Wed Aug 2 17:04:06 UTC 2017


We saw a big spike of active Grid Engine jobs starting around
2017-08-01T00:00. I've been looking at the list of active jobs and
noticed that several tools had a lot of copies of the same job
running. There are tools that are designed to have several copies of
the same job running working from a shared queue of some sort, but
often this is a sign that something is wrong with the script.

Here's a fancy shell pipeline that will give you a list of all of your
tool's running jobs grouped by job name and sorted by start time:

  qstat -xml |
  tr '\n' ' ' |
  sed 's#<job_list[^>]*>#\n#g' |
  sed 's#<[^>]*>##g' |
  grep " " |
  column -t |
  awk 'BEGIN { OFS="\t" } {print $1, $3, $6, $5}' |
  sort -k 2,2 -k 3,3

You can use this to see if you have parallel jobs running and, if so,
when the "stuck" jobs started. It seems that there may have been some
database-related events happening between 2017-07-31T23:00 and
2017-08-01T06:00 that left a bunch of jobs stuck in a bad state
internally.
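
If the full listing is long, a small helper can reduce it to just the
job names that have more than one copy. This is only a sketch: the
function name `count_duplicate_jobs` is made up for illustration, and
it assumes the tab-separated "job-id, name, start-time, state" layout
that the pipeline above prints.

```shell
# Hypothetical helper: reads "job-id<TAB>name<TAB>start-time<TAB>state"
# lines on stdin and prints "count<TAB>name" for every job name that
# appears more than once.
count_duplicate_jobs() {
  awk -F '\t' '
    { count[$2]++ }                      # tally jobs per name
    END {
      for (name in count)
        if (count[name] > 1)             # keep only names with copies
          print count[name] "\t" name
    }
  ' | sort -k 2                          # stable, readable order by name
}
```

Pipe the output of the pipeline above into `count_duplicate_jobs` to
get one line per job name that currently has parallel copies.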

To keep your cron-scheduled jobs from running in parallel, you can add
the `-once` flag to the `jsub` commands in your crontab. Either
`jsub -once ...` or `qcronsub ...` will do this for you. When the once
flag is active, jsub and qcronsub look for jobs that your tool is
already running. If there is an active job with the same name, the new
job will *not* be started and an error message will be logged. The
name is either provided explicitly with `-N ....` or derived
automatically from the command if `-N` is not used.
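
As a rough illustration, a crontab entry using the flag might look
something like the following. The schedule, the job name
`update-stats`, and the script path are all invented for the example;
only `jsub`, `-once`, and `-N` come from the description above.

```
# Hypothetical crontab entry: every 10 minutes, submit the job unless a
# job named "update-stats" is already active for this tool.
*/10 * * * * jsub -once -N update-stats /data/project/mytool/update-stats.sh
```

Giving the job an explicit name with `-N` makes it easier to tell which
crontab line a running job came from when you check the job list.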

(This should probably end up on wikitech in the help somewhere...)

Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


