[Labs-l] Unable to delete jobs

Marc A. Pelletier marc at uberbox.org
Sun Nov 2 19:20:13 UTC 2014


On 11/01/2014 05:28 PM, Rohit Dua wrote:
> For the past few days, I have been unable to delete my tool's jobs on
> the grid using jstop or qdel -f via ssh. The jobs tend to go into dr
> mode (and keep running). This has stopped my tool from working, as I
> need to be able to stop and restart it under my control.

There was an issue caused by the switch from SIGKILL to SIGINT to kill
jobs. That switch was necessary because SIGKILL, while reliable, kills
jobs instantly without giving them an opportunity to clean up after
themselves (which web services and some other jobs need to do).
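
For jobs that want to use that opportunity, here is a minimal sketch of
a job that catches SIGINT and SIGTERM and cleans up before exiting. The
names GracefulShutdown and run_job are hypothetical, and the sleep is
only a stand-in for real work; adapt it to whatever your tool does.

```python
import signal
import time


class GracefulShutdown:
    """Remembers that a stop signal arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def install(self):
        # SIGKILL cannot be caught, so only the two catchable signals are handled.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: record the request and let the main loop react.
        self.requested = True
        self.signum = signum


def run_job(shutdown, cleanup, poll_seconds=0.1):
    """Loop until a stop signal has been seen, then run the cleanup callback."""
    while not shutdown.requested:
        time.sleep(poll_seconds)  # stand-in for one unit of real work
    cleanup()
```

Cleanup should finish well within the delay before the next signal,
since a job that is still running at the end gets SIGKILL regardless.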

There is a new process in place for killing jobs: first, the process
group is sent SIGINT; then, after a delay of up to 10 seconds, SIGTERM
is sent to the process group.  If the job process or any of its
descendants are still alive after another ten seconds, they are all
individually killed with SIGKILL.  Please be aware that this last
SIGKILL will kill every process that was a child of the job, as well as
any parentless process owned by the tool that may be left over on the
grid node.  That may sweep up orphans from /other/ jobs if there are any
(but if there are, it's a bug in your process handling).
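
The escalation above can be sketched roughly as follows. This is an
illustration, not the code the grid runs: stop_process_group and demo
are hypothetical names, and for simplicity the final SIGKILL goes to the
whole process group instead of to each process individually, with no
sweep of parentless processes.

```python
import os
import signal
import subprocess
import sys
import time


def stop_process_group(pgid, is_alive, delay=10.0, poll=0.1):
    """Send SIGINT, then SIGTERM, then SIGKILL, waiting up to `delay` between.

    Returns the list of signals actually sent.
    """
    sent = []
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            return sent  # the group is already gone
        sent.append(sig)
        if sig == signal.SIGKILL:
            break  # nothing stronger to escalate to
        deadline = time.monotonic() + delay
        while time.monotonic() < deadline:
            if not is_alive():
                return sent  # exited within the grace period
            time.sleep(poll)
    return sent


def demo(delay=10.0):
    """Start a sleeping child in its own process group and stop it."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,  # child leads a new group, so pgid == pid
        stderr=subprocess.DEVNULL,
    )
    sent = stop_process_group(child.pid, lambda: child.poll() is None, delay)
    child.wait()
    return sent, child.returncode
```

A job that exits promptly on SIGINT never sees the later signals, which
is why the delay is "up to" ten seconds.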

-- Marc



