Note: This web page was automatically created from a PalmOS "pedit32" memo.
Loadleveler problems and resolutions
Today is the third time we've seen loadleveler get wedged. This time,
killing all of the root-owned LoadL_* processes as root with -9 (-1 did
not work), and then running llctl start on esmf04m, resolved the problem.
2004-11-08.
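A rough sketch of that workaround as a script, for next time. The function names are made up, and it assumes "ps -eo user,pid,comm" prints user, PID and command name in that order on this system; check before trusting it. It has to run as root on the affected node.

```shell
#!/bin/sh
# Print the PIDs of root-owned LoadL_* processes.
# Reads "user pid command" lines on stdin (assumed layout of
# "ps -eo user,pid,comm"); the header line is skipped naturally
# because its first field is not "root".
loadl_root_pids() {
    awk '$1 == "root" && $3 ~ /^LoadL_/ { print $2 }'
}

# Hypothetical helper: SIGKILL the daemons found above, then restart.
# -9 is used because -1 did not work in the incident described above.
restart_wedged_loadl() {
    pids=$(ps -eo user,pid,comm | loadl_root_pids)
    if [ -n "$pids" ]; then
        # Unquoted on purpose: one argument per PID.
        kill -9 $pids
    fi
    llctl start
}
```

Nothing runs until restart_wedged_loadl is called. To preview what would be killed, run "ps -eo user,pid,comm | loadl_root_pids" on its own first.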
Happened again 2004-12-10. Same workaround, worked. Had to wait a
while after the "llctl start" for tcp/9605 to start listening again.
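Rather than guessing how long to wait, something like the following could poll until the port is back. This is only a sketch: the function names, the 2-second interval and the default of 30 tries are my own choices, and it assumes "netstat -an" shows the local address in the fourth column (as "*.9605" or "0.0.0.0:9605") with LISTEN as the last column.

```shell
#!/bin/sh
# Succeed if netstat -an style input on stdin shows a LISTEN socket
# whose local address (assumed to be field 4) ends in the given port.
port_listening() {
    awk -v port="$1" '
        $NF == "LISTEN" && $4 ~ ("[.:]" port "$") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Poll netstat until the port is listening or the tries run out.
# Usage: wait_for_port PORT [TRIES]
wait_for_port() {
    port=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if netstat -an | port_listening "$port"; then
            return 0
        fi
        sleep 2
        tries=$((tries - 1))
    done
    return 1
}
```

For example: "llctl start; wait_for_port 9605 && echo listening".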
2005-07-18
Today is the second time we've seen a loadleveler job get stuck in "RP"
state (in llq output), which blocks other jobs from being executed.
llsubmit is slow, but still accepting new jobs.
"RP" state means that the job is being removed, but the removal cannot
complete.
"llcancel <jobname>" as the loadl user doesn't do much in this
case - the job just stays there in RP state.
"llq -lx" shows that the job is (still?) running on nodes 3 and 5.
It may have been running on more than those nodes previously; not sure
at this point.
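A quick filter for spotting such jobs, as a sketch only. It assumes the default llq listing puts the job id in the first column and the state in a later column of its own; rp_jobs is a made-up name.

```shell
#!/bin/sh
# Print the first field (assumed to be the job id) of every llq line
# that has a field exactly equal to "RP" after the first one.
rp_jobs() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "RP") { print $1; break }
        }
    }'
}
```

Usage would be "llq | rp_jobs"; the ids it prints can then be looked at in the "llq -lx" output.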
This is running on node 3:
root 27212 26646 LoadL_startd -f -c /tmp
mflanner 27634 27212 LoadL_starter -p 72 -s esmf04m.6858.0 -c /tmp
There are no mflanner processes on node 5.
I've just kill -HUP'd mark's 27634 process on node 3...
jobs are still not being scheduled.
Now trying:
bash-3.00$ llctl -h esmf03m stop
llctl: Sent stop command to host esmf03m
bash-3.00$ llctl -h esmf03m stop
07/18 20:23:02 llctl: 2539-463 Cannot connect to esmf03m "LoadL_master"
on port 9616. errno = 79
llctl: 2512-183 Error occurred sending stop command to esmf03m
bash-3.00$ llctl -h esmf03m start
llctl: Attempting to start LoadLeveler on host esmf03m
LoadL_master 3.1.0.30 rlyns37a 2005/04/06 AIX 5.1 71
CentralManager = esmf04m
bash-3.00$
llstatus thinks both nodes 3 and 5 are idle.
Next trying:
bash-3.00$ llctl -h esmf05m stop
llctl: Sent stop command to host esmf05m
bash-3.00$ llctl -h esmf05m stop
07/18 20:25:26 llctl: 2539-463 Cannot connect to esmf05m "LoadL_master"
on port 9616. errno = 79
llctl: 2512-183 Error occurred sending stop command to esmf05m
bash-3.00$ llctl -h esmf05m start
llctl: Attempting to start LoadLeveler on host esmf05m
LoadL_master 3.1.0.30 rlyns37a 2005/04/06 AIX 5.1 71
CentralManager = esmf04m
bash-3.00$
Now trying the same thing on the third idle node of three, which happens
to be the interactive node, which is also the node that handles scheduling
to some extent:
bash-3.00$ llctl -h esmf04m stop
llctl: Sent stop command to host esmf04m
bash-3.00$ llctl -h esmf04m start
llctl: Attempting to start LoadLeveler on host esmf04m
07/18 20:26:35 TI-1 LoadLeveler is already running on this machine.
07/18 20:26:35 TI-1 The following daemons must be killed before startup
can continue:
07/18 20:26:35 TI-1 LoadL_schedd
07/18 20:26:35 TI-1 LoadL_master
07/18 20:26:35 TI-1 LoadLeveler not started.
bash-3.00$
A clue. :)
kill -1, -2, -15 didn't work for schedd and master, but -9 did. I could
then llctl -h esmf04m stop; llctl -h esmf04m start
After this, a short, single-node job submitted with llsubmit began
running immediately - no delay from llsubmit either.
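The per-node stop/start sequence used above could be wrapped up like this. It is a sketch under assumptions: bounce_loadl, LLCTL and BOUNCE_DELAY are names I made up, and the pause between stop and start is a guess, not something measured. It does not handle the case seen on esmf04m, where schedd and master survived the stop and had to be killed with -9 first.

```shell
#!/bin/sh
# Command used to talk to LoadLeveler; override with LLCTL=echo for a dry run.
LLCTL=${LLCTL:-llctl}

# Stop and then start LoadLeveler on each host given as an argument.
# BOUNCE_DELAY (seconds, default 5) is the pause between stop and start.
bounce_loadl() {
    for host in "$@"; do
        $LLCTL -h "$host" stop
        sleep "${BOUNCE_DELAY:-5}"
        $LLCTL -h "$host" start
    done
}
```

For example: "bounce_loadl esmf03m esmf05m esmf04m", followed by llstatus to see whether the nodes came back.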