Note: This web page was automatically created from a PalmOS "pedit32" memo.

Loadleveler problems and resolutions


Today is the third time we've seen loadleveler get wedged.  This time,
killing all of the root-owned LoadL_* processes as root with -9 (-1 did
not work), and then running llctl start on esmf04m, resolved the problem.
2004-11-08.
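
A minimal sketch of that workaround as a script; it is not something that was run at the time. The function names and the DO_IT switch are made up for illustration, and it assumes a ps that accepts the POSIX "-o user= -o pid= -o comm=" options. By default it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical helper: read "user pid command" lines on stdin and print the
# PIDs of root-owned processes whose command name starts with LoadL_
# (with or without a leading directory path).
loadl_root_pids() {
    awk '$1 == "root" && $3 ~ /(^|\/)LoadL_/ { print $2 }'
}

# Hypothetical wrapper for the workaround in this memo: kill -9 every
# root-owned LoadL_* process, then run "llctl start".
# Dry run unless DO_IT=1 is set in the environment; must be run as root.
unwedge_loadl() {
    pids=$(ps -e -o user= -o pid= -o comm= | loadl_root_pids)
    for pid in $pids; do
        if [ "${DO_IT:-0}" = 1 ]; then
            kill -9 "$pid"
        else
            echo "kill -9 $pid"
        fi
    done
    if [ "${DO_IT:-0}" = 1 ]; then
        llctl start
    else
        echo "llctl start"
    fi
}
```

Run "unwedge_loadl" on esmf04m to see the commands, then "DO_IT=1 unwedge_loadl" to execute them.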

Happened again 2004-12-10. Same workaround, worked. Had to wait a while after the "llctl start" for tcp/9605 to start listening again.
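
Rather than checking by hand, the wait for the port could be scripted. This is only a sketch: wait_for_port is a made-up name, and it relies on bash's /dev/tcp redirection, which is an assumption about how bash was built on these nodes.

```shell
# wait_for_port HOST PORT [TRIES]
# Hypothetical helper: try a TCP connect once per second until it succeeds.
# Returns 0 as soon as the port accepts a connection, 1 after TRIES failures
# (default 60). Needs a bash with /dev/tcp support.
wait_for_port() {
    host=$1
    port=$2
    tries=${3:-60}
    while [ "$tries" -gt 0 ]; do
        # The subshell opens the connection and closes it again on exit.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, assuming the port in question is the one on esmf04m: "llctl start && wait_for_port esmf04m 9605 120".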

2005-07-18. Today is the second time we've seen a loadleveler job get stuck in "RP" state (in llq output), which blocks other jobs from being executed. llsubmit is slow, but still accepting new jobs.

"RP" state means that the job is being removed, but the removal cannot complete. "llcancel <jobname>" as the loadl user doesn't do much in this case - the job just stays there in RP state. "llq -lx" shows that the job is (still?) running on nodes 3 and 5. It may have been running on more than those nodes previously; not sure at this point.

This is running on node 3:

    root 27212 26646 LoadL_startd -f -c /tmp
    mflanner 27634 27212 LoadL_starter -p 72 -s esmf04m.6858.0 -c /tmp

There are no mflanner processes on node 5.

I've just kill -HUP'd mark's 27634 process on node 3... jobs are still not being scheduled.

Now trying:

    bash-3.00$ llctl -h esmf03m stop
    llctl: Sent stop command to host esmf03m
    bash-3.00$ llctl -h esmf03m stop
    07/18 20:23:02 llctl: 2539-463 Cannot connect to esmf03m "LoadL_master" on port 9616. errno = 79
    llctl: 2512-183 Error occurred sending stop command to esmf03m
    bash-3.00$ llctl -h esmf03m start
    llctl: Attempting to start LoadLeveler on host esmf03m
    LoadL_master 3.1.0.30 rlyns37a 2005/04/06 AIX 5.1 71
    CentralManager = esmf04m
    bash-3.00$

llstatus thinks both nodes 3 and 5 are idle.

Next trying:

    bash-3.00$ llctl -h esmf05m stop
    llctl: Sent stop command to host esmf05m
    bash-3.00$ llctl -h esmf05m stop
    07/18 20:25:26 llctl: 2539-463 Cannot connect to esmf05m "LoadL_master" on port 9616. errno = 79
    llctl: 2512-183 Error occurred sending stop command to esmf05m
    bash-3.00$ llctl -h esmf05m start
    llctl: Attempting to start LoadLeveler on host esmf05m
    LoadL_master 3.1.0.30 rlyns37a 2005/04/06 AIX 5.1 71
    CentralManager = esmf04m
    bash-3.00$

Now trying the same thing on the third idle node of three, which happens to be the interactive node, which is also the node that handles scheduling to some extent:

    bash-3.00$ llctl -h esmf04m stop
    llctl: Sent stop command to host esmf04m
    bash-3.00$ llctl -h esmf04m start
    llctl: Attempting to start LoadLeveler on host esmf04m
    07/18 20:26:35 TI-1 LoadLeveler is already running on this machine.
    07/18 20:26:35 TI-1 The following daemons must be killed before startup can continue:
    07/18 20:26:35 TI-1 LoadL_schedd
    07/18 20:26:35 TI-1 LoadL_master
    07/18 20:26:35 TI-1 LoadLeveler not started.
    bash-3.00$

A clue. :) kill -1, -2, -15 didn't work for schedd and master, but -9 did. I could then run:

    llctl -h esmf04m stop; llctl -h esmf04m start

After this, a short, single-node job submitted with llsubmit began running immediately, with no delay from llsubmit either.
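
To spot this condition earlier, something like the filter below could pick stuck jobs out of llq output. It is a sketch only: rp_jobs is a made-up name, and it assumes that the job id is the first field on each llq line and that the state appears as a separate whitespace-delimited field, which should be checked against real llq output before relying on it.

```shell
# rp_jobs: read llq-style output on stdin and print the first field (assumed
# to be the job id) of every line that has a later field exactly equal to "RP".
rp_jobs() {
    awk '{ for (i = 2; i <= NF; i++) if ($i == "RP") { print $1; next } }'
}
```

Usage would be "llq | rp_jobs"; any output means at least one job is sitting in RP state.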

