• Light theory
  • Specifics (if every machine in your unrouted cluster is reachable from your head node, you only need to perform these steps once, on the head node. If some machines are reachable only via an additional ssh hop, you will need to repeat these steps on two or more machines):
    1. Get the normal nagios plugins archive from the nagios website, and install them under --prefix /usr/local/nagios
    2. Obtain the check_cluster2.c source file from the sourceforge page, and compile it. On some unix systems you may need to add getopt_long object files (.o's) to get it to link; Linux typically does not need this. Put your check_cluster2 binary in /usr/local/nagios/libexec, next to the other plugins you already installed.
    3. Set up a dcsew account on your head node that accepts passwordless, passphraseless ssh logins from the dcsew account on nsc-3.nacs.uci.edu.
    4. If you're using the nagios "check_ping" plugin on RHEL, the default version does not understand RHEL ping output when a host is down. To get around this, either use this patch, or just sftp the binary from nsc-3.nacs:/Web/nagios/libexec/check_ping.
    5. This script may serve as an example of check_cluster2 usage:
      
      #!/usr/local/bin/python

      import os
      import sys

      logfile = open('/usr/local/nagios/DCS/loadl.status', 'w')
      reslist = []
      # iterate over 1..8
      for i in range(1, 9):
              host = "esmf0" + str(i) + "m"
              # just for testing, to see what this script will do when
              # there's an error:
              # if host == 'esmf03m':
              #         port = 3000
              # else:
              #         port = 9605
              port = 9605
              pipe = os.popen("/usr/local/nagios/libexec/check_tcp -H " + host + " -p " + str(port), "r")
              # the first line of plugin output is the status description
              state_descrip = pipe.readline().strip()
              retval = pipe.close()
              # this is a bit strange: close() on a popen pipe returns None
              # if the exit status is 0; otherwise it returns the wait
              # status, which holds the exit code in its high byte
              if retval is None:
                      retval = 0
              # integer division, so the result stays an int on any Python
              sretval = str(retval // 256)
              reslist.append(sretval)
              logfile.write(host + ' ' + sretval + ' ' + state_descrip + '\n')

      logfile.close()

      # check_cluster2 takes the per-host statuses as a comma-separated list
      data = ",".join(reslist)

      pipe = os.popen("/usr/local/nagios/libexec/check_cluster2 --service --label 'loadleveler' -w 1 -c 1 -d " + data, "r")
      result_text = pipe.readline()
      status_code = pipe.close()
      # as above: None means the exit status was 0
      if status_code is None:
              status_code = 0
      sys.stdout.write(result_text)
      # exit with check_cluster2's own exit code so nagios sees it
      sys.exit(status_code // 256)
      
      
    6. Give each of your scripts at least one test run, and be sure to run them as the dcsew user.
    7. You can find a copy of this script at esmf.ess.uci.edu:/usr/local/nagios/DCS/loadl2. Another example script is esmf.ess.uci.edu:/usr/local/nagios/DCS/nfs.
    8. The script above goes in /usr/local/nagios/DCS/loadl2, but you will most likely be monitoring other services, so you will probably want to change the "loadl2" part of the name. Make sure you set the execute bits on your script.
    9. Let Dan know that you have a script on an unrouted cluster, and give him the hostname of the head node and the full path to the check_cluster2 wrapper script. He will add something to gen-config on nsc-3 to run checks via your script (for now; we may do this an easier way later).
    10. The script above will generate a logfile called /usr/local/nagios/DCS/loadl.status, which will contain the specifics on how each monitored host is doing. So if you get a nagios notice about something being down in your new service, you can check this file to see which hosts are having trouble. Be sure this logfile is owned by the dcsew user, or your script is likely to error out and not give useful results to nagios.
    11. Example logfile content (loadl.status) :
      esmf01m 0 TCP OK - 0 second response time on port 9605
      esmf02m 0 TCP OK - 0 second response time on port 9605
      esmf03m 0 TCP OK - 0 second response time on port 9605
      esmf04m 0 TCP OK - 0 second response time on port 9605
      esmf05m 0 TCP OK - 0 second response time on port 9605
      esmf06m 0 TCP OK - 0 second response time on port 9605
      esmf07m 0 TCP OK - 0 second response time on port 9605
      esmf08m 0 TCP OK - 0 second response time on port 9605
      

      The first column of zeros is the exit status from the check_tcp plugin, which is what gets fed into check_cluster2 by the wrapper script.

    12. The -w 1 -c 1 in the check_cluster2 os.popen call at the bottom of the script indicates that nagios should warn if one or more nodes are not working, and give a critical if one or more nodes are not working. If the two numbers are the same, it just won't ever warn; it only gives criticals. Also, if you customize this script to check two or more services, you'll probably want to make sure all the exit statuses find their way into the list named "reslist", and also modify what is written to the logfile to include the name of the service.
    13. For your host check, you're probably best off just pinging your head node.
    14. When you're writing nagios plugins, these exit statuses have the following meanings to nagios:
      • sys.exit(0) = ok
      • sys.exit(1) = warning
      • sys.exit(2) = critical
      • sys.exit(3) = unknown
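
To make the logfile format, the -w/-c thresholds, and the exit statuses above a little more concrete, here is a minimal Python sketch. It is not part of check_cluster2 or of the wrapper script: the function names (parse_status_line, failing_hosts, cluster_status) are hypothetical, and the threshold rule (a threshold is reached when the number of non-OK hosts is greater than or equal to it) is an assumption based on the description of -w 1 -c 1 above, not on the check_cluster2 source.

```python
# Nagios plugin exit statuses, as listed above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_status_line(line):
    """Split one 'host status description' logfile line into its parts."""
    parts = line.strip().split(None, 2)
    host = parts[0]
    status = int(parts[1])
    # The description may be missing if the plugin printed nothing.
    description = parts[2] if len(parts) > 2 else ""
    return host, status, description


def failing_hosts(lines):
    """Return (host, status, description) for every host that is not OK."""
    entries = [parse_status_line(line) for line in lines if line.strip()]
    return [entry for entry in entries if entry[1] != OK]


def cluster_status(statuses, warn, crit):
    """Hypothetical helper: collapse per-host statuses into one exit status.

    Assumption: a threshold is reached when the count of non-OK hosts is
    greater than or equal to it, so equal thresholds never yield a warning.
    """
    failed = sum(1 for status in statuses if status != OK)
    if failed >= crit:
        return CRITICAL
    if failed >= warn:
        return WARNING
    return OK
```

Calling failing_hosts(open('/usr/local/nagios/DCS/loadl.status')) would list only the hosts having trouble, and cluster_status([0, 2, 0], 1, 1) returns 2 (critical), which matches the "never warns" behaviour described for equal thresholds.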