There are 8 compute nodes, all running AIX 5.1: esmf01m through esmf07m have 8 CPUs each, and esmf08m has 32 CPUs. There's also esmfcws, which is where accounts are created (using /usr/local/etc/addacct, with and without -r, for now), and esmfhmc, which acts as a console server and is rarely used. /usr/local is changed on esmf04m and then pushed to the other machines with /usr/local/etc/copy-usr-local. esmft1 doesn't do much these days. :) esmft2 is for Lustre testing, as a Lustre client and NFS server. Then there are esmfsn01 through esmfsn05, which are the Lustre server nodes. Finally there's esmfgw, which exists just to facilitate patching - Saska knows more about that machine than I do, so ask him. :)
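
The day-to-day flow for those two site scripts looks roughly like this (a sketch - addacct's exact arguments are site-specific, so check the script rather than trusting the placeholder username below):

  # On esmfcws: create the account; "someuser" is a placeholder.
  /usr/local/etc/addacct someuser      # or: /usr/local/etc/addacct -r someuser
  # On esmf04m: make the change under /usr/local, then push it everywhere:
  /usr/local/etc/copy-usr-local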

GFS is gone! Don't worry about it. :)

Lustre is in a transitional state right now. Don't expect it to be up for the time being.

The old GFS procedures document is here, but you probably don't need it - at least, not for the ESMF.

/ptmp on the esmf is for relatively fast read/write, but isn't as large as /data (which should be Lustre before long). We recently set up user and group quotas on /ptmp. It's local to esmf04m, and shared out over NFS to the other compute nodes. If /ptmp runs horribly low on space, move a big user to /ptmp2 and symlink.
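
The move-and-symlink step is roughly this (a sketch; do it on esmf04m where /ptmp is local, pick a user who has no jobs writing there at the time, and "biguser" is just a placeholder):

  cp -Rp /ptmp/biguser /ptmp2/biguser    # copy the data over first
  rm -r /ptmp/biguser                    # then remove the original
  ln -s /ptmp2/biguser /ptmp/biguser     # and leave a symlink in its place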

Nagios warnings for the ESMF are pretty coarse-grained. If you see a red alert from the ESMF, look at /usr/local/nagios/DCS/*.status to identify the real problem. Alternatively, run "/usr/local/nagios/DCS/what-is-wrong?", and it should list everything that's currently wrong with the ESMF, assuming nagios isn't in the middle of checking on something.
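
For example (the single quotes around the script name just keep the shell from treating the trailing "?" as a glob character):

  # Quick look at all of the DCS status files:
  more /usr/local/nagios/DCS/*.status
  # Or ask the summary script directly:
  '/usr/local/nagios/DCS/what-is-wrong?'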

dsh -a is your friend. It's kind of like srsh. You can run "dsh -a df" to get filesystem usage across all of the compute nodes quickly and easily.
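
A few more uses of the same idea (nothing special here - dsh -a just runs an ordinary command on every compute node):

  dsh -a df                     # filesystem usage everywhere
  dsh -a uptime                 # load averages, and a quick "is it alive?" check
  dsh -a 'mount | grep nfs'     # confirm the NFS mounts (e.g. /ptmp) are present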

The compute cluster itself has three networks: the "m" addresses like esmf04m, the "s" addresses like "esmf04s", and the "d" addresses like "esmf04d". The m's are 100BaseT, the d's are 1000BaseT, and the s's are IBM SP switch addresses. The s cables are fragile - don't touch. If there's a problem with one or more "s" addresses, try to unfence, and failing that, call IBM (1-800-IBM-SERV).
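
A sketch of the unfence step, assuming the switch is managed with PSSP and these are run from the control workstation (esmfcws) - verify both commands against the local PSSP setup before relying on them:

  # See which nodes are fenced off the switch (switch_responds is the SDR class):
  SDRGetObjects switch_responds
  # Unfence by node number; "4" here is just an example:
  Eunfence 4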

The esmf has routing discontinuities. From esmf.ess.uci.edu (aka esmf04[mds]), you can reach esmfhmc, esmfcws, esmf0[1-8][mds], and esmft[12]. From esmft[12], you can reach esmfsn0[1-5].

The main applications on the ESMF are loadleveler, the compilers, and CCSM 2 & 3 (aka CSM, of which "CAM" is a part).

If loadleveler has problems, llq -s <jobid> is your friend. Also, sometimes jobs get stuck because they cannot write to a file, or because of an NFS timeout on one of the compute nodes. They can also get stuck because a bad "geometry" (combination) of nodes+CPUs was requested that cannot be satisfied by the currently-available resources of the cluster. You can also llctl -g reconfig to bounce loadleveler. llctl -g stop && llctl -g start should be a last resort, as it appears to restart running jobs from the beginning.
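
A typical escalation might look like this (a sketch; <jobid> is whatever llq shows for the stuck job):

  llq                           # list the queue and find the stuck job's id
  llq -s <jobid>                # detailed reasons why the job isn't starting
  llstatus                      # check that the compute nodes look healthy
  dsh -a df                     # look for full filesystems or hung NFS mounts
  llctl -g reconfig             # gentle cluster-wide bounce of loadleveler
  # Last resort - this appears to restart running jobs from the beginning:
  llctl -g stop && llctl -g start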

The compilers not only need to be functional, but also need to produce repeatable results. If they are upgraded, it is important to announce that in advance, and afterward, so folks will know that their jobs may produce different results than they used to.

CCSM issues are mostly up to the people working with the software. However, a good test of overall system functionality is whether you can compile CCSM 3 and run it.

I have a cobbled version of CCSM at ~strombrg/CCSM3. To actually build and run it, just cd to that directory as strombrg and run "cobble". At the end of the cobble, it'll llsubmit CCSM to loadleveler. To check on the loadleveler job's status, use "llq"; once the job is done, llq won't list it anymore. You can then look at the *std* files to see what the output was. There will also be lots of output under /ptmp/strombrg.
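
Put together, the whole smoke test is roughly this (run as strombrg; "./cobble" below assumes the script lives in that directory, and the exact *std* file names are whatever loadleveler writes for the job):

  cd ~strombrg/CCSM3
  ./cobble              # builds CCSM3, then llsubmits it to loadleveler
  llq                   # watch the job; it drops out of llq when it finishes
  ls *std*              # then read the job's stdout/stderr files
  ls /ptmp/strombrg     # most of the model output lands here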

Folks in a primary group that ends with a "2" (e.g. zender2, frankli2) are not supposed to get as much assistance as folks in a group without the "2". The "2" people are from outside of ESS; the non-2 people are in ESS.

If you need to build an initrd for a new lustre kernel, you can use something like:
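
For example (a sketch, assuming the Lustre servers are RHEL-style Linux boxes with mkinitrd; the version string below is a placeholder for whatever the new Lustre kernel installs under /lib/modules):

  # Replace 2.4.21-XX_lustre with the real kernel version string:
  mkinitrd -f /boot/initrd-2.4.21-XX_lustre.img 2.4.21-XX_lustre
  # then point the boot loader (grub or lilo) at the new kernel and initrd.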

  • For info on rebooting ESMF-related systems, check here.