Looks great Dan. I put some minor changes in < > brackets. Thanks for
taking the notes.

Joseph

> Here are my notes from earlier today. Anything you care to contribute
> to this I'm sure would be valuable.
>
> MPC and GradEA
>
> 2004-09-14 11:02:44 am
> mpc.uci.edu
> mpc.uci.edu/cgi-bin/free-nodes.cgi
>
> ganglia checks mpc nodes in machine room
>
> some queues shut down during the day, running only at night
>
> can call operator to reset nodes in machine room. for nodes outside the
> machine room, call Joseph - it's fussy
>
> pc* and tw* queues are not in machine room
>
> head node can reboot without interrupting jobs
>
> uses PBS. if a node dies with a job on it, pbs may get confused, not
> utilizing other nodes until the down node is reset
>
> sharing: 1 cpu jobs can share a node, 2 cpu jobs cannot <2 cpu jobs can
> share a node or not>
>
> the xeons have hyperthreading
>
> ppn is processes per node
> <pbs>
> sometimes obs accepts jobs but does not run any. other times it won't
> even allow submissions
>
> if pbs is down, qstat cgi won't report anything
>
> remove node to fix pbs once every 8 months or so
>
> all redhat 9 x86/amd64 nodes, will go to rhel later
>
> mpc.uci.edu/running-jobs.html
>
> brian benz sp? can reset tw* nodes <name is Ryan William Benz rbenz at
> uci.edu>
>
> rsync distribution of /local-mirror, nightly, 2am
>
> each node has a different sized disk, so some things in /local-mirror
> will not fit. there are disks as small as 20G
>
> qsub to submit
>
> log into head node, then either grab a node or qsub
>
> mpc.uci.edu/commands.html
>
> private queues are for the owners, owners can also run in public queues
>
> submissions with bad geometry are rejected by qsub
>
> need to add note about bad geom to esmf notes
>
> errors out if, e.g., stderr is not writable
>
> /data 1.8 terabytes, nfs mounted
>
> throughput only as fast as slowest node - assuming homogeneous resource
> utilization
>
> 4 cpus are shared, remainder are exclusive
>
> inconsistent nis from head node to compute nodes recently - may be bad
> data from registrar or nis building scripts. head node saw a superset
> of users on compute nodes <this is for GradEA, not for MPC>
>
> Joseph initiates reboots about once a week to clear up D state NFS
> problems. batch jobs almost always don't need to be restarted
>
> Joseph really likes pbs, even though it's not perfect
>
> thinking about running fedora on compute nodes
>
> compute nodes are iptable'd to disallow incoming connections except from
> mpc. compute nodes can get out.
>
> pc* nodes have a lilo password since they are in labs
>
> mpc.uci.edu/software.html has a list of compilers, among other things.
> Users aren't fussy about compiler upgrades
>
> mpich and LAM use same protocol, but different APIs
>
> linda is a parallelism lib with only four commands. it's a library
> usable from multiple compilers, $5000 and no returns if it doesn't work
>
> grads must have their PI send Joseph e-mail to use mpc
>
> /local/etc/run-all.csh uptime > sorted, host list in script
>
> 'nacs' labeled nodes are opterons
>
> air nodes are Dabdub
>
> mpc-data.nacs has /data
>
> mpc.uci.edu has a dcs account
>
> tw*'s are blades, others are 1u
>
> dell, western scientific, appro hardware
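
For the esmf notes, here is a rough sketch of what a node/ppn request
might look like when written out as a PBS job script. It is only an
illustration in Python, not anything installed on mpc: the function
name pbs_script, the queue name "public", the job name, and the
max_ppn=2 limit are all assumptions made for the example, and the local
geometry check is just a guess at the kind of request qsub would
reject. The #PBS directives themselves (-N, -q, -l nodes=N:ppn=P,
-j oe) and $PBS_O_WORKDIR are standard PBS.

```python
def pbs_script(job_name, queue, nodes, ppn, command, max_ppn=2):
    """Build the text of a PBS job script asking for nodes x ppn processes.

    max_ppn is a hypothetical per-node limit used only for a local sanity
    check; the real limits are whatever qsub on the head node enforces.
    """
    if nodes < 1 or ppn < 1:
        raise ValueError("nodes and ppn must both be at least 1")
    if ppn > max_ppn:
        raise ValueError(f"ppn={ppn} exceeds the assumed limit of {max_ppn}")

    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",                # job name shown by qstat
        f"#PBS -q {queue}",                   # queue to submit to
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # ppn = processes per node
        "#PBS -j oe",                         # merge stderr into stdout
        "cd $PBS_O_WORKDIR",                  # start where qsub was run
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: 2 nodes, 2 processes per node.
    print(pbs_script("demo", "public", nodes=2, ppn=2, command="./a.out"))
```

The printed text could be saved to a file and handed to qsub from the
head node. The check inside the function is only a convenience; qsub
remains the authority on which geometries are accepted.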