Thanks for your valuable additions.

On Wed, 2004-09-15 at 16:27, Joseph Farran wrote:
Looks great Dan.  I put some minor changes in < > brackets.
Thanks for taking the notes.

Joseph


> Here are my notes from earlier today.  Anything you care to contribute
> to this I'm sure would be valuable.
> 
> MPC and GradEA
> 
> 2004-09-14 11:02:44 am
> mpc.uci.edu
> mpc.uci.edu/cgi-bin/free-nodes.cgi
> 
> ganglia monitors the mpc nodes in the machine room
> 
> some queues shut down during the day, running only at night
> 
> can call the operator to reset nodes in the machine room.  for nodes outside the
> machine room, call Joseph; resetting those is fussy
> 
> the nodes behind the pc* and tw* queues are not in the machine room
> 
> head node can reboot without interrupting jobs
> 
> uses PBS.  if a node dies with a job on it, pbs may get confused and stop
> scheduling onto the other nodes until the down node is reset
> 
> sharing: 1 cpu jobs can share a node, 2 cpu jobs cannot
 <2 cpu jobs can share a node or not>
> 
> 
> the xeons have hyperthreading
> 
> ppn is processes per node
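For reference, ppn shows up in the standard PBS resource request, e.g. `qsub -l nodes=4:ppn=2 job.sh` asks for 4 nodes with 2 processes each, 8 in total.  A minimal Python sketch of that arithmetic, assuming the common `nodes=N:ppn=M` form with `+` joining several specs (the function name is just for illustration, it is not part of PBS):

```python
def total_processes(resource: str) -> int:
    """Total processes requested by a PBS-style 'nodes=...' resource string.

    Assumes the 'nodes=N:ppn=M' form, with '+' joining several specs.
    A spec that starts with a hostname instead of a count counts as one
    node; other ':property' tags are ignored.
    """
    if not resource.startswith("nodes="):
        raise ValueError("expected a 'nodes=...' resource string")
    total = 0
    for spec in resource[len("nodes="):].split("+"):
        fields = spec.split(":")
        count = int(fields[0]) if fields[0].isdigit() else 1
        ppn = 1  # one process per node when ppn is omitted
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        total += count * ppn
    return total


print(total_processes("nodes=4:ppn=2"))  # 8
```

On the hyperthreaded xeons, ppn counts what PBS thinks of as processors, which may include the logical (hyperthread) CPUs depending on how the nodes are configured.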
> 
           <pbs>
> sometimes pbs accepts jobs but does not run any.  other times it won't
> even allow submissions
> 
> if pbs is down, qstat cgi won't report anything
> 
> removing a node fixes pbs; needed about once every 8 months or so
> 
> all nodes run redhat 9 (x86/amd64); will move to rhel later
> 
> mpc.uci.edu/running-jobs.html
> 
> brian benz sp? can reset tw* nodes
 <name is Ryan William Benz rbenz at uci.edu>
> 
> rsync distribution of /local-mirror, nightly, 2am
> 
> each node has different sized disk, so some things in /local-mirror will
> not fit.  there are disks as small as 20G
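Since the disks range down to 20G, it can help to check whether a tree will fit before adding to /local-mirror.  A rough Python sketch, assuming you can read the source tree and stat the target node's filesystem; the function names and the 1 GiB default headroom are made up for illustration:

```python
import os
import shutil


def tree_size(path: str) -> int:
    """Total size in bytes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def will_fit(src: str, dest_mount: str, headroom: int = 1 << 30) -> bool:
    """True if src fits in dest_mount's free space with `headroom` bytes to spare."""
    free = shutil.disk_usage(dest_mount).free
    return tree_size(src) + headroom <= free
```

This is conservative: a nightly rsync onto an existing copy only needs room for the changed files, not the whole tree.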
> 
> qsub to submit
> 
> log into head node, then either grab a node or qsub
> 
> mpc.uci.edu/commands.html
> 
> private queues are for the owners, owners can also run in public queues
> 
> submissions with bad geometry are rejected by qsub
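One way to picture "bad geometry": the request asks for more nodes of a given size than the cluster has.  A hypothetical Python check, assuming you know the CPU count of each node; qsub's real test may differ and consider more than this:

```python
from typing import Sequence


def geometry_ok(nodes: int, ppn: int, node_cpus: Sequence[int]) -> bool:
    """True if at least `nodes` nodes each have `ppn` or more CPUs.

    node_cpus is a hypothetical list of CPU counts, one entry per node.
    """
    if nodes < 1 or ppn < 1:
        return False
    big_enough = sum(1 for cpus in node_cpus if cpus >= ppn)
    return big_enough >= nodes


print(geometry_ok(1, 8, [2, 2, 4]))  # False: no node has 8 CPUs
```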
> 
> need to add note about bad geom to esmf notes
> 
> errors out if, e.g., stderr is not writable
> 
> /data: 1.8 terabytes, nfs mounted
> 
> throughput is only as fast as the slowest node, assuming homogeneous resource
> utilization
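A small worked example of the slowest-node effect, assuming the work is split evenly and every node must finish a step before the next one starts; the rates are invented numbers:

```python
from typing import Sequence


def step_time(work_per_node: float, node_rates: Sequence[float]) -> float:
    """Time for one synchronized step: the slowest node sets the pace."""
    return max(work_per_node / rate for rate in node_rates)


rates = [2.0, 2.0, 1.0]           # two fast nodes, one half-speed node
t = step_time(10.0, rates)        # 10.0, not 5.0
effective = 10.0 * len(rates) / t  # 3.0 units/sec, versus sum(rates) == 5.0
print(t, effective)
```

So one half-speed node drags three nodes' worth of hardware down to the throughput of about three slow nodes.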
> 
> 4 cpus are shared, the remainder are exclusive
> 
> recently nis has been inconsistent between the head node and the compute
> nodes, possibly from bad data from the registrar or from the nis building
> scripts.  the head node saw a superset of the users the compute nodes saw
 <this is for GradEA, not for MPC>
> 
> Joseph initiates reboots about once a week to clear up NFS processes stuck
> in D state (uninterruptible sleep).  batch jobs almost never need to be restarted
> 
> Joseph really likes pbs, even though it's not perfect
> 
> thinking about running fedora on compute nodes
> 
> compute nodes use iptables to block incoming connections except from
> mpc.  outgoing connections from compute nodes are allowed.
> 
> pc* nodes have a lilo password since they are in labs
> 
> mpc.uci.edu/software.html has a list of compilers, among other things.
> Users aren't fussy about compiler upgrades
> 
> mpich and LAM both implement the MPI API, but they are separate
> implementations; a program must be built and run with the same one
> 
> linda is a parallelism library with only four commands.  it's usable
> from multiple compilers; it costs $5000, with no returns if it doesn't work
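The four Linda operations are usually given as out, in, rd and eval, all acting on a shared "tuple space".  Below is a toy, in-process Python sketch of that model for illustration only; it is not the commercial Linda product's API, and using None as a wildcard in templates is a simplification of Linda's typed formals:

```python
import threading
from typing import Any, List, Optional, Tuple


class TupleSpace:
    """Toy tuple space with Linda-style out / in / rd / eval."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []
        self._cond = threading.Condition()

    def out(self, *tup: Any) -> None:
        """Add a tuple and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, template: Tuple[Any, ...]) -> Optional[int]:
        """Index of the first tuple matching template (None matches anything)."""
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(template) and all(
                t is None or t == v for t, v in zip(template, tup)
            ):
                return i
        return None

    def _wait_for(self, template: Tuple[Any, ...]) -> int:
        # caller must hold self._cond
        i = self._match(template)
        while i is None:
            self._cond.wait()
            i = self._match(template)
        return i

    def in_(self, *template: Any) -> Tuple[Any, ...]:
        """Block until a match exists, then remove and return it ('in' is a keyword)."""
        with self._cond:
            return self._tuples.pop(self._wait_for(template))

    def rd(self, *template: Any) -> Tuple[Any, ...]:
        """Block until a match exists, then return it without removing it."""
        with self._cond:
            return self._tuples[self._wait_for(template)]

    def eval(self, *fields: Any) -> threading.Thread:
        """Compute callable fields in a new thread, then out() the result."""
        def worker() -> None:
            self.out(*(f() if callable(f) else f for f in fields))
        thread = threading.Thread(target=worker)
        thread.start()
        return thread
```

Usage: `ts.eval("sq", lambda: 7 * 7)` spawns a worker, and `ts.in_("sq", None)` blocks until it returns `("sq", 49)`.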
> 
> grads must have their PI send Joseph e-mail to use mpc
> 
> /local/etc/run-all.csh uptime runs a command (here uptime) on every node;
> the output is sorted, and the host list is in the script
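A hypothetical Python equivalent of that kind of run-on-every-node script, assuming passwordless ssh from the head node; it is not the actual run-all.csh, and the host names in the usage line are placeholders:

```python
import subprocess
from typing import Callable, Iterable, List


def ssh_run(host: str, command: str) -> str:
    """Run command on host over ssh; return its output, or a short error note."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, command],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return "(timed out)"
    return result.stdout.strip() or result.stderr.strip()


def run_all(hosts: Iterable[str], command: str,
            runner: Callable[[str, str], str] = ssh_run) -> List[str]:
    """Run command on every host and return sorted 'host: output' lines."""
    return sorted(f"{host}: {runner(host, command)}" for host in hosts)


# for line in run_all(["pc1", "tw1", "air1"], "uptime"):
#     print(line)
```

The `runner` parameter exists only so the loop can be exercised without real nodes.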
> 
> 'nacs' labeled nodes are opterons
> 
> air nodes are Dabdub's
> 
> mpc-data.nacs has /data
> 
> mpc.uci.edu has a dcs account
> 
> tw* nodes are blades, the others are 1U
> 
> hardware is from Dell, Western Scientific, and Appro
>