Looks great Dan. I put some minor changes in < > brackets.
Thanks for taking the notes.
Joseph
> Here are my notes from earlier today. Anything you care to contribute
> to this I'm sure would be valuable.
>
> MPC and GradEA
>
> 2004-09-14 11:02:44 am
> mpc.uci.edu
> mpc.uci.edu/cgi-bin/free-nodes.cgi
>
> ganglia checks mpc nodes in machine room
>
> some queues shut down during the day, running only at night
>
> can call operator to reset nodes in machine room. for nodes outside the
> machine room, call Joseph - it's fussy
>
> pc* and tw* queues are not in machine room
>
> head node can reboot without interrupting jobs
>
> uses PBS. if a node dies with a job on it, pbs may get confused, not
> utilizing other nodes until the down node is reset
>
> sharing: 1 cpu jobs can share a node, 2 cpu jobs cannot
<2 cpu jobs can share a node or not>
>
>
> the xeons have hyperthreading
>
> ppn is processes per node
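
For reference, a minimal sketch of how a ppn request is usually spelled on a qsub command line. The `-l nodes=N:ppn=M` form is standard PBS/Torque syntax (their documentation describes ppn as processors per node); the helper name `qsub_command` and the script name `job.sh` are made up for illustration, and nothing here reflects mpc's actual queue setup.

```python
def qsub_command(script, nodes, ppn, queue=None):
    """Build a qsub argument list asking for `nodes` nodes with `ppn` processes on each."""
    if nodes < 1 or ppn < 1:
        raise ValueError("nodes and ppn must be positive")
    cmd = ["qsub", "-l", f"nodes={nodes}:ppn={ppn}"]
    if queue is not None:
        # -q selects a queue; the queue name is supplied by the caller.
        cmd += ["-q", queue]
    cmd.append(script)
    return cmd


# Hypothetical two-node, two-processes-per-node request.
print(" ".join(qsub_command("job.sh", nodes=2, ppn=2)))
```

The list form can be passed straight to `subprocess.run` on the head node if you want to submit from a script rather than by hand.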
>
<pbs>
> sometimes obs accepts jobs but does not run any. other times it won't
> even allow submissions
>
> if pbs is down, qstat cgi won't report anything
>
> remove node to fix pbs once every 8 months or so
>
> all redhat 9 x86/amd64 nodes, will go to rhel later
>
> mpc.uci.edu/running-jobs.html
>
> brian benz sp? can reset tw* nodes
<name is Ryan William Benz rbenz at uci.edu>
>
> rsync distribution of /local-mirror, nightly, 2am
>
> each node has different sized disk, so some things in /local-mirror will
> not fit. there are disks as small as 20G
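
Since disk sizes differ per node, a pre-check along these lines could tell you whether a given tree would fit before it is mirrored. This is only a sketch: the function name `fits` and the 5% headroom are assumptions, and the actual nightly rsync job is not shown in these notes.

```python
import shutil


def fits(required_bytes, path="/", headroom=0.05):
    """Return True if `required_bytes` fits on the filesystem holding `path`
    while leaving `headroom` (a fraction of the total size) free."""
    usage = shutil.disk_usage(path)
    return required_bytes <= usage.free - headroom * usage.total
```

On a node you might call `fits(size_of_tree, "/local-mirror")`, where `size_of_tree` is however many bytes the directory to be mirrored occupies on the source.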
>
> qsub to submit
>
> log into head node, then either grab a node or qsub
>
> mpc.uci.edu/commands.html
>
> private queues are for the owners, owners can also run in public queues
>
> submissions with bad geometry are rejected by qsub
>
> need to add note about bad geom to esmf notes
>
> errors out if, eg, stderr is not writeable
>
> /data 1.8 terabytes, nfs mounted
>
> throughput only as fast as slowest node - assuming homogeneous resource
> utilization
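
A small worked example of the slowest-node point, assuming the work is split evenly and the job finishes only when the last node does. The function names and the node rates are invented for illustration, not measurements from mpc.

```python
def job_wall_time(work_units, node_rates):
    """Wall time when work is split evenly across nodes and the job
    ends only when the slowest node finishes its share."""
    share = work_units / len(node_rates)
    return max(share / rate for rate in node_rates)


def effective_throughput(work_units, node_rates):
    """Work completed per unit time for the job as a whole."""
    return work_units / job_wall_time(work_units, node_rates)


# Two nodes at rate 1.0 and one at rate 0.5, sharing 300 units of work.
rates = [1.0, 1.0, 0.5]
print(effective_throughput(300, rates))  # 1.5, versus 2.5 if rates simply added up
```

Under these assumptions the effective rate is the node count times the slowest rate, which is why one slow node drags the whole job down.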
>
> 4 cpus are shared, remainder are exclusive
>
> inconsistent nis from head node to compute nodes recently - may be bad
> data from registrar or nis building scripts. head node saw a superset
> of users on compute nodes
<this is for GradEA, not for MPC>
>
> Joseph initiates reboots about once a week to clear up D state NFS
> problems. batch jobs almost never need to be restarted
>
> Joseph really likes pbs, even though it's not perfect
>
> thinking about running fedora on compute nodes
>
> compute nodes are iptable'd to disallow incoming connections except from
> mpc. compute nodes can get out.
>
> pc* nodes have a lilo password since they are in labs
>
> mpc.uci.edu/software.html has a list of compilers, among other things.
> Users aren't fussy about compiler upgrades
>
> mpich and LAM use same protocol, but different APIs
>
> linda is a parallelism lib with only four commands. it's a library
> usable from multiple compilers, $5000 and no returns if it doesn't work
>
> grads must have their PI send Joseph e-mail to use mpc
>
> /local/etc/run-all.csh uptime
> sorted, host list in script
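
The idea of running one command over a sorted host list can be sketched as below. This is not the contents of run-all.csh, which these notes do not show; the function name `run_all`, the use of ssh, and the 30 second timeout are all assumptions.

```python
import subprocess


def run_all(hosts, command, runner=None):
    """Run `command` on every host in sorted order and return {host: output}."""
    if runner is None:
        def runner(host, cmd):
            # Assumes ssh access to each node without a password prompt.
            result = subprocess.run(
                ["ssh", host, cmd],
                capture_output=True, text=True, timeout=30,
            )
            return result.stdout.strip()
    return {host: runner(host, command) for host in sorted(hosts)}
```

Calling `run_all(host_list, "uptime")` from the head node would then give one uptime line per node, keyed by host name. The `runner` argument exists only so the loop can be tried out without touching real nodes.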
>
> 'nacs' labeled nodes are opterons
>
> air nodes are Dabdub
>
> mpc-data.nacs has /data
>
> mpc.uci.edu has a dcs account
>
> tw*'s are blades, others are 1u
>
> dell, western scientific, appro hardware
>