Memo: 50, exported from J-Pilot on 09/23/04 05:22PM
Category: Unfiled
This memo was marked as: Not Private
----- Start of Memo -----
Lustre class, second day, 2004-09-10 08:46:25 am

sometimes lustre uuid's are hex gibberish (normal), but they decided descriptive text uuid's were easier to work with

mds knows size of file when closed, but not when open.

mds has zeroconf logs

oss does not know filename or file length: only the length of its stripe.  Oss does not know how many stripes

mds precreates objects on oss's, 0 size.  1.2: ~1000 precreated

on mds: state files: last_rcvd, lov_objid

can mount ext3's when lustre is shut down to see how lustre stores files

mds fs layout
last_rcvd, contains last reply data+.  all connected client state here.  if mds sees this, it knows it needs to replay some transactions

pending: rm'd but open files


oss fs layout
client state:
last_rcvd only for uuid on nonfailover setup.  no 'last transaction' data


lustre inode split across mds and oss
striping info in 'extended attributes area' of ext3

acl's coming

not using page cache on servers, using 'direct i/o' instead

oss performs block allocation
less tied to ext3 on ost's than mds, but still tied to it

file creation:
open(O_CREAT)
makes inode on mds
mds uses preallocated objects for lustre file inodes
5000 creations/sec by 1000 clients in same dir

ost's are stateless, no list of in-use files
mds has state: does have a list of open files

dcache caches inode. used to have a list of open files (on oss's?), but they eliminated that

file deletion
client dels on mds, clients also del on oss's
if failure, mds checks its llog (transaction log), sends dels to oss's
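
A toy model of the delete/llog idea above, only to fix the sequence in my head. Everything here is an assumption for illustration: ToyMds, its method names, and the dict-of-sets stand-in for the oss's are made up, and real llog records certainly look different.

```python
class ToyMds:
    """Hypothetical stand-in for the mds side of file deletion."""

    def __init__(self):
        # record id -> (ost name, object id) whose delete is not yet confirmed
        self.llog = {}
        self.next_id = 0

    def unlink(self, objects):
        """Log one pending delete per (ost, object) pair; return the record ids."""
        ids = []
        for ost, obj in objects:
            self.llog[self.next_id] = (ost, obj)
            ids.append(self.next_id)
            self.next_id += 1
        return ids

    def confirm(self, record_id):
        """Drop a record once the delete on the ost is known to have happened."""
        self.llog.pop(record_id, None)

    def replay(self, osts):
        """After a failure, send every delete still in the log to its ost."""
        for record_id, (ost, obj) in list(self.llog.items()):
            osts[ost].discard(obj)
            self.confirm(record_id)


# Two osts holding objects; the client dies after deleting on ost0 only.
osts = {"ost0": {1, 2}, "ost1": {3}}
mds = ToyMds()
ids = mds.unlink([("ost0", 1), ("ost1", 3)])
osts["ost0"].discard(1)   # the client's own delete on ost0 succeeded
mds.confirm(ids[0])
mds.replay(osts)          # mds finishes the delete the client never sent
```

The point of the log is that the delete on ost1 is not lost when the client disappears partway through.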

llogs:
small api
used for replay of dels, more in future
on mds and ost, mostly mds.

coda
intermezzo
lustre

'orphan' in 3 places
del failure
precreated obj's on ost
open and del'd

lconf verbose or -n -v
saved as llog record

re: 170 min ls: it's a bug in older releases of lustre.  glimpse should help.  likely need the nfs mods made to a later release of lustre

lustre doc says stripes must be a multiple of 16k, the largest page size in common use today (ia64).  this preserves capacity for heterogeneity. However, lconf does not appear to enforce this, even though the doc says it does
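
Since lconf does not seem to check this, a quick check of my own before configuring. A minimal sketch; the function name is made up, and the 16k figure is just the one the doc gives.

```python
PAGE_16K = 16 * 1024  # largest common page size per the doc (ia64)


def is_valid_stripe_size(stripe_size):
    """True if stripe_size is a positive multiple of 16k."""
    return stripe_size > 0 and stripe_size % PAGE_16K == 0


# The 512k and 1m sizes both pass; 20000 bytes does not.
checks = [is_valid_stripe_size(n) for n in (512 * 1024, 1024 * 1024, 20000)]
```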


striping:
which ost computed on clients by lov

lfs can set striping on dir or file, size, number.  subdirs probably do not inherit that
no headers, just concatenated
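
How I picture the client-side (lov) calculation, given that objects are just concatenated stripes with no headers. This assumes plain round-robin striping across the objects, which the class did not spell out; the function name and argument names are made up.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, offset inside that stripe's object).

    Assumes round-robin: chunk 0 goes to object 0, chunk 1 to object 1, and so
    on, wrapping around after stripe_count chunks.
    """
    chunk = offset // stripe_size           # which stripe-sized chunk of the file
    index = chunk % stripe_count            # which object holds that chunk
    row = chunk // stripe_count             # how many full rounds came before it
    return index, row * stripe_size + offset % stripe_size


MB = 1 << 20
# Byte 5 of the fifth 1m chunk, striped over 4 objects, wraps back to object 0.
where = locate(4 * MB + 5, MB, 4)
```

Under this model an oss only ever sees its own object's length, which matches the earlier note that it does not know the file length or the stripe count.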

when to stripe
better aggregate thruput with mult clients
good for big shared files

don't stripe to minimize latency.

1.2: 512k
1.4: 1m
5-6 transfers in flight

metadata: intent based, fewer distrib locks, fewer rpc's

gather mult ops into one rpc


recovery and replay
if client locks a file and crashes, lock is released

mds goes into recovery mode, only performs recovery, no new transactions.  then later does new stuff

1.4 upcalls not needed for failover, but you can still have them

failout mode only for ost's
-EIO
not really recommended

when create ost, spec --failover
a little overhead added
clients hang instead of getting errors


troubleshooting
messages file
can grep for Lustre: and LustreError:
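
Same idea as the grep, as a small filter I can reuse in scripts. A sketch only: the function name is made up, and it assumes the two prefixes appear literally in the messages file.

```python
import re

# Matches either "Lustre:" or "LustreError:" anywhere in a line.
LUSTRE_TAG = re.compile(r"Lustre(Error)?:")


def lustre_lines(lines):
    """Return only the lines carrying a Lustre: or LustreError: prefix."""
    return [line for line in lines if LUSTRE_TAG.search(line)]
```

Usage would be something like lustre_lines(open(path)) on whatever the messages file is on that node.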

5m circular buffer of lustre logs

faster debug perf now, but turning it off is still good for benchmarking

sysctl -w portals.debug=0
0 is nothing
-1 is everything
high 8 bits are subsystem
low 24 bits are debug mask
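
To keep the bit layout straight, a sketch that splits a debug value the way it was described in class (high 8 bits subsystem, low 24 bits debug mask). The function names are mine, and I have not checked the layout against the lustre source.

```python
MASK_BITS = 24          # low 24 bits: debug mask (per class notes)
SUBSYSTEM_BITS = 8      # high 8 bits: subsystem (per class notes)


def split_debug_value(value):
    """Split a 32-bit debug value into (subsystem bits, debug mask bits)."""
    value &= 0xFFFFFFFF                     # so -1 becomes all 32 bits set
    subsystem = value >> MASK_BITS
    debug_mask = value & ((1 << MASK_BITS) - 1)
    return subsystem, debug_mask


def join_debug_value(subsystem, debug_mask):
    """Inverse of split_debug_value; extra bits in either argument are dropped."""
    subsystem &= (1 << SUBSYSTEM_BITS) - 1
    debug_mask &= (1 << MASK_BITS) - 1
    return (subsystem << MASK_BITS) | debug_mask


everything = split_debug_value(-1)   # every subsystem bit and every mask bit set
nothing = split_debug_value(0)       # no bits set at all
```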

lctl debug_kernel writes the lustre log to stdout or a file

lctl clear 
clear lustre kernel log

debug daemon can constantly flush kernel logs to a file.  can fill up a file very quickly

lctl debug_daemon ...

e2fsck with extended attribute patches can be used
lfsck.  uses e2fsck.  lfsck still in development?

buffalo.clusterfs.org/com
testing lustre
iozone

echo_client
test lower layers of lustre
test bandwidth
no client fs

echo_server

leak finder, a perl script

lbug: /tmp lustre log, binary, unsorted, lctl to read, similar to debug_kernel

they have a tcpdump that knows about portals

https://bugzilla.lustre.org/
search for bugs

https://wiki.clusterfs.com/lustre/BugFiling

tools
lctl calls ioctl's
initiate recovery

lctl device_list
not an ioctl
comes from devices under /proc

lctl --device 6 deactivate
lctl --device 6 activate
number, name, uuid
on ost, ignore a failed device for a while

lfs find ... dir or file
lfs getstripe file
lfs setstripe filename size 0 1
lfs find file

lfsck, used after dataloss on mds or oss
use lfsck on client
still need e2fsck

scan mds and create mdsdb with mod'd e2fsck
scan oss and create ossdb
mount lustre fs
run lfsck on mtpt using db files

e2fsck -f -y --mdsdb /tmp/mdsdb /dev/sdb1
oss
run on -mtpt-, feeding mdsdb and ossdb (ostdb?)

orphans in lost+found
no unaccounted storage
some files may have empty objects
not sure how to list files with empty objects

1.4 for customers only 2004-09-10 11:18:56 am, 1.2 opensource

1.3, 1.4 beta UCI using.  What we have is a branch.  It is not going to become 1.4.  Nic indicates that what we're seeing - 170 min ls - is not a glimpse issue, and does still occur in modern lustre.  He said it is probably a VFS problem.  He also indicated that he thought all versions of lustre had glimpse in them.

opteron 3ware 100mB/s
our perf is poor

llanalyze is the perl script that Robert didn't like that much

llanalyze --rpctrace
does not appear to work
set portals debug to -1 and just grep for RPC.  opc ... appears to have either program name or fs function name

to join two rpc logs, cat and sort with -t: -k4 (or +3)
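
The cat-and-sort trick above, approximated in python for when the logs need more processing afterwards. A sketch: the function names are mine, the key is everything from the 4th colon-separated field to the end of the line (what -k4 means), and comparison is plain string order rather than locale order. I am not assuming what field 4 actually holds.

```python
def field4_key(line):
    """Sort key like sort -t: -k4: field 4 through end of line."""
    fields = line.rstrip("\n").split(":")
    return ":".join(fields[3:])          # empty string if fewer than 4 fields


def join_logs(*logs):
    """Concatenate several iterables of log lines and order them by field 4."""
    merged = [line for log in logs for line in log]
    return sorted(merged, key=field4_key)
```

Unlike sort, ties keep their input order here instead of falling back to a whole-line comparison.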

ldlm  prefix on lustre locking functions

lconf has kernel debug flags.  llanalyze does too, but may be incorrect

lconf -nv --ptldebug rpctrace+page
adds bits for... debug_kernel

echo client/echo server not that important, mostly for testing new NAL's. 

/usr/lib/lustre/examples/*









----- End of Memo -----