Memo: 49, exported from J-Pilot on 09/23/04 05:22PM
Category: Unfiled
This memo was marked as: Not Private
----- Start of Memo -----
Lustre class, first day, 2004-09-09 09:10:59 am

1000's of clients today, 10000's later

failover: shared storage

OSS: object storage server, OST is the old name

logical object volume: LOV

clients talk directly to mds and oss's; only a small amount of communication between oss's and mds

mds, mdt synonyms
oss: single node
ost: one lun
one oss can have multiple ost's
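for example (my numbers, nothing from the class): one oss node could export two ost's, ost1 on /dev/sdb and ost2 on /dev/sdc, each a separate lun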

ext3 for backend storage

failover for both mds and oss's, both require shared storage

opterons: better bus, pci-x, larger files

opteron seeing decent numbers but "I don't think we've reached production on that code"

2T limit imposed by ext3

infiniband next target, at least a couple of months off

tcp: hard to saturate dual gig-e on 32 bit.  elan3: less cpu usage, higher bandwidth

elan4 (on ia64) faster than 10gig-e on opteron

firewire theoretically workable for shared storage, but cfs tried it and it didn't work

more stripes, more space needed on mds.  usually mds is 3 percent of oss aggregate size
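for example (my arithmetic, not from the class): 10T of aggregate oss space at 3 percent works out to roughly a 300G mds device; heavier striping pushes that up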

cfs believes mkfs does not increase journal size with filesystem size fast enough, so they increase it faster

mds only cares about file size and which oss has the file.  block allocation done on oss.  oss cares about file attributes except for the blocks

limits imposed by linux and ext3

vfs change needed to get lustre into linus' 2.6

different kernels on nodes ok, but should have same rev of lustre patch

2.6 version usable, but not tested as well as 2.4

clients don't need python anymore, servers do

limits imposed by ext3; they used to modify ext3 ('extN'), but use plain ext3 now

block numbers: 32 bits * 4k block size (actually half that) gives the maximum addressable size

ext3: max fs size 2T, max file size 2T

opensource release trails version given to customers

ext4 not expected to have better limits.  used to support reiser.  ext3 only now.  could in theory support any journaling fs

1.2, 1.4 limited to 200 oss's.  later lustre releases should allow more oss's

if you want a file bigger than 2T, you must stripe it across multiple ost's
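for example (my arithmetic): a 5T file needs a stripe count of at least 3, since each object lives in a single ost's ext3 filesystem and so can't exceed 2T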

64 bit: 'un'limited.  bigger directory cache = faster.  much faster I/O

High Availability:
redhat 'cluster manager'
suse 'heartbeat'

2.6: 4T ?

portals from sandia; NAL: network abstraction layer

portals stack, bottom to top:
vendor net driver
portals NAL
portals lib
net I/O API
lustre rpc's

support multiple physical network types due to nal/portals

zero-copy I/O

nodes are identified by a single 'NID', a network-neutral node id.  usually hostname or elan id is used

channel bonding.  portals is channel bonding-aware, round robin.  portals ch bonding more effective than that in linux, also smarter about dealing with failures

doing it via linux means ch bonding works for more than just portals, and is knowledge you're more likely to reuse
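rough sketch of the plain-linux route (my example, 2.4-era syntax; interface names and the address are made up):

# /etc/modules.conf
alias bond0 bonding
options bond0 mode=0 miimon=100    # mode 0 = round robin, check link state every 100ms

# enslave two gig-e ports and bring the bond up
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1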

failures detected with timeouts
clients attempt to reconnect 'aggressively'

servers will evict nonresponsive clients

1.4 does not require 'upcalls' anymore, 1.2 does

lustre highly modularized
lots of modules

plug in another filesystem?  do it at lvfs layer

in theory, if client and oss same machine, client could talk directly to
obd layer, but cfs hasn't done this in a long time

people usually use the largest ost's they can, less than 2T though

oss manages only the file data, objects are stored as files in an ext3 fs

eventually will distribute file metadata across multiple mds's - in 2.0?

mds stores file attributes (not file data) on ext3 fs.  handles all new file creation

mds tells oss two things:
1 create n files
2 delete all files with a number greater than m - makes crash recovery easier

file numbers are not reused
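illustration (mine): if the mds has handed out object numbers only up to 100 when an oss crashes, it can tell the oss to delete everything above 100 on recovery; since numbers are never reused, nothing still in use gets confused with the orphans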

kernel patch, user space tools, /proc interface


configuration:
lmt: new tool that makes config easier, web interface, a few releases have been made, might not be made opensource.  we're not going to discuss lmt

net type, nodes; for clients nothing node-specific, just the network type

config in xml
nfs, ldap for distributing configs, ldap going away, used only for complex failover scenarios

zeroconf used on clients, no xml needed

lconf is in python; input is ldap or xml

in future zeroconf on servers:
mount -t ostfs /dev/...
mount -t mdsfs /dev/...


on clients:
zeroconf:
mount -t lustre server://config /mnt/lustre
requires /sbin/mount.lustre

lconf - one entry for all clients

lmc: lustre make config
lconf
lctl

add one element to config via lmc at a time

examples:
orca: mds
grey, blue: oss's
clients

do nets
do services (mds first)
add lov (1.4: 1M stripe size)
stripe count: 0 means 'across all OSS's'
an lov is associated with an mds, normally one lov per mds.  you can have multiple lov's for an mds, but that means multiple config files

add clients, add ost not oss, specify an lov
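my attempt at the full lmc sequence for this example (flag names from memory of the 1.x tools, device paths invented -- verify before trusting any of it):

lmc -o config.xml --add net --node orca --nid orca --nettype tcp
lmc -m config.xml --add net --node grey --nid grey --nettype tcp
lmc -m config.xml --add net --node blue --nid blue --nettype tcp
lmc -m config.xml --add net --node client --nid '*' --nettype tcp
lmc -m config.xml --add mds --node orca --mds mds1 --fstype ext3 --dev /dev/sda1
lmc -m config.xml --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node grey --ost ost1 --lov lov1 --fstype ext3 --dev /dev/sdb1
lmc -m config.xml --add ost --node blue --ost ost2 --lov lov1 --fstype ext3 --dev /dev/sdb1
lmc -m config.xml --add mtpt --node client --path /mnt/lustre --mds mds1 --lov lov1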

apps write asynch, writes on oss's are synchronous

zeroconf requires modules.conf, done by sysad

multiple fs's: multiple xml files and lov's (multiple mds's?)

if config changes, update with lconf on inactive mds

ldap can keep track of what nodes are up, in failover

starting lustre
start oss's in parallel
start mds
start clients in parallel

shut down in reverse order

lconf --cleanup
to shutdown lustre
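concretely, for the example cluster (node names from the example above; combining --cleanup with --node is my assumption, verify first):

lconf --node grey config.xml     # oss's first, can run in parallel
lconf --node blue config.xml
lconf --node orca config.xml     # then the mds
lconf --node client config.xml   # then each client

lconf --cleanup --node client config.xml   # shutdown: same commands in reverse order, with --cleanup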

lconf --reformat config.xml
checks only that data isn't mounted
may need to specify ldap server

lconf --write-conf config.xml
...if config changes, eg adding an oss
remount clients.  add ost too?
done on mds

lconf --node client config.xml
client is literal, not metasyntax

mount -t lustre orca://mds1/client /mnt/lustre
no lconf or python needed
need mds set up
need modules.conf
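my assumption, not stated in class: since this is an ordinary mount -t lustre (with /sbin/mount.lustre as the helper), the same mount should be expressible in /etc/fstab, e.g.:

orca://mds1/client  /mnt/lustre  lustre  defaults  0 0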

stop client: zeroconf: just umount
otherwise use lconf --cleanup

zeroconf on client is faster, lconf is slow due to xml parsing
lconf needed for fancy stuff with portals


hands on, first day:
umlscreen, 5 of them
working on redhat 8 in the uml's

rpm -i
4suite
python-ldap


shutdown mds first, oss's, clients
lconf --cleanup

lconf --node client config.xml

bring up ost's before mds, otherwise mds will wait on ost's

lconf --cleanup --force
if a client dies

tunctl -d to clean up tap devices when using UML

lconf oss's first, then mds, then client

ltuml -n 1
ltuml -n 2

failover:
first mds added will be what gets used
need shared storage for failover

stop failover node:
lconf --cleanup --force --failover
simulates crash

--group limits cleanup
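e.g. (my guess at the combined syntax, the group name is invented):

lconf --cleanup --force --failover --group ost_grey config.xml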

ex:
orca2 is failover mds
blue, gray remain oss's

when adding failover mds, add net for it, use same mds name, different node name

failover ost: same ost name, different node name, --group

important: only one node is active at a time
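sketch of adding the failover mds for the example (same lmc flags as the sequence earlier, same caveats; the shared device path is invented):

lmc -m config.xml --add net --node orca2 --nid orca2 --nettype tcp
lmc -m config.xml --add mds --node orca2 --mds mds1 --fstype ext3 --dev /dev/shared_mds    # same mds name, different node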

passive failover:
lconf --select or lactive
--select --group (limit lconf to a specific set of devices)

upcall... lconf --recover... uuid's

disconnect, connect to new node, lctl recover

stripe size needs to be a multiple of page size?  65536 ok, 65535 has serious problems

debugfs ost1
ls


----- End of Memo -----