Lustre class, first day, 2004-09-09 09:10:59 am

1000's of clients today, 10000's later
failover: shared storage
OSS: object storage server; OST is the old name
logical object volume: LOV
clients talk directly to mds and oss's; small amount of communication between oss's and the mds
mds, mdt are synonyms
oss: single node
ost: one lun
one oss can have multiple ost's
ext3 for backend storage
failover for both mds and oss's; both require shared storage
opterons: better bus, pci-x, larger files
opteron seeing decent numbers, but "I don't think we've reached production on that code"
2T limit imposed by ext3
infiniband is the next target, at least a couple of months off
tcp: hard to saturate dual gig-e on 32 bit
elan3: less cpu usage, higher bandwidth
elan4 (on ia64) faster than 10gig-e on opteron
firewire theoretically workable for shared storage, but cfs tried it and it didn't work
more stripes, more space needed on the mds; usually the mds is 3 percent of the aggregate oss size
cfs believes mkfs does not grow the journal size fast enough with filesystem size, so they grow it faster
mds only cares about file size and which oss has the file; block allocation is done on the oss; oss cares about file attributes except for the blocks
limits imposed by linux and ext3
vfs change needed to get lustre into linus' 2.6
different kernels on nodes ok, but should have the same rev of the lustre patch
2.6 version usable, but not tested as well as 2.4
clients don't need python anymore, servers do
limits imposed by ext3; they modified ext3 ('extN'), but use plain ext3 now
block numbers: 32 bits * 4k (actually half that) is the maximum address size
ext3: max fs size 2T, max file size 2T
opensource release trails the version given to customers
ext4 not expected to have better limits
used to support reiser; ext3 only now; could in theory support any journaling fs
1.2, 1.4 limited to 200 oss's; later lustre releases should allow more oss's
if you want a file bigger than 2T, you must stripe it across multiple ost's (see the worked example below)
64 bit: 'un'limited; bigger directory cache = faster, much faster I/O
high availability: redhat 'cluster manager', suse 'heartbeat'
2.6: 4T?

portals from sandia; NAL (network abstraction layer)
network stack, bottom up: vendor net driver, portals NAL, portals lib, net I/O API
lustre rpc's support multiple physical network types thanks to portals/NALs
zero-copy I/O
nodes are identified by a single 'NID', a network-neutral node id; usually the hostname or elan id is used
channel bonding: portals is channel-bonding-aware, round robin; portals channel bonding is more effective than linux's, and smarter about dealing with failures
doing it via linux means channel bonding works for more than just portals, and is knowledge you're more likely to reuse
failures detected with timeouts
clients attempt to reconnect 'aggressively'
servers will evict nonresponsive clients
1.4 does not require 'upcalls' anymore, 1.2 does

lustre is highly modularized, lots of modules
plug in another filesystem? do it at the lvfs layer
in theory, if the client and oss are the same machine, the client could talk directly to the obd layer, but cfs hasn't done this in a long time
people usually use the largest ost's they can, less than 2T though
oss manages only the file data; objects are stored as files in an ext3 fs
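A worked example of the striping note above - my own arithmetic, not something stated in class: each object is a regular file on an ext3-backed ost, so a single object is capped by ext3's 2T file-size limit, and striping a file across several ost's raises the ceiling roughly in proportion:

  $\text{max file size} \approx \text{stripe count} \times 2\,\mathrm{TB}$, e.g. $4 \times 2\,\mathrm{TB} = 8\,\mathrm{TB}$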
eventually file metadata will be distributed across multiple mds's - in 2.0?
mds stores file attributes (not file data) on an ext3 fs; handles all new file creation
mds tells the oss two things: 1) create n files, 2) delete all files with a number greater than m - makes crash recovery easier
file numbers are not reused
kernel patch, user space tools, /proc interface

configuration:
lmt: new tool that makes config easier, web interface; a few releases have been made, might not be made opensource; we're not going to discuss lmt
net type, nodes, clients; nothing specific, just the network type
config is in xml
nfs, ldap for distributing configs; ldap going away, used only for complex failover scenarios
zeroconf used on clients, no xml needed
lconf is in python; input ldap, xml in future
zeroconf on servers: mount -t ostfs /dev/... and mount -t mdsfs /dev/...
on clients, zeroconf: mount -t lustre server://config /mnt/lustre (requires /sbin/mount.lustre)
lconf - one entry for all clients
tools: lmc (lustre make config), lconf, lctl
add one element to the config via lmc at a time
example: orca: mds; grey, blue: oss's; clients
do nets, do services (mds first), add the lov (1.4: 1M stripe size; stripe count 0 says 'across all OSS's') - see the lmc sketch at the end of these notes
lov is associated with an mds, one lov per mds; you can have multiple lov's for an mds, but that means multiple config files
add clients; add the ost (not the oss), specifying an lov
apps write asynchronously; writes on the oss's are synchronous
zeroconf requires modules.conf, done by the sysadmin
multiple fs's: multiple xml files and lov's (mds's?)
if the config changes, update with lconf on the inactive mds
ldap can keep track of what nodes are up, in failover

starting lustre: start oss's in parallel, start mds, start clients in parallel; shut down in reverse order (see the startup sketch at the end of these notes)
lconf --cleanup to shut down lustre
lconf --reformat config.xml - checks only that the data isn't mounted; may need to specify the ldap server
lconf --write-conf config.xml ...if the config changes, eg adding an oss; remount clients; add ost too? done on the mds
lconf --node client config.xml - 'client' is literal, not metasyntax
mount -t lustre orca://mds1/client /mnt/lustre - no lconf or python needed; need the mds set up; need modules.conf
stop a client: zeroconf: just umount; otherwise use lconf --cleanup
zeroconf on the client is faster, lconf is slow due to xml parsing
lconf needed for fancy stuff with portals

hands on, first day:
umlscreen, 5 of them; working on redhat 8 in the uml's
rpm -i 4suite python-ldap
shutdown mds first, oss's, clients
lconf --cleanup
lconf --node client config.xml
bring up ost's before the mds, otherwise the mds will wait on the ost's
lconf --cleanup --force if a client dies
tunctl -d to clean up tap devices when using UML
lconf oss's first, then mds, then client
ltuml -n 1
ltuml -n 2

failover:
first mds added will be what gets used
need shared storage for failover
stop the failover node: lconf --cleanup --force --failover (simulates a crash); --group limits cleanup
ex: orca2 is the failover mds; blue, gray remain oss's
when adding a failover mds, add a net for it, use the same mds name, different node name
failover ost: same ost name, different node name; --group important
only one node is active at a time
passive failover: lconf --select or lactive
--select --group (limit lconf to a specific set of devices)
upcall: lconf --recover... uuid's; disconnect, connect to the new node, lctl recover
stripe size needs to be a multiple of the page size? 65536 ok, 65535 has serious problems
debugfs ost1, ls
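A sketch of the lmc sequence for the class example above (orca as mds, blue and grey as oss's, an lov striping 1M across all ost's, one generic 'client' entry). This is my reconstruction from 1.x-era lmc usage, not the instructor's actual script; the nids, the /dev paths and the exact flag spellings are assumptions to check against lmc --help.

  # nets for every node (nids assumed to equal the hostnames; '*' = any client node)
  lmc -m config.xml --add net --node orca --nid orca --nettype tcp
  lmc -m config.xml --add net --node blue --nid blue --nettype tcp
  lmc -m config.xml --add net --node grey --nid grey --nettype tcp
  lmc -m config.xml --add net --node client --nid '*' --nettype tcp

  # services, mds first (/dev paths are placeholders)
  lmc -m config.xml --add mds --node orca --mds mds1 --fstype ext3 --dev /dev/sda1

  # lov tied to the mds: 1M stripe size, stripe count 0 = stripe across all ost's
  lmc -m config.xml --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0

  # ost's (not oss's), one per oss node, each tied to the lov
  lmc -m config.xml --add ost --node blue --ost ost1 --lov lov1 --fstype ext3 --dev /dev/sdb1
  lmc -m config.xml --add ost --node grey --ost ost2 --lov lov1 --fstype ext3 --dev /dev/sdb1

  # one mount-point entry covers all clients (the literal node name 'client')
  lmc -m config.xml --add mtpt --node client --path /mnt/lustre --mds mds1 --lov lov1

lconf then reads config.xml on each node and brings up whichever devices belong to that node name.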
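And a sketch of the start/stop ordering described above (ost's before the mds, clients last; reverse to shut down), again a reconstruction - the way --node, --reformat and --cleanup are combined here is my assumption based on the commands quoted in the notes.

  # first bring-up: format and start the ost's before the mds (oss's in parallel in practice)
  lconf --reformat --node blue config.xml     # oss; --reformat only the first time
  lconf --reformat --node grey config.xml     # oss
  lconf --reformat --node orca config.xml     # mds, once the ost's are up

  # clients, either through lconf ('client' is the literal node name) ...
  lconf --node client config.xml
  # ... or zeroconf: no python/xml needed, but modules.conf and /sbin/mount.lustre must be in place
  mount -t lustre orca://mds1/client /mnt/lustre

  # shutdown in reverse order: clients, then mds, then oss's
  umount /mnt/lustre                          # zeroconf client
  lconf --cleanup --node client config.xml    # lconf client
  lconf --cleanup --node orca config.xml      # mds
  lconf --cleanup --node blue config.xml      # oss
  lconf --cleanup --node grey config.xml      # oss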