Memo: 25, exported from J-Pilot on 10/13/04 03:16PM
Category: Unfiled
This memo was marked as: Not Private
----- Start of Memo -----
2004-10-13 Lustre presentation to CCS

Feel free to ask questions, even hard ones that I may not know the answer to.

I'll be talking about Lustre, which is available from a company named ClusterFS, aka CFS.

What I've prepared is broken down into 10 sections.

1) The motivation for running Lustre
2) Related technologies
3) Licensing terms and availability
4) General structure and terminology
5) An overview of how to configure Lustre
6) Starting up and shutting down Lustre
7) Things to watch out for when using Lustre, including filesystem limits
8) Optimizations used by Lustre for performance
9) Resources for learning more about Lustre
10) Table of contents from the Lustre Operations Manual

-----

Motivation:

Size:

ClusterFS claims that they can not only break through the 2 terabyte limit of Linux 2.4 block devices, but also the 16 terabyte limit of the Linux buffer cache.  They also claim to be able to improve speed by striping across systems (to a point - large stripe counts increase latency).  They do this by aggregating multiple systems' disks into a single filesystem.

Performance:

On the ESMF test Lustre filesystem, ClusterFS was seeing 30-60 megabytes/sec, and regarded that as overly slow for the hardware being used (most of the systems are 2-CPU 3GHz Xeons with 3ware 8506 RAID cards and a bunch of Maxtor 251G SATA disks).  ClusterFS indicates that with this hardware, they would expect to see more like 150 megabytes/sec from two OSS's like ours.  I'll define OSS in a moment.  The ESMF has a handful of networks, but the relevant one is gigabit ethernet.

Lustre's freshmeat page, at http://freshmeat.net/projects/lustre/, says:

Lustre is a novel storage and filesystem architecture and implementation suitable for very large clusters. It is a next-generation cluster filesystem which can serve clusters with tens of thousands of nodes, petabytes of storage, move hundreds of GB/sec with state-of-the-art security and management infrastructure.

-----

Related technologies:

I've experimented with nbd and enbd; md and lvm2; ext3, jfs, reiserfs and xfs; and Linux 2.4 and Linux 2.6; no combination of these technologies was stable when used to create a 2.7 terabyte filesystem from 3 slightly-less-than-1-terabyte slices.  However, if you remove nbd and enbd from the picture, lvm2 may work well on Linux 2.6.x for breaking the 2 terabyte barrier, but most likely not the 16 terabyte barrier, as I'm -guessing- that LVM2 uses the Linux buffer cache - in which case your maximum filesystem size becomes a matter of how many big disks you can cram into one system, or 16 terabytes, whichever is smaller.  BTW, I posed the question "Can LVM2 surpass 16 terabytes?" on the Linux LVM mailing list, and got 0 responses.  I have a web page documenting these attempts at http://dcs.nac.uc.edu/~dstromberg/nbd.html.

GFS, which Red Hat recently purchased from Sistina, is another distributed filesystem.  However, it is limited to 2 terabytes on RHEL 3, and we're told that while this restriction will be lifted, it will only be raised to 16 terabytes.  It has also proved fairly unstable, even with slices under 2 terabytes - possibly instability under heavy load.  Also, the systems that we are running GFS on are SMP, which can expose race conditions in less-used portions of Linux.  Francisco is a good person to ask about GFS issues.

Coda and InterMezzo are also somewhat related.  I am not very familiar with them, but I'll point out that they were written by the developers of Lustre.

-----

Licensing terms and availability of Lustre:

Lustre 1.4 is only available to customers as of September 2004, and has not moved out of beta.  Lustre 1.2 is GPL'd.

Lustre is -only- available for Linux 2.4 (and perhaps older kernels) at this time.  They're working on Linux 2.6 support, but it isn't in production yet.  Other OS's are not supported, and I've heard nothing of plans to support them, either as clients or as servers.

Sometimes you can reexport your Lustre filesystem over another protocol to get around this linux-only limitation.  More on that later.

-----

General structure and terminology of Lustre:

Disks holding user data are called "OST's", or "object storage targets".

Systems to which these user-data disks are attached are called "OSS's", or "object storage servers".  You can have multiple OST's per OSS.  Generally, larger OST's are better for performance, but on the 2.4 kernel at least, ext3 imposes a 2 terabyte limit on the size of an OST.

Some people use the acronyms "OSS" and "OST" interchangeably, but strictly speaking, they are different things.

The system that holds file metadata is called the "MDS", or "metadata server" - if you think of inodes, that's basically what this is.  Presently, you can only have one MDS per Lustre filesystem, but CFS plans to allow multiple MDS's in a future release.  I do not know if the multiple MDS's will be for performance, availability, or both.  Multiple MDS support is likely to be usable to gain more than one node's worth of MDS storage space.  To my knowledge, disks on the MDS are not called OST's.

The systems that make use of the Lustre filesystem are just called "Lustre clients".  The Lustre clients speak directly to the OST's and the MDS.

It's best to separate these functions, so that no one machine has two Lustre roles; otherwise you may experience race conditions, and Bad Things.

The OST's are actually ext3 filesystems behind the scenes, but they are not mounted in the usual sense; they are only accessed through kernel API's, instead of the usual userspace API's.  If your Lustre filesystem is shut down, you can semi-safely mount them manually and see what's in them.  However, one file in these ext3 filesystems does not necessarily correspond to one file in the Lustre filesystem.
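
If you do poke at one of these ext3 filesystems by hand (only with Lustre fully shut down), a read-only mount is the safer way to look around.  This is just a sketch; the device path and mount point are hypothetical:

  # Inspect an OST's backing ext3 filesystem directly - hypothetical device path.
  mkdir -p /mnt/ost-inspect
  mount -t ext3 -o ro /dev/sdb1 /mnt/ost-inspect
  ls -lR /mnt/ost-inspect | less
  umount /mnt/ost-inspect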

There are also LOV's, which are logical object volumes.  You can just use one.  If you don't create one, one will be created for you automatically.  Our class did not cover multiple LOV configs.  You can use multiple LOV's to get multiple Lustre filesystems out of the same set of Lustre servers.

ClusterFS used to use a customized version of ext3, but I believe they've merged their changes back into the mainline kernels, so they have been able to move to unadorned ext3.

Data is striped across the OST's.  For a given file with a stripe count of 2, the first n bytes go into an ext3 file on the first OST, the next n bytes into an ext3 file on the second OST, and the next n bytes back into the ext3 file on the first OST, and so on.
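
To make the round-robin arithmetic concrete, here's a tiny sketch (not Lustre code, just the arithmetic described above), assuming a hypothetical 512k stripe size and a stripe count of 2:

  # Which OST does a given byte offset land on, under simple round-robin striping?
  stripe_size=$((512 * 1024))   # hypothetical 512k stripe size
  stripe_count=2                # hypothetical stripe count
  for offset in 0 524288 1048576 1572864; do
      ost=$(( (offset / stripe_size) % stripe_count ))
      echo "byte offset $offset -> OST $ost"
  done

This prints OST 0, 1, 0, 1 - i.e., the file alternates between the two OST's every 512k.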

-----

Configuring Lustre:

1) You create a shell script that tells Lustre how it is configured.

2) Your script invokes the lmc command multiple times to create a single config.xml.  You can throw away this file at any time and recreate it from your shell script.  It's best not to edit config.xml directly.  lmc itself is a wrapper around lctl.

3) This config file should be shared across all of the machines participating in the Lustre cluster (unless you're using zeroconf, in which case you don't need config.xml on the clients, but I'm not covering zeroconf).  The file goes in /etc/lustre/config.xml.

4) config.xml holds all of the information about the Lustre cluster: "--node" entries (one for each of the OSS's, one for the MDS, and one shared entry for all clients), which host is the MDS (--mds), the OST's (--ost), an LOV (--lov - optional, but my examples all create one; if you do not create one, one will be created for you), and a single client entry with --node client and --nid '*' that represents all of your Lustre clients.

5) To start up Lustre, you feed your config.xml to lconf on each system in the Lustre cluster.

6) If you have to make changes to your Lustre cluster, regenerate your config.xml, and then feed it to lconf on your -inactive- MDS.

I'll talk about where you can find example scripts later when I talk about lustre resources.
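
As a rough sketch of the kind of shell script I mean - hedged, since the exact lmc option names can vary between Lustre releases and should be double-checked against the Operations Manual, and the hostnames, device paths, and mount point here are made up:

  #!/bin/sh
  # Hypothetical config-generation script; it builds config.xml with lmc.
  config=config.xml
  rm -f $config

  # One --node (plus a network entry) for the MDS and for each OSS.
  lmc -o $config --add node --node mds1-host
  lmc -m $config --add net --node mds1-host --nid mds1-host --nettype tcp
  for node in oss1-host oss2-host; do
      lmc -m $config --add node --node $node
      lmc -m $config --add net --node $node --nid $node --nettype tcp
  done

  # The metadata server.
  lmc -m $config --add mds --node mds1-host --mds mds1 --fstype ext3 --dev /dev/sda3

  # An LOV; optional, since one would be created for you anyway.
  lmc -m $config --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 2 --stripe_pattern 0

  # One OST per OSS in this sketch; you can have several per OSS.
  lmc -m $config --add ost --node oss1-host --lov lov1 --ost ost1 --fstype ext3 --dev /dev/sdb1
  lmc -m $config --add ost --node oss2-host --lov lov1 --ost ost2 --fstype ext3 --dev /dev/sdb1

  # A single client entry with --nid '*', representing all of the Lustre clients.
  lmc -m $config --add node --node client
  lmc -m $config --add net --node client --nid '*' --nettype tcp
  lmc -m $config --add mtpt --node client --path /mnt/lustre --mds mds1 --lov lov1

Rerunning the script regenerates config.xml from scratch, which is what makes it safe to throw config.xml away at any time.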

-----

Starting up and shutting down:

Starting Lustre:
1) Start the OSS's (in parallel is fine).
2) Start the MDS.
3) Start the clients (also in parallel).

Shut down is the same thing in reverse.

Lustre supports a facility for making each stage of the startup wait for its prerequisites; however, I have not looked into how to set this up using the Lustre-native tools.  An alternative is a script that ssh's to each machine as needed, using passwordless, passphraseless ssh, and starts things up in the proper order; it could probably be parallelized.  That leverages a technology that is useful in many scenarios, but it becomes one more thing to maintain as your Lustre cluster evolves - the Lustre-native solution should "just do the right thing".
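
A minimal sketch of the ssh approach, assuming hypothetical hostnames, passwordless/passphraseless root ssh, and /etc/lustre/config.xml already distributed to every node:

  #!/bin/sh
  # Start a Lustre cluster in dependency order: OSS's, then the MDS, then clients.
  OSSES="oss1-host oss2-host"
  MDS="mds1-host"
  CLIENTS="client1-host client2-host"

  # Start the OSS's in parallel, and wait for all of them to finish.
  for host in $OSSES; do
      ssh root@$host lconf /etc/lustre/config.xml &
  done
  wait

  # Then the MDS.
  ssh root@$MDS lconf /etc/lustre/config.xml

  # Then the clients, again in parallel.
  for host in $CLIENTS; do
      ssh root@$host lconf /etc/lustre/config.xml &
  done
  wait

Shutting down would be the same three loops in reverse order (clients, then MDS, then OSS's), using whatever lconf cleanup invocation your release provides.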

-----

Things to watch out for:

The LDAP support in Lustre is reportedly going away, so it's best avoided.  In current releases, you can use LDAP as an alternative to config.xml, but this is to be removed from future releases.

Combining Lustre with Linux 2.4's in-kernel NFS is currently unstable.  ClusterFS is working to eliminate this problem for UCI on its ESMF contract.  It -might- be working now, but testing it is taking a number of days per test iteration.  When I say "working", I mean "no lustre errors".  We are continuing to get NFS write errors with an AIX 5.1 NFS client and RHEL 3 NFS server when reexporting Lustre over in-kernel NFS.

Lustre reportedly has been reexported via Samba successfully.  I'm willing to hazard a guess that UNFSD (a userspace NFS implementation) might be able to reexport a Lustre filesystem reliably as well, but try to get IBM to support that!  :)

All of the OST's should be of equal size.  If they are not, then all OST's are treated as though they were the same size as the smallest OST, and the extra space in the larger OST's is wasted (for Lustre purposes; I believe you could still use that space for other things).

Your largest possible file size is roughly (the size of your smallest OST minus ext3 overhead) * the stripe count of the relevant LOV.

Your filesystem size is roughly the sum of the sizes of your OST's, minus their ext3 overhead.
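
A quick worked example with made-up numbers - say 3 OST's each offering roughly 900 gigabytes after ext3 overhead, and a stripe count of 2:

  # Hypothetical sizes, in gigabytes.
  ost_usable=900     # smallest OST, after ext3 overhead
  num_osts=3
  stripe_count=2
  echo "largest file:    $(( ost_usable * stripe_count )) GB"   # 1800 GB
  echo "filesystem size: $(( ost_usable * num_osts )) GB"       # 2700 GB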

Someday, zeroconf is intended to be The Main Way to configure lustre, but for now, zeroconf can only do a fraction of what lconf can.

The MDS requires disk space of around 3 percent of the size of your filesystem, but this varies - if you have a large number of small files, then you'll need more MDS space, and vice versa.  Also, the more stripes you have, the more MDS space you'll require.

ClusterFS believes that mkfs does not grow the ext3 journal fast enough as filesystem size increases, so they grow it faster.  However, it is not clear to me whether they do this by manually passing a magic option to their tools (which do the mkfs's for you), or whether their tools do it by default.

Lustre does not yet support quotas.  Quota support is anticipated in 2005.  You could probably fake quotas with some sudo magic around rm+chown, plus a du+chown cron.
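
Purely speculative, but the du side of that might look something like the following cron-driven report; the mount point, per-user directory layout, and threshold are all hypothetical:

  #!/bin/sh
  # Rough quota substitute: report per-user directories over a size threshold.
  # Assumes a hypothetical /mnt/lustre/users/<username> layout; run from cron.
  limit_kb=$((100 * 1024 * 1024))   # 100 gigabytes, expressed in kilobytes
  for dir in /mnt/lustre/users/*; do
      used_kb=$(du -sk "$dir" | awk '{print $1}')
      if [ "$used_kb" -gt "$limit_kb" ]; then
          echo "over soft limit: $dir uses ${used_kb} KB"
      fi
  done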

IBM, after talking things over with CFS, appears to regard adding an OST to an existing Lustre filesystem as somewhat hazardous to the filesystem's data.  They are making a disclaimer along the lines of "we aren't responsible if all of your data is lost".

Because Lustre skips past the Linux buffer cache, tweaks to the buffer cache intended to improve our 3ware RAID card performance are not helping much, if at all.  That means that vm. sysctl's and tweaks under /proc/sys/vm are probably still worthwhile on some systems, but apparently not on Lustre systems (unless those systems have other disks that are not accessed through Lustre).

I once experienced a 170 minute ls on a small directory in a Lustre filesystem that was receiving large amounts of data.  I got conflicting reports from two different people at ClusterFS on whether this was fixable with production code.  One of them said it was a known problem in modern Lustre code, and that fixing it was not a high priority.  The other said that if we get recent enough Lustre code that uses "glimpse", the problem should go away.  However, we are using pretty recent code, and the first ClusterFS person said he believed that all Lustre releases use glimpse; he also indicated that this is probably a Linux VFS problem.

If you change your Lustre filesystem's striping, files created before the time of the change are not rearranged.  They preserve their previous striping, while new files are created with the new striping.

Recommended stripe sizes:
Lustre 1.2: 512k
Lustre 1.4: 1m
The stripe size must be a largish power of 2.

-----

Optimizations used by Lustre for performance:

Lustre attempts to keep 5-6 transfers in flight.  Writing to the clients is asynchronous.  Writing to the OST's is synchronous.

Obviously striping can increase performance.

Their Portals layer (originally from Sandia National Laboratories, I believe) supports channel bonding.  There is also the Linux kernel's channel bonding.  I believe neither is set up by default.  The Portals bonding is supposed to be more effective and handle failover better than the Linux kernel's channel bonding, but you're more likely to reuse the knowledge you gain if you use the Linux kernel's version.

They support gig-e, 10 gig-e, and Elan.  Elan puts relatively little load on the CPU's.  They plan to support InfiniBand before long.  They say that it's difficult to saturate two channels of gig-e from a 32-bit x86 box.

Lustre precreates inodes (empty files) on the OST's for the stripes.

Filesystem recovery after a crash is simple: the MDS just tells the OST's to delete all data files with numbers above a certain threshold - those that had not yet been committed on the OST's.

-----

Lustre knowledge resources:

1) You can get a contract with ClusterFS.  Note that the CFS folks, even with a support contract, don't appear to like telephones much, but they are open to e-mail, IRC, and various forms of instant messaging.  Maybe they have a higher fee for phone support; I really don't know.

2) I have a web page with some lustre notes, including the notes I took on my palm pilot during the lustre training in San Francisco, at https://stromberg.dnsalias.org/~strombrg/Lustre-notes.html.  I'll also put my notes from this talk on that page, and e-mail these notes to whatever address you folks prefer.  I'm also happy to try to answer questions in person, on the phone, or in e-mail.  Duncan, who attended the Lustre training in San Francisco with me, indicates that he is open to answering Lustre questions as well.

3) There are Lustre-related mailing lists at https://lists.clusterfs.com/mailman/listinfo.  Some, and most likely all, of them are moderated, but they do appear to let some of the more awkward questions through to the list.

4) freenode has a #lustre channel on IRC.  There's also a #uci-cfs channel on freenode, but it's not clear if ClusterFS prefers to keep this ESMF-only, or is leaving it open for all of UCI.  I'm guessing that if you use #lustre, you won't be required to have a contract, but in #uci-cfs, you likely will.

5) Duncan and I both have "Lustre Operations Manuals".   This manual is apparently normally only given to people who attend the Lustre training.  We're both open to letting people borrow them.

6) https://wiki.clusterfs.com/lustre/BugFiling is a wiki about using ClusterFS's bugzilla.

7) There is supposedly a /usr/lib/lustre/examples on lustre systems that can be helpful.  However, our lustre systems in the ESMF do not appear to have it.

-----

Table of contents from the Lustre Operations Manual

1) Prerequisites
2) Creating a new filesystem
3) Configuring monitoring
4) Health checking and troubleshooting
5) Health checking (sic)
6) Managing configurations
7) Managing lustre
8) Mixing architectures

-----

The End :)


----- End of Memo -----