I tried to pull data out of a specific directory by running rsync in
a 1000-iteration while loop. This seemed to work well for a while,
but eventually it reached a state where esmfsn02 would crash as soon
as copying started. I've checked esmfsn02's logs; there are copious
Lustre errors, but I didn't see any hardware errors (2005-04-04).
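A minimal sketch of that retry loop in python, with the retry logic split out so it isn't tied to rsync. The rsync flags and the paths in the comment are illustrative, not the real esmfsn02 mount points:

```python
import subprocess
import time

def retry_until_clean(run, attempts=1000, pause=0):
    """Call run() (which returns an exit code) until it returns 0 or
    we exhaust attempts. Returns the number of tries used, or None
    if it never succeeded."""
    for i in range(1, attempts + 1):
        if run() == 0:
            return i
        time.sleep(pause)  # give the servers a moment before retrying
    return None

def rsync_once(src, dest):
    # rsync exits 0 on success; --partial keeps what made it across
    # so the next attempt can resume. Paths are placeholders.
    return subprocess.call(["rsync", "-a", "--partial", src, dest])

# e.g.: retry_until_clean(lambda: rsync_once("/lustre/data/", "safehost:/backup/"))
```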
I also discussed the possibility of using dd_rescue with the
project's author. dd_rescue is a program intended to copy whatever
portions of a large file (or block device) can still be read. It
isn't precisely what we need, but it might be adaptable, or it
could be a source of design ideas.
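The core idea - copy block by block, skip what can't be read, remember the holes - can be sketched in a few lines of python. This is my sketch of the technique, not dd_rescue itself, and the block size is an arbitrary guess:

```python
import os

BLOCK = 64 * 1024  # copy granularity; unreadable blocks are skipped

def salvage(src, dest):
    """Copy whatever blocks of src can be read, leaving holes in
    dest where reads raise I/O errors - roughly what dd_rescue
    does. Returns a list of (offset, length) holes we could not
    read, which is exactly the bookkeeping we'd want later."""
    holes = []
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        offset = 0
        while offset < size:
            want = min(BLOCK, size - offset)
            try:
                fin.seek(offset)
                data = fin.read(want)
                fout.seek(offset)
                fout.write(data)
            except OSError:
                holes.append((offset, want))  # leave a hole in dest
            offset += want
        fout.truncate(size)
    return holes
```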
"lfs getstripe filename" lists the OSTs that a file is striped
across. We might be able to use this to derive a list of OSTs that
seem to be troublesome, and then avoid copying the portions of files
that live on those OSTs. No guarantee, though, that the
unreliability we're seeing is tied to particular OSTs.
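Assuming we can build a mapping from each file to the set of OST indices it stripes across (by running "lfs getstripe" on each file and parsing its output - the parsing isn't shown here), sorting the files into "safe to copy whole" and "touches a suspect OST" is just set intersection:

```python
def partition_by_ost(file_osts, bad_osts):
    """file_osts: {filename: set of OST indices it stripes across},
    as gathered from "lfs getstripe". bad_osts: OST indices we
    currently distrust. Returns (safe, suspect) filename lists."""
    safe, suspect = [], []
    for fname, osts in sorted(file_osts.items()):
        # any overlap with a bad OST makes the whole file suspect
        (suspect if osts & bad_osts else safe).append(fname)
    return safe, suspect
```

The filenames and OST indices above are hypothetical illustration data, not output from our filesystem.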
Things I haven't tried (yet)
Lustre changes. Each of these might improve Lustre reliability, and
then we could copy the Lustre data to another device or devices using
some protocol other than NFS, for example native rsync, rsync+ssh,
&c. Someone on the OCLUG mailing list indicated that Lustre is
unreliable even when you don't combine it with NFS, but it may still
prove reliable enough to be worth doing. Or not. I should also point
out that if ClusterFS isn't getting their money, they may be unlikely
to help us with these possibilities, and their help might be very
useful.
We could upgrade to a newer Lustre, without NFS support
We could downgrade to an older Lustre, without NFS support
We could keep a contemporary Lustre, but without NFS support
Pulling data out of a troubled, striped filesystem when the good
stripes are still accessible and the bad stripes give read errors.
We could write some code that pulls out what it can, keeping track
of which parts have been extracted successfully and which have not.
Given a python module that performs set arithmetic on sparse sets,
it probably wouldn't take that long to code up - I'm guessing less
than 40 hours of staff time.
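The sparse-set arithmetic needed is small; a sketch of it, treating extracted regions as half-open (start, end) byte ranges:

```python
def merge(ranges):
    """Collapse a list of (start, end) half-open byte ranges into a
    sorted, non-overlapping list."""
    out = []
    for start, end in sorted(ranges):
        if out and start <= out[-1][1]:
            # overlaps or abuts the previous range: extend it
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out

def remaining(size, got):
    """Set-subtract the successfully extracted ranges from the whole
    file, yielding the (start, end) ranges still to fetch."""
    todo, pos = [], 0
    for start, end in merge(got):
        if start > pos:
            todo.append((pos, start))
        pos = max(pos, end)
    if pos < size:
        todo.append((pos, size))
    return todo
```

Each pass over the file would record what it got, and remaining() tells the next pass what to retry.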
We may be able to identify which OSTs are yielding problems by
keeping track of which file regions are on which OSTs, and which
OSTs we were reading from when Lustre crashed - and then mark those
OSTs (or just those regions, if there is no reasonably useful
OST-level relationship) as bad, no longer trying to get them out.
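The crash bookkeeping could be as simple as a counter: note which OSTs were in flight at each crash, and blacklist any OST implicated often enough. The threshold is a guess; this is a sketch of the bookkeeping, not a diagnosis tool:

```python
from collections import Counter

class OstBlame:
    """Track which OSTs we were reading from when Lustre went down,
    and mark an OST bad once it has been implicated in `threshold`
    crashes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.crashes = Counter()

    def record_crash(self, osts_in_flight):
        # every OST we were touching at crash time gets a share of blame
        for ost in osts_in_flight:
            self.crashes[ost] += 1

    def bad_osts(self):
        return {ost for ost, n in self.crashes.items()
                if n >= self.threshold}
```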
If we're generating a lot of crashes, it's almost certainly going
to speed things up a lot to have some form of network-addressable
power strip, to automate the many reboots we're likely to require
to get the data back. However, even with this automation, the
additional time due to reboots and Lustre restarts may be
prohibitive.
We have the option, as well, of combining both of these methods,
if it's looking like that would be helpful.