• What I've looked into
    1. I tried to pull data out of a specific directory, using rsync in a 1000-iteration while loop. This seemed to work well for a while, but eventually it hit a steady state where as soon as it started copying, esmfsn02 would crash. I've checked esmfsn02's logs, and there are copious Lustre errors, but I didn't see any hardware errors (2005-04-04).
    2. I also discussed the possibility of using dd_rescue with the project's author. It's a program that is intended for copying any portions of a large (block device) file that can be copied. It isn't precisely what we need, but it might be adaptable, or it could be a source of design ideas.
    3. "lfs getstripe filename" can be used to get a list of the OSTs that a file is striped across. We might be able to use this to derive a list of OSTs that seem to be troublesome, and then avoid copying the portions of files that are on those OSTs. There is no guarantee, though, that the unreliability we're seeing is tied to particular OSTs.
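
As a rough illustration of the last idea, here is a minimal Python sketch that works out which byte ranges of a file avoid a set of suspect OSTs. It assumes plain round-robin (RAID-0 style) striping, and that the stripe size and the ordered list of OST indices have already been read by hand from "lfs getstripe" output. The function names and the sample numbers are made up for illustration; nothing here parses real lfs output.

```python
def ost_for_offset(offset, stripe_size, ost_indices):
    """Return the OST index holding the byte at `offset`.

    Assumes round-robin striping: stripe 0 is on ost_indices[0],
    stripe 1 on ost_indices[1], and so on, wrapping around.
    """
    stripe_number = offset // stripe_size
    return ost_indices[stripe_number % len(ost_indices)]


def readable_ranges(file_size, stripe_size, ost_indices, bad_osts):
    """Return half-open (start, end) byte ranges that avoid `bad_osts`.

    Adjacent good stripes are merged into a single range.
    """
    ranges = []
    offset = 0
    while offset < file_size:
        end = min(offset + stripe_size, file_size)
        if ost_for_offset(offset, stripe_size, ost_indices) not in bad_osts:
            if ranges and ranges[-1][1] == offset:
                # Extend the previous range instead of starting a new one.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((offset, end))
        offset = end
    return ranges


if __name__ == "__main__":
    # Hypothetical file: 10 MiB, 1 MiB stripes, striped over OSTs 3 and 7,
    # with OST 7 suspected of triggering crashes.
    mib = 1024 * 1024
    for start, end in readable_ranges(10 * mib, mib, [3, 7], {7}):
        print(start, end)
```

If the unreliability turns out not to follow OST boundaries, the same range bookkeeping would still apply, just keyed on regions rather than OSTs.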
  • Things I haven't tried (yet)
    1. Lustre changes. Each of these might help Lustre reliability, and then we could copy Lustre data to another device or devices using some protocol other than NFS, for example native rsync, rsync+ssh, &c. Someone on the OCLUG mailing list indicated that Lustre is unreliable even when you don't try to combine it with NFS, but it may still prove reliable enough to be worth doing. Or not. I should also point out that if ClusterFS isn't getting their money, they may be unlikely to help us with these possibilities, and their help might be very useful.
      1. We could upgrade to a newer Lustre, but without NFS support
      2. We could downgrade to an older Lustre, without NFS support
      3. We could use a contemporary Lustre, but without NFS support
    2. Pulling data out of a troubled, striped filesystem when the good stripes are still accessible, and the bad stripes give read errors.
      1. We could write some code that would pull out what it can, and keep track of which parts have been extracted successfully and which have not. Given a Python module to perform set arithmetic on sparse sets, it probably wouldn't take that long to code up. I'm guessing less than 40 hours of staff time.
      2. We may be able to identify which OSTs are causing problems, by keeping track of which file regions are on which OSTs, and which OSTs we were reading from when Lustre crashed. We could then mark those OSTs (or just regions, if there is no reasonably useful relationship) as bad, and stop trying to read from them.
      3. If we're generating a lot of crashes, it's almost certainly going to speed things up a lot to have some form of network-addressable power strip, to automate the many reboots we're likely to require to get data back. However, even with this automation, the additional time due to reboots and Lustre restarts may be prohibitive.
    3. We have the option, as well, of combining both of these methods, if it's looking like that would be helpful.
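
The "pull out what it can" idea above could look something like the following Python sketch. It is only a starting point under some loud assumptions: a bad region shows up as an OSError (or a short read) rather than a hang or a server crash, the good/bad bookkeeping is a simple list of merged byte ranges rather than a real sparse-set module, and all function names are hypothetical. The reader and writer are passed in as callables so the loop can be tried against a fake source before pointing it at Lustre.

```python
import os


def _extend(ranges, start, end):
    """Append (start, end) to `ranges`, merging with the last range if adjacent."""
    if ranges and ranges[-1][1] == start:
        ranges[-1] = (ranges[-1][0], end)
    else:
        ranges.append((start, end))


def salvage(read_chunk, write_chunk, file_size, chunk_size):
    """Try each chunk once; return (good, bad) lists of half-open byte ranges.

    read_chunk(offset, length) returns bytes or raises OSError.
    write_chunk(offset, data) stores the bytes at the same offset.
    """
    good, bad = [], []
    offset = 0
    while offset < file_size:
        end = min(offset + chunk_size, file_size)
        try:
            data = read_chunk(offset, end - offset)
            if len(data) != end - offset:
                raise OSError("short read")
            write_chunk(offset, data)
            _extend(good, offset, end)
        except OSError:
            # Record the failure and move on to the next chunk.
            _extend(bad, offset, end)
        offset = end
    return good, bad


def file_reader(fd):
    """Wrap an open file descriptor as a read_chunk callable (Unix only)."""
    return lambda offset, length: os.pread(fd, length, offset)


def file_writer(fd):
    """Wrap an open file descriptor as a write_chunk callable (Unix only)."""
    return lambda offset, data: os.pwrite(fd, data, offset)
```

Saving the good and bad range lists to disk between runs would let a later pass skip what has already been extracted, which matters if every few reads costs a reboot.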

    Timestamp: 2024-03-01 12:22:45 PST
