Data recovery from a stochastically flaky/crashy filesystem: try-copying-up-to-n-times

This software is owned by The University of California, Irvine, and is not distributed under any version of the GPL. The GPL is a fine series of licenses, but the owners of the software need it to be distributed under these terms.
try-copying-up-to-n-times is a script I wrote to facilitate recovering data from a filesystem that shows inconsistent behavior - sometimes a file looks fine, and other times the file won't be readable, or the filesystem will even crash when you try to read a particular file or series of files.

The general flow of usage is:

Create and seed a database for the source directory, with -i, carefully selecting the number of reattempts you want before a file is marked "failed", aka "unrecoverable". Wait while an iteration is attempted.
Run more iterations with -r, until you've had enough.
In both of the above cases, you may want to specify -c or -C to determine some conditions under which the script will stop trying for the time being (until you rerun it).
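
The countdown idea behind -i and -r can be sketched roughly as follows. This is not the script's actual code: the function names (seed, run_pass), the "succeeded"/"failed" markers, and the use of a plain dict in place of the on-disk database are all assumptions made for illustration. The max_consecutive_errors parameter only loosely mirrors what -C does.

```python
SUCCEEDED = "succeeded"  # hypothetical status markers, not the script's real format
FAILED = "failed"


def seed(db, relative_paths, repetitions):
    """Like -i: give every file the same number of attempts."""
    for path in relative_paths:
        db[path] = repetitions


def run_pass(db, copy_one, max_consecutive_errors=None):
    """Like one -i or -r iteration: try each still-pending file once.

    copy_one(path) is a caller-supplied function that raises OSError on
    failure.  Returns False if the pass stopped early because of too many
    consecutive errors, True otherwise.
    """
    consecutive = 0
    for path in sorted(db):
        remaining = db[path]
        if remaining in (SUCCEEDED, FAILED):
            continue  # already settled on an earlier pass
        try:
            copy_one(path)
        except OSError:
            remaining -= 1
            db[path] = FAILED if remaining == 0 else remaining
            consecutive += 1
            if (max_consecutive_errors is not None
                    and consecutive >= max_consecutive_errors):
                return False
        else:
            db[path] = SUCCEEDED
            consecutive = 0
    return True
```

Each call to run_pass uses up at most one attempt per file, which is why a file seeded with 5 repetitions needs up to 5 separate runs before it is marked failed.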

Command-line arguments look like this:

esmf04m-root> ./try-copying-up-to-n-times 
./try-copying-up-to-n-times: -i, -r, -m, -g, -s and -e are mutually exclusive, and exactly one must be specified
Usage: ./try-copying-up-to-n-times [-i databasefile initialrepetitions] [-r databasefile initialrepetitions] [-e databasefile]
"-i databasefile repetitions" says initialize the database.
    Repetitions is the max number of times we will try to copy a given file
"-r databasefile" says restart: continue counting down repetitions
"-e databasefile" says delete a preexisting database
"-d sourcehier desthier" says to copy data from sourcehier to desthier
"-m databasefile repetitions filename" says to set filename's repetition count to a specific
    value, manually
"-g database filename" says get the value for filename's repetition count
"-s database" says to summarize counter status for all files in the database
"-v n" says to operate verbosely.  Higher n is more verbose.  1 is only for definite error conditions,
    2 is surprise (non-)preexistence conditions, and 3 is for the whole ball of wax
"-c shellcommand n" says to run shellcommand after attempting to copy n files.  If the command
    returns POSIX shell false, ./try-copying-up-to-n-times will exit.  Otherwise we continue
"-C n" says that if %s sees n consecutive file errors, terminate prematurely

-i, -r, -m and -e are mutually exclusive

Only regular files, directories and symlinks are handled at this time.  Hard links are not
preserved, their relationship will be broken silently
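
Because hard-link relationships are lost silently, it may be worth listing the affected files before copying. A minimal sketch, assuming a POSIX filesystem where st_nlink is meaningful and where the source hierarchy is healthy enough to be walked; find_hard_linked is a hypothetical helper, not part of the script.

```python
import os
import stat


def find_hard_linked(top):
    """Return paths of regular files under top that have more than one link."""
    linked = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat so symlinks are not followed
            if stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                linked.append(path)
    return sorted(linked)
```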

This program uses the python anydbm interface, so it may seemingly at random choose a backend
database like berkeley db, gdbm, dbm, dumbdbm or others.  However, once a database of a given
name is created, subsequent usage of that same database name should come
up with the same type.
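
To see which backend a given database ended up with, the standard library can usually tell you. The sketch below is written for Python 3, where anydbm became the dbm package and the old whichdb module became dbm.whichdb; it is not part of the script itself, and describe_backend is a hypothetical helper name.

```python
import dbm


def describe_backend(database_path):
    """Return a short description of the dbm backend behind database_path."""
    kind = dbm.whichdb(database_path)
    if kind is None:
        return "missing or unreadable"
    if kind == "":
        return "unrecognized format"
    return kind  # a module name such as "dbm.gnu" or "dbm.dumb"
```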

This is a letter I sent to a client about combining an automatic reboot solution from CPS with my try-copying-up-to-n-times script:

My script that only retries copying files a user-specified number of times seems to help, but I'm still making so many trips to the machine room that it's taking forever.

The initial positive indication with Francois' data, unfortunately, turned out to be a case of Lustre giving my program a far-too-short list of files to transfer, I believe. I've since modified the program to re-enumerate all files in a directory hierarchy each time it is run, instead of only the first time, in an effort to pick up files that are visible sometimes and not others.

I've also modified my "lustre" bash script to greatly reduce the back-and-forth between myself and lustre when rebooting lustre, but it's still slow going, entirely because of the physical need for reboots.

Given how slowly things are progressing, I'm ready to suggest that either we give up with what little data we have so far, or that we investigate some form of automatic reboot facility - perhaps something from http://www.cpscom.com/reboot.htm . I am currently estimating that this could cost as little as:

1) A base unit that cycles AC power, costing $115
2) A power strip with enough plugs (7 or more)
3) A serial cable
4) A spare PC running Linux, possibly esmft2, to act as a "controller" in CPS' parlance.

Is there interest in investigating this sort of approach, or should we just drop it? The automation software I'd need is almost entirely done already.

I must point out that even with this sort of solution, we may still find that we get only a small fraction of the ESMF data back. For example, with Jin Yi's data, this is how far my program has progressed so far:

Succeeded: 1420
Failed: 349
1: 94
2: 108
3: 1
4: 3930
5: 39046

The "succeeded" and "failed" numbers are probably pretty apparent. For "1" through "5", the number after the colon is how many files can still be retried that many times (on separate, serial runs of the program) before we conclude whether they are recoverable or not.
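
As a worked example of reading that summary, the figures can be tallied like this. The dictionary layout is purely for illustration and says nothing about how -s stores or prints its data.

```python
succeeded = 1420
failed = 349
# remaining retry count -> number of files, copied from the summary above
pending = {1: 94, 2: 108, 3: 1, 4: 3930, 5: 39046}

still_pending = sum(pending.values())
total = succeeded + failed + still_pending
percent_recovered = 100.0 * succeeded / total

print("pending: %d of %d files" % (still_pending, total))
print("recovered so far: %.1f%%" % percent_recovered)
```

In other words, 43179 of the 44948 files seen so far are still undecided, and only about 3% have been recovered.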

Known bugs:

Sometimes, it's possible for the script to error out because of a missing directory in the target hierarchy. For now, one can just mkdir it, and rerun the script.
If the filesystem containing your database gives a write error due to a full filesystem (and perhaps for other reasons as well), then your database may become corrupted (at least if you're using Berkeley DB 4.2, AIX 5.1 ML 4 and python 2.4.x; I'm not sure about other python database interfaces). However, the worst corruption I've seen so far could be temporarily corrected with:
- ./try-copying-up-to-n-times -s /tmp/Francois-subset-db
- Make a note of any key that the above step tracebacks on. For the sake of discussion, assume it is "run110/xrun110-68590000.field"
- Then run "./try-copying-up-to-n-times -m /tmp/Francois-subset-db 2 run110/xrun110-68590000.field" to correct the problem with that key, where "2" is the number of times you wish to retry the file in question.
However, this led to a long series of such manual tweaks, so eventually I wound up writing a small python script that would obtain a list of all the keys in a database, check that they have data associated with them, and write them out to another database. I converted dbhash to gdbm, but dbhash to dbhash might have worked as well.
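
A salvage pass of that kind might look roughly like the following. This is a sketch, not the original script: it is written against the Python 3 dbm package rather than the anydbm/dbhash modules mentioned above, salvage is a hypothetical name, and it lets dbm.open pick the destination backend (call a specific backend's open(), such as dbm.gnu.open, if you want to force one).

```python
import dbm


def salvage(source_path, dest_path):
    """Copy every key whose value can still be read into a new database.

    Returns a (copied, skipped) pair of counts.
    """
    copied = skipped = 0
    with dbm.open(source_path, "r") as source, \
            dbm.open(dest_path, "n") as dest:
        for key in source.keys():
            try:
                value = source[key]
            except Exception:
                # A damaged entry may fail in backend-specific ways.
                skipped += 1
                continue
            if not value:
                skipped += 1  # key exists but has no data behind it
                continue
            dest[key] = value
            copied += 1
    return copied, skipped
```

Any keys that were skipped can then be given a fresh count in the new database with -m.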

Future directions

A python module for handling ranges of numbers (e.g., file blocks). It might be nice to make the program understand how to retrieve parts of files, rather than treating each file in such an all-or-nothing manner.
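
No such module exists yet, so the following is only a guess at the kind of bookkeeping it might do: tracking which half-open (start, stop) block ranges of a file have been recovered and which are still missing. Both function names are hypothetical.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent half-open (start, stop) ranges."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def missing_ranges(recovered, size):
    """Return the half-open ranges of [0, size) not covered by recovered."""
    gaps = []
    position = 0
    for start, stop in merge_ranges(recovered):
        if start > position:
            gaps.append((position, start))
        position = max(position, stop)
    if position < size:
        gaps.append((position, size))
    return gaps
```

For a 16-block file with blocks 0-3 and 10-11 recovered, missing_ranges([(0, 4), (10, 12)], 16) reports (4, 10) and (12, 16) as still to be fetched.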

Timestamp: 2025-10-11 23:20:40 PDT
