The general flow of usage is:
Command-line arguments look like this:
esmf04m-root> ./try-copying-up-to-n-times
./try-copying-up-to-n-times: -i, -r, -m, -g, -s and -e are mutually exclusive, and exactly one must be specified
Usage: ./try-copying-up-to-n-times [-i databasefile initialrepetitions] [-r databasefile initialrepetitions] [-e databasefile]
   "-i databasefile repetitions" says to initialize the database. Repetitions is the max number of times we will try to copy a given file
   "-r databasefile" says to restart: continue counting down repetitions
   "-e databasefile" says to delete a preexisting database
   "-d sourcehier desthier" says to copy data from sourcehier to desthier
   "-m databasefile repetitions filename" says to set filename's repetition count to a specific value, manually
   "-g database filename" says to get the value for filename's repetition count
   "-s database" says to summarize counter status for all files in the database
   "-v n" says to operate verbosely. Higher n is more verbose: 1 is only for definite error conditions, 2 is for surprise (non-)preexistence conditions, and 3 is for the whole ball of wax
   "-c shellcommand n" says to run shellcommand after attempting to copy n files. If the command returns POSIX shell false, ./try-copying-up-to-n-times will exit. Otherwise we continue
   "-C n" says that if ./try-copying-up-to-n-times sees n consecutive file errors, terminate prematurely

Only regular files, directories and symlinks are handled at this time. Hard links are not preserved; their relationship will be broken silently.

This program uses the python anydbm interface, so it may, seemingly at random, choose a backend database like berkeley db, gdbm, dbm, dumbdbm or others. However, once a database of a given name is created, subsequent usage of that same database name should come up with the same type.

This is a letter I sent to a client, about combining an automatic reboot solution from CPS with my try-copying-up-to-n-times script:
My script that only retries copying files a user-specified number of times seems to help, but I'm still making so many trips to the machine room that it's taking forever.
The initial positive indication with Francois' data, unfortunately, turned out to be a case of Lustre giving my program a far-too-short list of files to transfer, I believe. I've since modified the program to re-enumerate all files in a directory hierarchy each time it is run, instead of only the first time, in an effort to pick up files that are visible sometimes and not others.
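That per-run re-enumeration can be sketched roughly as follows, using Python 3's dbm module as a stand-in for the anydbm interface the script actually uses. The function name and the counter layout (a string repetition count keyed by file path) are illustrative assumptions here, not the script's real internals:

```python
# Rough sketch of per-run re-enumeration: walk the source hierarchy on
# every run, and add any newly-visible file to the counter database at
# the full repetition count, without disturbing existing counters.
# Hypothetical names; the real script's internals may differ.
import dbm
import os

def reenumerate(databasefile, sourcehier, repetitions):
    # "c" opens the database if present, creating it on the first run;
    # dbm picks a backend much the way anydbm did in Python 2, and
    # reopening the same file comes back with the same backend.
    with dbm.open(databasefile, "c") as db:
        added = 0
        for dirpath, _dirnames, filenames in os.walk(sourcehier):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if path not in db:  # seen for the first time this run
                    db[path] = str(repetitions)
                    added += 1
        return added
```

The point of running this at the top of every invocation, rather than only once, is that a file Lustre hides on one run can still join the retry bookkeeping on a later run, while files already being counted down keep their existing counters.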
I've also modified my "lustre" bash script to greatly reduce the back-and-forth between myself and lustre when rebooting lustre, but it's still slow going, entirely because of the physical need for reboots.
Given how slowly things are progressing, I'm ready to suggest that either we give up with what little data we have so far, or that we investigate some form of automatic reboot facility - perhaps something from http://www.cpscom.com/reboot.htm . I am currently estimating that this could cost as little as:
1) A base unit that cycles AC power, costing $115
2) A power strip with 7 or more plugs
3) A serial cable
4) A spare PC running linux, possibly esmft2, to act as a "controller"
in CPS' parlance.
Is there interest in investigating this sort of approach, or should we just drop it? The automation software I'd need is almost entirely done already.
I must point out that, even with this sort of solution, we may still find that we get only a small fraction of the ESMF data back. For example, with Jin Yi's data, this is how far my program has progressed so far:
Succeeded: 1420
Failed: 349
1: 94
2: 108
3: 1
4: 3930
5: 39046
The "succeeded" and "failed" numbers are probably pretty apparent. The "1" through "5" mean that there are n files that still need to be retried from 1 to 5 times (on separate, serial runs of the program) before concluding that they are recoverable or not.
You can e-mail the author with questions or comments: