• For now, this is a list of things to check manually, but someday it should be assembled into a program or series of programs.
    1. Spot check some files' content: do they have the same digest? MD5 or SHA-1 should be fine: although both are broken for cryptographic purposes, an accidental collision is still extremely unlikely. RIPEMD-160, which apparently hasn't been broken, is a bit less convenient to use in some applications, due to the lack of a broadly available Python API.
    2. Spot check some file ownerships: both user and group - are they the same?
    3. Spot check some file permissions bits - are they the same?
    4. Does the filesystem have quotas, either user or group? If so, check some to see if they are the same.
    5. Does the filesystem have ACLs? If so, check some to see if they are the same.
    6. Does the filesystem have Linux extended attributes (e.g., in an ext3 filesystem)? If so, check some to see if they are the same.
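The content, ownership, and permission checks in the list above can be sketched in Python. This is a minimal sketch, not a finished tool. It assumes both trees are reachable as local paths (for example over an NFS mount), so that numeric user and group IDs are directly comparable. The names `file_digest` and `compare_files` are made up for illustration. Quotas, ACLs, and extended attributes are not examined.

```python
import hashlib
import os
import stat


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_files(source_path, dest_path):
    """Return the aspects that differ: "user", "group", "mode", "content"."""
    # lstat so that symlinks are compared as themselves, not their targets.
    src = os.lstat(source_path)
    dst = os.lstat(dest_path)
    differences = []
    if src.st_uid != dst.st_uid:
        differences.append("user")
    if src.st_gid != dst.st_gid:
        differences.append("group")
    if stat.S_IMODE(src.st_mode) != stat.S_IMODE(dst.st_mode):
        differences.append("mode")
    # Only regular files have content worth hashing in this sketch.
    if stat.S_ISREG(src.st_mode) and stat.S_ISREG(dst.st_mode):
        if file_digest(source_path) != file_digest(dest_path):
            differences.append("content")
    return differences
```

An empty list from `compare_files` means the pair passed these four checks. Any other result names what to look at by hand.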
  • A beginning for automatic verification: verify
  • Pseudo-code for doing a transfer with verification
    1. Inputs
      1. directory+host to transfer from
      2. directory+host to transfer to
      3. transfer protocol: rsh, ssh, native rsync
      4. absolute transfer or relative (rsync can be a relative transfer)
      5. archiver: tar, gnu tar, cpio, gnu cpio, afio, pax, dump, ufsdump...
      6. what aspects of the transfer the user is willing to ignore, if any
      7. Is X11 functional? If yes, then we could give a better running tally of how things are progressing...
      8. Degree of both final and concurrent spotchecking? Probably should be able to specify as either an absolute number of files to check, or as a percentage of files transferred.
      9. Internal table of what's flakey on which platforms
      10. User's $PATH, augmented by appending a few common directories
      11. Interval between "concurrent spotchecks": every n files, every m minutes, every n blocks, &c. Perhaps allow specifying a ceiling on the number of concurrent spotchecks to perform, since they'll tend to slow down the transfer a bit, and the assurance they give is a case of diminishing returns. Might be nice to have an exponential backoff...
      12. Directory to which we do extractions for spot checks, perhaps specified separately for final and concurrent. Could be on the source system or the destination system, and could be in the same directory hierarchy as the source or destination system, or another hierarchy entirely.
      13. block sizes to use on producer and consumer
      14. favor security or speed?
    2. Outputs
      1. new data on destination directory+host
      2. any needed error messages from during the transfer
      3. verification results
      4. total time of transfer
      5. rate of entire transfer
    3. Algorithm
      1. Preliminaries
        1. Check if quotas are enabled in source. Check both user and group. If the requested archiver does not handle quotas, ask user if quotas can safely be ignored. Might favor an archiver that handles quotas if required.
        2. Check if ACLs are in use in source. If the requested archiver does not handle ACLs, ask user if ACLs can safely be ignored in this transfer. Might favor an archiver that handles ACLs if required.
        3. Check if Linux extended attributes are in use in source. If the requested archiver does not handle extended attributes, ask user if extended attributes can safely be ignored in this transfer. Might favor an archiver that handles extended attributes if required.
        4. Check if source filesystem has files > 2 gigabytes. If yes, then check if destination filesystem can support files greater than 2 gigabytes. If not, ask user if this can safely be ignored. Might favor an archiver that handles > 2 gig files, if required.
        5. Check size of source data. Can it reasonably be expected to fit into the destination directory? If not, inform the user before we start! Sparse files may complicate this a bit, but a close estimate is probably sufficient anyway. Might want a tolerance range for this test.
        6. Might have an option to automatically try to guess what kind of compression, if any, would be best: gzip, bzip2, probably not rzip unless perhaps it could be used on individual files....
        7. Might have an option to automatically try to guess the throughput of the network connection (if any).
      2. The data transfer
        1. The actual data transfer, assuming prerequisites are satisfied
        2. Possibly do some "concurrent spotchecking", in parallel, during the transfer, so we don't have to wait until the entire transfer is complete to find out there are errors (won't detect all errors!)
        3. Is the destination filesystem growing at a faster rate than we expected? If yes, then warn the user!
      3. Definitely do some spotchecking after the transfer. This should primarily mirror the preliminaries above:
        1. Data blocks match up in a sample of files?
        2. User ownership
        3. Group ownership
        4. Quotas
        5. Permissions bits
        6. ACLs
        7. Linux extended attributes
        8. Did the filesystem run out of room, despite our estimation?
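Two of the spotchecking inputs in the pseudo-code above lend themselves to a short Python sketch: choosing how many files to check (as an absolute number or a percentage), and spacing out concurrent spotchecks with an exponential backoff and a ceiling on their number. The function names and the default doubling factor are assumptions made for illustration. The outline itself does not specify them.

```python
import math
import random


def pick_spotcheck_sample(paths, count=None, percent=None, seed=None):
    """Choose files to spot check, by absolute count or by percentage."""
    if (count is None) == (percent is None):
        raise ValueError("specify exactly one of count or percent")
    paths = list(paths)
    if percent is not None:
        count = math.ceil(len(paths) * percent / 100.0)
    # Never ask for more files than exist, or for a negative number.
    count = max(0, min(count, len(paths)))
    return random.Random(seed).sample(paths, count)


def spotcheck_points(first_interval, max_checks, factor=2):
    """Yield cumulative file counts at which to run a concurrent spotcheck.

    The gap between checks grows by `factor` each time (exponential
    backoff), and at most `max_checks` checks are scheduled.
    """
    position = 0
    interval = first_interval
    for _ in range(max_checks):
        position += interval
        yield position
        interval *= factor
```

For example, `list(spotcheck_points(100, 4))` schedules checks after 100, 300, 700, and 1500 files, so later checks slow the transfer less often.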

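The "will it fit?" preliminary check in the pseudo-code can also be sketched. This is a rough sketch under stated assumptions: both the source tree and the destination directory are local paths, the estimate sums apparent file sizes (`st_size`), which overstates sparse files, and the 5% default tolerance is an arbitrary illustrative value. The names `estimate_tree_size` and `fits_in_destination` are hypothetical.

```python
import os
import shutil
import stat


def estimate_tree_size(root):
    """Sum the apparent sizes of regular files under root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total


def fits_in_destination(source_bytes, dest_dir, tolerance=0.05):
    """True if source_bytes plus a safety margin fits in dest_dir's free space."""
    free = shutil.disk_usage(dest_dir).free
    return source_bytes * (1.0 + tolerance) <= free
```

Calling `fits_in_destination(estimate_tree_size(src), dst)` before starting gives the user a chance to stop early. Calling it again after the transfer covers the "ran out of room despite our estimation" question only indirectly, so error messages from the archiver still need checking.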


    Timestamp: 2025-01-13 13:08:18 PST
