For now, this is a list of things to check manually, but someday it
should be assembled into a program or series of programs.
Spot check some files' content - do they have the same digest?
(MD5 or SHA-1 should be fine: although they're broken for
cryptographic purposes, an accidental collision is still
extremely unlikely. RIPEMD-160, which apparently hasn't been
broken, is a bit less convenient to use in some applications,
due to the lack of a broadly available python API.)
Spot check some file ownerships: both user and group - are they
the same?
Spot check some file permissions bits - are they the same?
Does the filesystem have quotas, either user or group? If so,
check some to see if they are the same.
Does the filesystem have ACLs? If so, check some to see if they
are the same.
Does the filesystem have linux extended attributes (e.g., in an
ext3 filesystem)? If so, check some to see if they are the same.
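As a starting point for turning the manual checks above into a
program, here is a minimal python sketch that compares content
digest, user, group and permissions bits for one pair of files.
It assumes both copies are visible as local paths (for example
over an NFS mount); the names file_digest and compare_files are
made up for illustration, and quotas, ACLs and extended
attributes are not covered.

```python
import hashlib
import os
import stat


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_files(source_path, dest_path):
    """Return the list of aspects that differ between two local paths."""
    differences = []
    src = os.lstat(source_path)
    dst = os.lstat(dest_path)
    if src.st_uid != dst.st_uid:
        differences.append("user")
    if src.st_gid != dst.st_gid:
        differences.append("group")
    # S_IMODE keeps only the permission bits, dropping the file type.
    if stat.S_IMODE(src.st_mode) != stat.S_IMODE(dst.st_mode):
        differences.append("permissions")
    # Only regular files have content worth hashing here.
    if stat.S_ISREG(src.st_mode) and stat.S_ISREG(dst.st_mode):
        if file_digest(source_path) != file_digest(dest_path):
            differences.append("content")
    return differences
```

An empty list means the pair matched on everything this sketch
looks at, which is not the same as matching on everything.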
Pseudo-code for doing a transfer with verification
Inputs
directory+host to transfer from
directory+host to transfer to
transfer protocol: rsh, ssh, native rsync
absolute transfer or relative (rsync can be a relative transfer)
archiver: tar, gnu tar, cpio, gnu cpio, afio, pax, dump, ufsdump...
what aspects of the transfer the user is willing to ignore, if
any
Is X11 functional? If yes, then we could give a better
running tally of how things are progressing...
Degree of both final and concurrent spotchecking? Probably
should be able to specify as either an absolute number of files to
check, or as a percentage of files transferred.
Internal table of what's flakey on which platforms
User's $PATH, augmented by appending a few common directories
Interval between "concurrent spotchecks": every n files, every
m minutes, every n blocks, &c. Perhaps allow specifying a
ceiling on the number of concurrent spotchecks to perform,
since they'll tend to slow down the transfer a bit, and the
assurance they give is a case of diminishing returns. Might be
nice to have an exponential backoff...
Directory to which we do extractions for spot checks, perhaps
specified separately for final and concurrent. Could be on the
source system or the destination system, and could be in the
same directory hierarchy as the source or destination system,
or another hierarchy entirely.
block sizes to use on producer and consumer
favor security or speed?
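Two of the inputs above are easy to pin down in code: the degree
of spotchecking (an absolute number of files or a percentage) and
the interval between concurrent spotchecks with a ceiling and an
exponential backoff. The python below is only a sketch under
assumptions: the "5%" spec syntax, the function names and the
doubling factor are all invented here, not part of any existing
tool.

```python
def spotcheck_count(spec, files_transferred):
    """Turn a spec such as "250" or "5%" into a number of files to check."""
    spec = spec.strip()
    if spec.endswith("%"):
        fraction = float(spec[:-1]) / 100.0
        count = round(files_transferred * fraction)
    else:
        count = int(spec)
    # Never ask for fewer than zero or more files than were transferred.
    return max(0, min(count, files_transferred))


def backoff_schedule(first_interval, factor=2, ceiling=10):
    """Yield the transferred-file counts at which to run a spotcheck.

    The gap between checks is multiplied by `factor` each time, and at
    most `ceiling` checks are scheduled.
    """
    at = first_interval
    interval = first_interval
    for _ in range(ceiling):
        yield at
        interval *= factor
        at += interval
```

With first_interval=100 and ceiling=4 the checks land after 100,
300, 700 and 1500 files, so the early part of the transfer gets
most of the attention and the slowdown tapers off.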
Outputs
new data on destination directory+host
any needed error messages from during the transfer
verification results
total time of transfer
rate of entire transfer
Algorithm
Preliminaries
Check if quotas are enabled in source. Check both user and
group. If requested archiver does not handle
quotas, ask user if quotas can safely be ignored. Might favor
an archiver that handles quotas if required.
Check if ACLs are in use in source. If requested archiver
does not handle ACLs, ask user if ACLs can safely be ignored
in the transfer. Might favor an archiver that handles ACLs if
required.
Check if linux extended attributes are in use in source.
If requested archiver does not handle extended attributes,
ask user if extended attributes can safely be ignored in the
transfer. Might favor an archiver that handles extended
attributes if required.
Check if source filesystem has files > 2 gigabytes. If
yes, then check if the destination filesystem can support files
greater than 2 gigabytes. If not, ask user if this can safely be
ignored. Might favor an archiver that handles > 2 gig
files, if required.
Check size of source data. Can it reasonably be expected to
fit into the destination directory? If not, inform the user
before we start! Sparse files may complicate this a bit, but a
close estimate is probably sufficient anyway. Might want a
tolerance range for this test.
Might have an option to automatically try to guess what kind
of compression, if any, would be best: gzip, bzip2, probably
not rzip unless perhaps it could be used on individual
files....
Might have an option to automatically try to guess the
throughput of the network connection (if any).
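The large-file and will-it-fit checks above could be sketched in
python roughly as follows. Assumptions: the source tree is
readable as a local directory, st_blocks is in 512-byte units (as
on linux), and the names scan_source and fits, along with the 10%
default tolerance, are made up for illustration. Both the
apparent and the allocated size are returned because an archiver
that does not preserve holes will write sparse files out at their
apparent size.

```python
import os
import shutil
import stat

TWO_GIB = 2 * 1024 ** 3


def scan_source(source_dir):
    """Return (apparent_bytes, allocated_bytes, large_files) for a tree."""
    apparent = 0
    allocated = 0
    large_files = []
    for dirpath, dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, devices, fifos, ...
            apparent += info.st_size
            # Blocks actually allocated, so holes in sparse files
            # are not counted (assumes 512-byte units).
            allocated += info.st_blocks * 512
            # Sizes of 2**31 bytes and up do not fit a signed 32-bit offset.
            if info.st_size >= TWO_GIB:
                large_files.append(path)
    return apparent, allocated, large_files


def fits(needed_bytes, dest_dir, tolerance=0.10):
    """True if needed_bytes plus a safety margin fits in dest_dir's free space."""
    free = shutil.disk_usage(dest_dir).free
    return needed_bytes * (1.0 + tolerance) <= free
```

Passing the apparent size to fits is the pessimistic choice;
passing the allocated size is the optimistic one.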
The data transfer
The actual data transfer, assuming prerequisites are satisfied
Possibly do some "concurrent spotchecking", in parallel,
during the transfer, so we don't have to wait until the entire
transfer is complete to find out there are errors (won't
detect all errors!)
Is the destination filesystem growing at a faster rate than
we expected? If yes, then warn the user!
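One way the growth-rate check might look, as a python sketch:
remember how full the destination filesystem was at the start,
then compare its growth against the number of source bytes sent
so far. This assumes nothing else is writing to the destination
filesystem during the transfer; the class name GrowthMonitor and
the 10% tolerance are invented here. The usage parameter exists
only so something other than shutil.disk_usage can be substituted
for testing.

```python
import shutil


class GrowthMonitor:
    """Warn when the destination grows faster than the data being sent."""

    def __init__(self, dest_dir, tolerance=0.10, usage=shutil.disk_usage):
        self.dest_dir = dest_dir
        self.tolerance = tolerance
        self.usage = usage
        # Bytes in use on the destination filesystem before the transfer.
        self.start_used = usage(dest_dir).used

    def check(self, source_bytes_sent):
        """Return a warning string, or None if growth looks reasonable."""
        grown = self.usage(self.dest_dir).used - self.start_used
        limit = source_bytes_sent * (1.0 + self.tolerance)
        if grown > limit:
            return ("warning: destination grew by %d bytes after only "
                    "%d source bytes were sent" % (grown, source_bytes_sent))
        return None
```

Calling check at the same points as the concurrent spotchecks
would avoid adding another timer.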
Definitely do some spotchecking after the transfer. This
should primarily mirror the preliminaries above:
Data blocks match up in a sample of files?
User ownership
Group ownership
Quotas
Permissions bits
ACLs
Linux extended attributes
Did the filesystem run out of room, despite our estimation?
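For the "data blocks match up in a sample of files" part of the
final spotcheck, a python sketch could pick a random sample of
regular files from the source tree and compare each against its
counterpart byte for byte. It assumes both trees are reachable as
local directories; list_files and final_spotcheck are
hypothetical names, and ownership, permissions, quotas, ACLs and
extended attributes would need their own checks.

```python
import filecmp
import os
import random


def list_files(root):
    """Return sorted paths, relative to root, of regular files under root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                paths.append(os.path.relpath(full, root))
    return sorted(paths)


def final_spotcheck(source_root, dest_root, sample_size, seed=None):
    """Compare a random sample of files; return {relative_path: problem}."""
    candidates = list_files(source_root)
    rng = random.Random(seed)
    chosen = rng.sample(candidates, min(sample_size, len(candidates)))
    problems = {}
    for rel in chosen:
        src = os.path.join(source_root, rel)
        dst = os.path.join(dest_root, rel)
        if not os.path.isfile(dst):
            problems[rel] = "missing"
        # shallow=False forces a comparison of the file contents.
        elif not filecmp.cmp(src, dst, shallow=False):
            problems[rel] = "content differs"
    return problems
```

An empty result only says the sampled files matched; it says
nothing about the files that were not sampled.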