You can do basically the same thing with ssh too. It's slower, but
sometimes easier, and always more secure.
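For instance (a sketch - substitute your own directories, and dest.host.uci.edu just follows the example hostname used below):
cd /src/dir && tar cf - . | ssh dest.host.uci.edu 'cd /dest/dir && tar xpf -'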
The &'s above are preferred over semicolons, because with the
&'s, if you typo on a directory, the command will end relatively
quickly, without doing any actual copying, and won't accidentally copy
from the wrong place, or worse, to the wrong place.
On Linux, of course, tar is already GNU tar, so you can drop the g and
skip the alternate path.
The impact of compression
If you're transferring a lot of data over a low-bandwidth link, it
sometimes improves performance to compress on the source machine, and
uncompress on the dest machine, with something like:
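cd /src/dir && tar cf - . | gzip | ssh dest.host.uci.edu 'cd /dest/dir && gzip -dc | tar xpf -'
(a sketch - gzip is just one choice of compressor; use gzip -1 for less CPU load, or gzip -9 for more compression)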
If you're transferring a mix of compressible and not-very-compressible data, check out can! It looks pretty interesting.
About how well files will compress (a quick test is sketched after these lists):
Examples of things that tend to not be very compressible
Already compressed files:
gzip archives
bzip2 archives
rzip archives
zip archives
Many movie formats (raw DVD images tend to be more compressible than most other formats, if you're in the
habit of backing up your DVD's - and be aware that how compressible a movie is will often be a matter of
what's inside the container format, and is frequently not an attribute of the container format itself, like
.mpeg or .mkv)
Many music formats (see the note above about container formats)
Many picture formats (.gif, .png, .jpeg - but .ppm may be pretty compressible, and .gif is more compressible
than .png or .jpeg)
Largely random data, like that from "cat /dev/random"
Examples of files that do tend to compress well:
Text files:
Many wordprocessor file formats, but OpenOffice formats are sometimes well compressed already
Source code, like that written in C, C++ or Java
Interpreted language code, like that written in Python, Ruby, Perl, or TCL
Binary files:
If you have a file full of a small number of distinct floating point numbers, repeated again and again,
that will often be pretty compressible
Any other binary format that has lots of repetition will tend to be pretty compressible
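If you're not sure how well a particular file will compress, it's easy to test empirically (a sketch using gzip; substitute your compressor of choice):
gzip -9c somefile | wc -c && wc -c < somefile
The closer the two byte counts, the less compressible the file.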
If you want a running update on how things are progressing, you can add
reblock to your pipeline like this:
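cd /src/dir && tar cf - . | reblock -t 1048576 300 | ssh dest.host.uci.edu 'cd /dest/dir && tar xpf -'
(a sketch - the -t gives the progress reports; I'm assuming a blocksize-then-timeout argument convention here, with 1048576 the blocksize and 300 a timeout in seconds, so check reblock's usage message for your version)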
rsync can be really helpful if most of the data in the /dest/dir is
identical to the data in the /src/dir. This opens the door to
doing a full copy via tar|rsh 'tar' or something, with users still
active in the filesystem, and then doing an rsync with the users kicked
out of the filesystem, to just copy over what's changed since the tar
pipeline.
Usage looks like:
cd /src/dir && rsync -a --numeric-ids --compress --progress --rsh=rsh --delete --rsync-path=/dcs/packages/gnu/bin/rsync . dest.host.uci.edu:/dest/dir
I should add that native rsync (i.e., hanging an rsync daemon off of
inetd/xinetd and connecting to that) turns out to be a good choice
for fast data transfers on a gigabit network with jumbo frames
enabled. Despite using NFS v3 over TCP with 8k rsize/wsize, native
rsync was still 344% faster transferring data from a Red Hat 9 system
to an RHEL 3 system.
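Native rsync is addressed with rsync's double-colon syntax, which talks to the daemon directly instead of spawning a remote shell. Something like the following should work (a sketch - dest-module is a hypothetical module name that has to match a stanza in the server's rsyncd.conf):
cd /src/dir && rsync -a --numeric-ids --progress . dest.host.uci.edu::dest-module/
...where the server's /etc/rsyncd.conf has something like:
[dest-module]
    path = /dest/dir
    read only = false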
Compression helps on slow links, but makes things worse on slow CPU's,
same as for the rsh pipelines above. The progress info from rsync
is very different from that of reblock -t, but both are useful.
--rsh defaults to ssh these days.
Copying data over remote filesystems like NFS, GFS, AFS, Intermezzo,
Lustre or other similar filesystems is best avoided if possible; they
tend to be slower than rsh or ssh in most cases. Also, if you're
copying over a remote filesystem, compression (apart from that done by
the filesystem itself) is unlikely to help speed up the copy. However,
some filesystems, including GFS and Lustre, do not allow copying into a
disk-based filesystem directly; you must go through the remote
filesystem. Also, when copying from one remote filesystem to another,
rsync may do excessive reading for files that have already been
transferred. At least in the case of NFS, writing tends to be slower
than reading, so if you have a choice, choose NFS reads over NFS writes.
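For example, if the source is NFS-mounted and the destination is a local disk, run the copy on the destination machine so the NFS traffic is all reads (a sketch - /nfs/src/dir stands in for wherever the source happens to be mounted):
cd /nfs/src/dir && tar cf - . | (cd /dest/dir && tar xpf -)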
Copying from one disk to another on the same system can be achieved with
a tar pipeline or rsync command:
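For example (a sketch - as above, reblock's exact arguments may vary by version):
cd /src/dir && tar cf - . | reblock -t 1048576 300 | (cd /dest/dir && tar xpf -)
...or, with rsync:
rsync -a /src/dir/ /dest/dir/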
You may even want to use an even larger blocksize on the reblock (the
1048576 is the blocksize).
Note: at one time I had hypothesized that this helps because, when
copying from one partition to another on the same disk, the disk will
end up doing many track-to-track seeks unless you use a large buffer
in between the reads and the writes. However, this does not
appear to be true.
Copying from a single partition on a disk to another directory in the
same partition of the same disk... If you don't want to just mv :), then
you have a couple of options. You can do the same thing as above on
copying to a different partition on the same disk, or you have the
option of using the following, which creates hard links (one file, two
names) for each file you are "copying":
cd /src/dir && find . -print | cpio -pdlum /dst/dir
(BTW, this one is untested, and I don't use it that often either, so use
with caution)
OS-specific Notes
SunOS
If you're transferring lots of small files to a Sun, it can speed things
up considerably to do:
fastfs /dest/dir fast
...followed by:
fastfs /dest/dir slow
...when you are done. Note that this makes the filesystem somewhat more
prone to losing data in the event of a crash, though the relative
fragility goes away after you run fastfs /dest/dir slow
or reboot. It's not all that dangerous though - the fastfs command is
based on something a Sun backup+restore product uses, and some
filesystems operate in such a mode all the time.
AIX 5.1 (and RHEL 3)
Copying from AIX 5.1 to AIX 5.1 using rsh is horribly slow - less
than 1 Megabit/second on 100BaseT FD. However, using ssh gets
multiple megabits/second. But copying from AIX 5.1 to RHEL
3 is faster with rsh than with ssh. This may mean that AIX 5.1
has a slow rshd, but its rsh is OK.