You can do basically the same thing with ssh too. It's slower, but
sometimes easier, and always more secure.
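For instance (a sketch - substitute your own directories, and dest.host.uci.edu just follows the example hostname used below):
cd /src/dir && tar cf - . | ssh dest.host.uci.edu 'cd /dest/dir && tar xpf -'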
The &'s above are preferred over semicolons, because with the
&'s, if you typo on a directory, the command will end relatively
quickly, without doing any actual copying, and won't accidentally copy
from the wrong place, or worse, to the wrong place.
On Linux, of course, tar is already GNU tar, so you can drop the g and
skip the alternate path.
The impact of compression
If you're transferring a lot of data over a low-bandwidth link, it
sometimes improves performance to compress on the source machine, and
uncompress on the dest machine, with something like:
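cd /src/dir && tar cf - . | gzip | ssh dest.host.uci.edu 'cd /dest/dir && gzip -dc | tar xpf -'
(a sketch - gzip is just one choice of compressor; use gzip -1 for less CPU load, or gzip -9 for more compression)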
If you're transferring a mix of compressible and not-very-compressible data, check out can! It looks pretty interesting.
About how well files will compress (a quick test is sketched after these lists):
Examples of things that tend to not be very compressible
Already compressed files:
gzip archives
bzip2 archives
rzip archives
zip archives
Many movie formats (raw DVD images tend to be more compressible than most other formats, if you're in the
habit of backing up your DVD's - and be aware that how compressible a movie is will often be a matter of
what's inside the container format, and is frequently not an attribute of the container format itself, like
.mpeg or .mkv)
Many music formats (see the note above about container formats)
Many picture formats (.gif, .png, .jpeg - but .ppm may be pretty compressible, and .gif is more compressible
than .png or .jpeg)
Largely random data, like that from "cat /dev/random"
Examples of files that do tend to compress well:
Text files:
Many wordprocessor file formats, but OpenOffice formats are sometimes well compressed already
Source code, like that written in C, C++ or Java
Interpreted language code, like that written in Python, Ruby, Perl, or TCL
Binary files:
If you have a file full of a small number of distinct floating point numbers, repeated again and again,
that will often be pretty compressible
Any other binary format that has lots of repetition will tend to be pretty compressible
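If you're not sure how well a particular file will compress, it's easy to test empirically (a sketch using gzip; substitute your compressor of choice):
gzip -9c somefile | wc -c && wc -c < somefile
The closer the two byte counts, the less compressible the file.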
If you want a running update on how things are progressing, you can add
reblock to your pipeline like this:
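cd /src/dir && tar cf - . | reblock -t 1048576 300 | ssh dest.host.uci.edu 'cd /dest/dir && tar xpf -'
(a sketch - the -t gives the progress reports; I'm assuming a blocksize-then-timeout argument convention here, with 1048576 the blocksize and 300 a timeout in seconds, so check reblock's usage message for your version)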
rsync can be really helpful if most of the data in the /dest/dir is
identical to the data in the /src/dir. This opens the door to
doing a full copy via tar|rsh 'tar' or something, with users still
active in the filesystem, and then doing an rsync with the users kicked
out of the filesystem, to just copy over what's changed since the tar
pipeline.
Usage looks like:
cd /src/dir && rsync -a --numeric-ids --compress --progress --rsh=rsh --delete --rsync-path=/dcs/packages/gnu/bin/rsync . dest.host.uci.edu:/dest/dir
I should add that native rsync (i.e., hanging an rsync daemon off of
inetd/xinetd and connecting to that) turns out to be a good choice
for fast data transfers on a gigabit network with jumbo frames
enabled. Despite using NFS v3 over TCP with 8k rsize/wsize, native
rsync was still 344% faster transferring data from a Red Hat 9 system
to an RHEL 3 system.
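Native rsync is addressed with rsync's double-colon syntax, which talks to the daemon directly instead of spawning a remote shell. Something like the following should work (a sketch - dest-module is a hypothetical module name that has to match a stanza in the server's rsyncd.conf):
cd /src/dir && rsync -a --numeric-ids --progress . dest.host.uci.edu::dest-module/
...where the server's /etc/rsyncd.conf has something like:
[dest-module]
    path = /dest/dir
    read only = false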
Compression helps on slow links, but makes things worse on slow CPU's,
same as for the rsh pipelines above. The progress info from rsync
is very different from that of reblock -t, but both are useful.
--rsh defaults to ssh these days.
Copying data over remote filesystems like NFS, GFS, AFS, Intermezzo,
Lustre or other similar filesystems is best avoided if possible; they
tend to be slower than rsh or ssh in most cases. Also, if you're
copying over a remote filesystem, compression (apart from that done by
the filesystem itself) is unlikely to help speed up the copy. However,
some filesystems, including GFS and Lustre, do not allow copying into a
disk-based filesystem directly; you must go through the remote
filesystem. Also, when copying from one remote filesystem to another,
rsync may do excessive reading for files that have already been
transferred. At least in the case of NFS, writing tends to be slower
than reading, so if you have a choice, choose NFS reads over NFS writes.
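For example, if the source is NFS-mounted and the destination is a local disk, run the copy on the destination machine so the NFS traffic is all reads (a sketch - /nfs/src/dir stands in for wherever the source happens to be mounted):
cd /nfs/src/dir && tar cf - . | (cd /dest/dir && tar xpf -)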
Copying from one disk to another on the same system can be achieved with
a tar pipeline or rsync command:
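For example (a sketch - as above, reblock's exact arguments may vary by version):
cd /src/dir && tar cf - . | reblock -t 1048576 300 | (cd /dest/dir && tar xpf -)
...or, with rsync:
rsync -a /src/dir/ /dest/dir/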
You may even want to use an even larger blocksize on the reblock (the
1048576 is the blocksize).
Note: at one time I had hypothesized that this helps because, when
copying from one partition to another on the same disk, the disk will
end up doing many track-to-track seeks unless you use a large buffer
in between the reads and the writes. However, this does not
appear to be true.
Copying from a single partition on a disk to another directory in the
same partition of the same disk... If you don't want to just mv :), then
you have a couple of options. You can do the same thing as above on
copying to a different partition on the same disk, or you have the
option of using the following, which creates hard links (one file, two
names) for each file you are "copying":
cd /src/dir && find . -print | cpio -pdlum /dst/dir
(BTW, this one is untested, and I don't use it that often either, so use
with caution)
OS-specific Notes
SunOS
If you're transferring lots of small files to a Sun, it can speed things
up considerably to do:
fastfs /dest/dir fast
...followed by:
fastfs /dest/dir slow
...when you are done. Note that this makes the filesystem somewhat more
prone to losing data in the event of a crash, though the relative
fragility goes away after you run fastfs /dest/dir slow
or reboot. It's not all that dangerous though - the fastfs command is
based on something a Sun backup+restore product uses, and some
filesystems operate in such a mode all the time.
AIX 5.1 (and RHEL 3)
Copying from AIX 5.1 to AIX 5.1 using rsh is horribly slow - less
than 1 Megabit/second on 100BaseT FD. However, using ssh gets
multiple megabits/second. But copying from AIX 5.1 to RHEL
3 is faster with rsh than with ssh. This may mean that AIX 5.1
has a slow rshd, but its rsh is OK.