First off, note that backshift is not fast (though it's not super-slow if you run it on PyPy).
Backshift is more about being frugal - that is, a modestly-sized backup disk (or RAID) can go a long way.
Backshift works a bit like an rsync-based backup script
(e.g. Backup.rsync), but it's intended to be
used solely for backups.
Selecting files for backup
- The selection of files to back up is specified in a manner similar to using cpio: by using
the find command with a -print0 option.
It does not operate over ssh directly, but works well over network filesystems
like sshfs, CIFS or NFS.
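To illustrate, here is a minimal Python sketch of consuming find -print0 style input. The helper
name is hypothetical; this is not backshift's actual reader, just the shape of the interface:

    import sys

    def filenames_from_stdin():
        """Yield NUL-terminated filenames read from stdin (find -print0 format)."""
        buf = b""
        while True:
            piece = sys.stdin.buffer.read(65536)
            if not piece:
                break
            buf += piece
            while True:
                nul = buf.find(b"\0")
                if nul == -1:
                    break
                yield buf[:nul]
                buf = buf[nul + 1:]
        if buf:
            # a trailing name with no NUL terminator
            yield buf

    for name in filenames_from_stdin():
        print(name.decode("utf-8", errors="surrogateescape"))

In practice one pipes something like find /home -xdev -print0 into the backup process.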
For each filename read from stdin, the program will:
- ...chop the file into variable-length, content-based blocks averaging about 2 mebibytes in size
- For each such block:
- ...compute a cryptographic digest representing the block
- ...compress the block using xz
- ...save the block to a repository of backed up files, under its cryptographic digest - but only if the repo
doesn't already have a copy of that particular block (digest)
- ...save file metadata to the repository, again compressed with xz
- Because the blocks are variable-length and content-based, if you insert a byte at the beginning of an 8-gigabyte file,
only the first block is recompressed and re-saved.
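As a sketch of the pipeline above, consider the following Python. The rolling-hash parameters, the
SHA-256 digest, and the flat one-file-per-digest repository layout are all assumptions made for the
example, not backshift's actual algorithm or on-disk format:

    import hashlib
    import lzma
    import os

    TARGET = 2 * 1024 * 1024   # ~2 MiB average block size
    WINDOW = 48                # rolling-hash window, in bytes (assumption)
    BASE, MOD = 257, 1 << 31   # toy polynomial-hash parameters (assumption)

    def split_blocks(data):
        """Yield variable-length, content-based blocks of `data` (bytes):
        a boundary falls wherever a rolling hash of the last WINDOW bytes
        is 0 mod TARGET, so boundaries track content rather than offsets."""
        drop = pow(BASE, WINDOW, MOD)    # weight of the byte leaving the window
        rolling = 0
        start = 0
        for i, byte in enumerate(data):
            rolling = (rolling * BASE + byte) % MOD
            if i >= WINDOW:
                rolling = (rolling - data[i - WINDOW] * drop) % MOD
            if i - start + 1 >= WINDOW and rolling % TARGET == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def store_block(repo, block):
        """Compress a block with xz and save it under its digest, but only
        if the repository doesn't already hold that digest (deduplication)."""
        digest = hashlib.sha256(block).hexdigest()
        path = os.path.join(repo, digest + ".xz")
        if not os.path.exists(path):
            with open(path, "wb") as handle:
                handle.write(lzma.compress(block))
        return digest

Because a boundary depends only on the bytes near it, inserting a byte early in a file shifts the
data but leaves later boundaries (and hence later digests) unchanged, which is what makes the
8-gigabyte example above cheap.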
Metadata is stored anew on each backup.
Metadata is stored compressed: directories are only partially compressed, but the filenames and attributes within them are.
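A hedged sketch of what saving per-file metadata might look like; the record fields and the
files/ plus .meta.xz layout are invented for illustration:

    import json
    import lzma
    import os

    def save_file_metadata(repo, path, digests):
        """Record one file's attributes plus its ordered block digests,
        xz-compressed, in a hypothetical files/ area of the repository."""
        st = os.lstat(path)
        record = {
            "path": path,
            "mode": st.st_mode,
            "uid": st.st_uid,
            "gid": st.st_gid,
            "mtime": st.st_mtime,
            "blocks": digests,           # digests of this file's blocks, in order
        }
        blob = lzma.compress(json.dumps(record).encode("utf-8"))
        dest = os.path.join(repo, "files", path.lstrip("/") + ".meta.xz")
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as handle:
            handle.write(blob)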
Your first backup with backshift for a given filesystem will probably be a bit slow. Subsequent
backups should be pretty fast unless a lot of files have changed.
You never need to do another fullsave after your first one for a given set of files.
The author has done fullsaves over wifi (802.11g) - it worked well. Between the xz compression and the
deduplication before the data hits the network, the network use was relatively low.
Incremental behavior
- With rsync --link-dest, one's rsync wrapper normally does incrementals relative to the single
most recent "similar" backup
- Backshift's incrementals are done relative to up to three previous backups, simultaneously:
- The most recent backup found for the (hostname, subset) pair
- The most recent completed backup for the (hostname, subset) pair with > 1 file in it
- The backup with the most files in it, for the (hostname, subset) pair
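A sketch of that selection logic, assuming each saveset summary exposes a start time, a completion
flag, and a file count for its (hostname, subset) pair (the field names are hypothetical):

    def pick_reference_backups(savesets):
        """Return up to three prior savesets for the (hostname, subset) pair:
        most recent, most recent completed with more than one file, and the
        one with the most files. Duplicates collapse to a single entry."""
        picks = []
        if savesets:
            picks.append(max(savesets, key=lambda s: s["start_time"]))
            done = [s for s in savesets if s["completed"] and s["file_count"] > 1]
            if done:
                picks.append(max(done, key=lambda s: s["start_time"]))
            picks.append(max(savesets, key=lambda s: s["file_count"]))
        unique = []
        for saveset in picks:
            if saveset not in unique:
                unique.append(saveset)
        return unique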
Expiration
Expiration will go through each of the following, removing any that are too old:
- Individual chunk files (based on file timestamps)
- Individual "files" metadata (again based on timestamps)
- Individual saveset summary files (based on a timestamp stored within the file: the time of completion,
or of last touch in the case of a system crash or early backup termination)
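In Python, the first two passes might look roughly like this; the chunks/ and files/ directory
names are assumptions, and parsing the timestamp stored inside saveset summary files is omitted:

    import os
    import time

    def expire(repo, max_age_days):
        """Remove repository entries whose file timestamps predate the cutoff.
        Saveset summaries carry their own internal timestamp; handling them
        is omitted here for brevity."""
        cutoff = time.time() - max_age_days * 86400
        for area in ("chunks", "files"):               # hypothetical layout
            for dirpath, _dirnames, filenames in os.walk(os.path.join(repo, area)):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if os.path.getmtime(path) < cutoff:
                        os.remove(path)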