First off, note that backshift is not fast (though it's not super-slow if you run it on PyPy or use the Cython chunking).
Backshift is more about being frugal with disk space - that is, a modestly sized backup disk (or RAID) can go a long way.
Backshift works a bit like an rsync-based backup script (e.g. Backup.rsync), but it's intended to be
used solely for backups.
Selecting files for backup
- The selection of files to back up is specified in a manner similar to using cpio: pipe the output of
the find command, run with its -print0 option, into backshift's stdin.
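For illustration, here is a minimal sketch (not backshift's actual code) of consuming find -print0 output by splitting the stream on NUL bytes:

    import sys

    # Filenames from "find ... -print0" arrive on stdin separated by NUL
    # bytes, which (unlike newlines) cannot appear inside a filename.
    # A real tool would read incrementally; slurping keeps the sketch short.
    data = sys.stdin.buffer.read()
    for raw_name in data.split(b"\0"):
        if raw_name:
            # surrogateescape round-trips filenames that aren't valid UTF-8
            filename = raw_name.decode("utf-8", errors="surrogateescape")
            print(filename)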
It does not operate over ssh directly, but works well over network filesystems
like sshfs, CIFS or NFS.
For each filename read from stdin, the program will:
- ...chop the file into variable-length, content-based blocks averaging about 2 mebibytes in size
- For each such block:
- Compute a cryptographic digest representing the block
- If that block is not yet present in the repo (under its cryptographic digest):
- ...compress the block using xz
- ...save the block to the repository of backed up files, under its cryptographic digest.
- If the block was already present:
- The block's already-compressed chunk file is touched to forestall expiration.
- ...save file metadata to the repository, again compressed with xz
- Because the blocks are variable-length and content-based, if you insert a byte at the beginning of an 8 gigabyte file,
only the first block is recompressed and re-saved.
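A minimal sketch of the idea follows; the rolling hash, cut rule, digest algorithm, and on-disk layout here are all illustrative assumptions, not backshift's actual ones:

    import hashlib
    import lzma
    import os

    AVERAGE_BLOCK = 2 * 1024 * 1024  # aim for blocks averaging ~2 mebibytes

    def chunks(data):
        """Yield variable-length, content-based blocks: a boundary is
        declared wherever a rolling hash of recent bytes hits a fixed
        pattern, so boundaries move with content, not with offsets."""
        rolling = 0
        start = 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
            # Low 21 bits all ones happens ~once per 2 MiB of random data.
            if rolling % AVERAGE_BLOCK == AVERAGE_BLOCK - 1:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def store_block(repo, block):
        """Save a block under its digest, or just touch the existing
        chunk file to forestall expiration if it's already present."""
        digest = hashlib.sha256(block).hexdigest()
        path = os.path.join(repo, "blocks", digest[:2], digest + ".xz")
        if os.path.exists(path):
            os.utime(path)  # already saved: refresh its timestamp
        else:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(lzma.compress(block))  # lzma produces xz format
        return digest

With boundaries determined by content, an insertion near the start of a file shifts only the first cut; later blocks still hash to the same digests and are found already stored.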
Metadata is stored anew on each backup.
Metadata is stored compressed - directories are only partially compressed, but the filenames and attributes within them are.
Each directory is compressed separately, minimizing storage requirements while still allowing rapid partial restores.
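As a sketch of that per-directory scheme (the filenames, layout, and serialization here are assumptions, not backshift's real format):

    import json
    import lzma
    import os

    def save_dir_metadata(backup_root, rel_dir, entries):
        """Write one directory's filenames and attributes as its own xz
        stream; a partial restore then decompresses only the directories
        it actually needs, not the whole backup's metadata."""
        out_dir = os.path.join(backup_root, "files", rel_dir)
        os.makedirs(out_dir, exist_ok=True)
        blob = json.dumps(entries, sort_keys=True).encode("utf-8")
        with open(os.path.join(out_dir, "metadata.xz"), "wb") as fh:
            fh.write(lzma.compress(blob))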
Your first backup with backshift for a given filesystem will probably be a bit slow. Subsequent
backups should be pretty fast unless there have been a lot of file changes.
You never need to do another full save after your first one, for a given set of files.
The author has done full saves over Wi-Fi (802.11g) - it worked well. Between the xz compression and the
deduplication before the data hits the network, network use was relatively low.
Incremental behavior
- rsync --link-dest incrementals are normally done relative to the single most recent "similar"
backup, as selected by one's rsync wrapper
- Backshift's incrementals are done relative to up to three previous backups, simultaneously:
- The most recent backup found for the (hostname, subset) pair
- The most recent completed backup for the (hostname, subset) pair with > 1 file in it
- The backup with the most files in it, for the (hostname, subset) pair
- Any file that still has the same modification time and length is considered "unchanged", and is not re-read.
- Instead, the file's hashes are obtained from one of the three backups listed above (preferring more recent ones),
and the chunks corresponding to those hashes are touched, updating their timestamps to forestall
expiration (see the sketch after this list).
- The file's metadata is saved to the repo just as if the file had been read, chunked, compressed and saved
normally.
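Below is a rough sketch of this incremental rule; the prior-backup lookup, entry fields, and chunk_path layout are hypothetical stand-ins, not backshift's real data structures:

    import os

    def chunk_path(repo, digest):
        # Hypothetical chunk layout, matching the earlier storage sketch.
        return os.path.join(repo, "blocks", digest[:2], digest + ".xz")

    def find_prior_entry(path, prior_backups):
        """Scan the (up to three) candidate backups, most recent first,
        for a record whose mtime and length still match the live file."""
        st = os.lstat(path)
        for backup in prior_backups:
            entry = backup.get(path)
            if entry and entry["mtime"] == st.st_mtime and entry["size"] == st.st_size:
                return entry
        return None

    def back_up_file(path, prior_backups, repo):
        entry = find_prior_entry(path, prior_backups)
        if entry is None:
            return None  # changed or new: caller reads, chunks and saves it
        # Unchanged: reuse the recorded digests without re-reading the file,
        # touching each chunk so expiration knows it's still referenced.
        for digest in entry["digests"]:
            os.utime(chunk_path(repo, digest))
        return entry

Touching the reused chunks is what keeps deduplicated data alive across backups; expiration (below) relies on those timestamps.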
Expiration
Expiration allows you to remove old data you no longer care about, to free up space for new data.
You can set a retention interval for the repo as a whole. You cannot set different retention intervals
for different hosts or different filesystems.
The expiration process will go through each of the following, removing any that are too old:
- Individual chunk files (based on file timestamps)
- Individual "files" metadata (based on the modification time of the top-level files directory for the backup in question)
- Individual saveset summary files (based on a timestamp stored within the file: the time of completion, or the
time of last touch in the case of a system crash or early backup termination)
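As a sketch of the first of these, chunk expiration by file timestamp (the repo layout and retention handling are assumed for illustration):

    import os
    import time

    def expire_chunks(repo, retention_seconds):
        """Remove chunk files whose timestamps have aged past the repo-wide
        retention interval. Chunks still referenced by recent backups keep
        getting touched, so only genuinely stale data ages out."""
        cutoff = time.time() - retention_seconds
        for dirpath, _subdirs, filenames in os.walk(os.path.join(repo, "blocks")):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)

Because backups touch every chunk they reference (including reused ones), a chunk's timestamp effectively records the last backup that needed it.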