First off, note that backshift is not fast (though it's not super-slow if you run it on PyPy or use the Cython chunking).
Backshift is more about being frugal with disk space - that is, a modestly sized backup disk (or RAID) can go a long way.
Backshift works a bit like an rsync-based backup script (e.g. Backup.rsync), but it's intended to be
used solely for backups.
Selecting files for backup
- The selection of files to back up is specified in a manner similar to using cpio: pipe the output of
the find command, run with its -print0 option, into backshift's stdin.
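For illustration, here is a minimal sketch (not backshift's actual code) of consuming find -print0 output by splitting the stream on NUL bytes:

    import sys

    # Filenames from "find ... -print0" arrive on stdin separated by NUL
    # bytes, which (unlike newlines) cannot appear inside a filename.
    # A real tool would read incrementally; slurping keeps the sketch short.
    data = sys.stdin.buffer.read()
    for raw_name in data.split(b"\0"):
        if raw_name:
            # surrogateescape round-trips filenames that aren't valid UTF-8
            filename = raw_name.decode("utf-8", errors="surrogateescape")
            print(filename)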
It does not operate over ssh directly, but works well over network filesystems
like sshfs, CIFS or NFS.
For each filename read from stdin, the program will:
- ...chop the file into variable-length, content-based blocks averaging about 2 mebibytes in size
- For each such block:
- Compute a cryptographic digest representing the block
- If that block is not yet present in the repo (under its cryptographic digest):
- ...compress the block using xz
- ...save the block to the repository of backed up files, under its cryptographic digest.
- If the block was already present:
- The block's already-compressed chunk file is touched to forestall expiration.
- ...save file metadata to the repository, again compressed with xz
- Because the blocks are variable-length and content-based, if you insert a byte at the beginning of an 8 gigabyte file,
only the first block is recompressed and re-saved.
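A minimal sketch of the idea follows; the rolling hash, cut rule, digest algorithm, and on-disk layout here are all illustrative assumptions, not backshift's actual ones:

    import hashlib
    import lzma
    import os

    AVERAGE_BLOCK = 2 * 1024 * 1024  # aim for blocks averaging ~2 mebibytes

    def chunks(data):
        """Yield variable-length, content-based blocks: a boundary is
        declared wherever a rolling hash of recent bytes hits a fixed
        pattern, so boundaries move with content, not with offsets."""
        rolling = 0
        start = 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
            # Low 21 bits all ones happens ~once per 2 MiB of random data.
            if rolling % AVERAGE_BLOCK == AVERAGE_BLOCK - 1:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def store_block(repo, block):
        """Save a block under its digest, or just touch the existing
        chunk file to forestall expiration if it's already present."""
        digest = hashlib.sha256(block).hexdigest()
        path = os.path.join(repo, "blocks", digest[:2], digest + ".xz")
        if os.path.exists(path):
            os.utime(path)  # already saved: refresh its timestamp
        else:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(lzma.compress(block))  # lzma produces xz format
        return digest

With boundaries determined by content, an insertion near the start of a file shifts only the first cut; later blocks still hash to the same digests and are found already stored.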
Metadata is stored anew on each backup.
Metadata is stored compressed - directories are only partially compressed, but the filenames and attributes within them are.
Each directory is compressed separately, minimizing storage requirements while still allowing rapid partial restores.
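As a sketch of that per-directory scheme (the filenames, layout, and serialization here are assumptions, not backshift's real format):

    import json
    import lzma
    import os

    def save_dir_metadata(backup_root, rel_dir, entries):
        """Write one directory's filenames and attributes as its own xz
        stream; a partial restore then decompresses only the directories
        it actually needs, not the whole backup's metadata."""
        out_dir = os.path.join(backup_root, "files", rel_dir)
        os.makedirs(out_dir, exist_ok=True)
        blob = json.dumps(entries, sort_keys=True).encode("utf-8")
        with open(os.path.join(out_dir, "metadata.xz"), "wb") as fh:
            fh.write(lzma.compress(blob))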
Your first backup with backshift for a given filesystem will probably be a bit slow. Subsequent
backups should be pretty fast unless there have been a lot of file changes.
You never need to do another full save after your first one, for a given set of files.
The author has done full saves over Wi-Fi (802.11g) - it worked well. Between the xz compression and the
deduplication before the data hits the network, network use was relatively low.
Incremental behavior
- rsync --link-dest incrementals are normally done relative to the single most recent "similar"
backup, as selected by one's rsync wrapper
- Backshift's incrementals are done relative to up to three previous backups, simultaneously:
- The most recent backup found for the (hostname, subset) pair
- The most recent completed backup for the (hostname, subset) pair with > 1 file in it
- The backup with the most files in it, for the (hostname, subset) pair
- Any file that still has the same modification time and length is considered "unchanged", and is not re-read.
- Instead, the file's hashes are obtained from one of the three backups listed above (preferring more recent ones),
and the chunks corresponding to those hashes are touched, updating their timestamps to forestall
expiration (see the sketch after this list).
- The file's metadata is saved to the repo just as if the file had been read, chunked, compressed and saved
normally.
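Below is a rough sketch of this incremental rule; the prior-backup lookup, entry fields, and chunk_path layout are hypothetical stand-ins, not backshift's real data structures:

    import os

    def chunk_path(repo, digest):
        # Hypothetical chunk layout, matching the earlier storage sketch.
        return os.path.join(repo, "blocks", digest[:2], digest + ".xz")

    def find_prior_entry(path, prior_backups):
        """Scan the (up to three) candidate backups, most recent first,
        for a record whose mtime and length still match the live file."""
        st = os.lstat(path)
        for backup in prior_backups:
            entry = backup.get(path)
            if entry and entry["mtime"] == st.st_mtime and entry["size"] == st.st_size:
                return entry
        return None

    def back_up_file(path, prior_backups, repo):
        entry = find_prior_entry(path, prior_backups)
        if entry is None:
            return None  # changed or new: caller reads, chunks and saves it
        # Unchanged: reuse the recorded digests without re-reading the file,
        # touching each chunk so expiration knows it's still referenced.
        for digest in entry["digests"]:
            os.utime(chunk_path(repo, digest))
        return entry

Touching the reused chunks is what keeps deduplicated data alive across backups; expiration (below) relies on those timestamps.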
Expiration
Expiration allows you to remove old data you no longer care about, to free up space for new data.
You can set a retention interval for the repo as a whole. You cannot set different retention intervals
for different hosts or different filesystems.
The expiration process will go through each of the following, removing any that are too old:
- Individual chunk files (based on file timestamps)
- Individual "files" metadata (based on the modification time of the top-level files directory for the backup in question)
- Individual saveset summary files (based on a timestamp stored within the file: the time of completion, or the
time of last touch in the case of a system crash or early backup termination)
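As a sketch of the first of these, chunk expiration by file timestamp (the repo layout and retention handling are assumed for illustration):

    import os
    import time

    def expire_chunks(repo, retention_seconds):
        """Remove chunk files whose timestamps have aged past the repo-wide
        retention interval. Chunks still referenced by recent backups keep
        getting touched, so only genuinely stale data ages out."""
        cutoff = time.time() - retention_seconds
        for dirpath, _subdirs, filenames in os.walk(os.path.join(repo, "blocks")):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)

Because backups touch every chunk they reference (including reused ones), a chunk's timestamp effectively records the last backup that needed it.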