Backshift works a bit like an rsync-based backup script, but unlike rsync, it is intended
solely for backups.
Selecting files for backup
- The selection of files to back up is specified in a manner similar to using cpio: by using
the find command.
- See the example-finds directory for examples of find commands for various operating systems
- I'll likely add a more tar-like mode for conducting backups at some point, but for now the cpio-like method works fine
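As a rough illustration of the cpio-like model, a consumer of a find-generated list might read newline-separated filenames from stdin as sketched below. This is a hypothetical sketch, not Backshift's actual code, and the function name is made up:

```python
import sys

def read_file_list(stream):
    """Yield newline-separated filenames from a stream, skipping blank
    lines -- the cpio-style "list of files on stdin" model."""
    for line in stream:
        name = line.rstrip("\n")
        if name:
            yield name

# A typical pipeline would resemble:
#   find / -xdev -print | <backup program reading the list on stdin>
```

Note that newline-separated lists cannot represent filenames containing newlines; find's -print0 exists for that case, though whether a given tool accepts NUL-separated input varies.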
It does not operate over ssh directly, but works well over network filesystems
like sshfs, CIFS or NFS.
For each filename read from stdin, the program will:
- ...chop the file into variable-length blocks averaging about 2 mebibytes in size
- For each such block:
  - ...compute a cryptographic digest representing the block
- ...compress the block using xz
- ...save the block to a repository of backed up files, under its cryptographic digest - but only if the repo
doesn't already have a copy of that particular block (digest)
- ...save file metadata to the repository, again compressed with xz
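The per-block steps above can be sketched in miniature: split data into variable-length blocks, digest each block, and store it xz-compressed under its digest only if not already present. This is a toy sketch under stated assumptions -- the rolling-sum chunker and SHA-256 here stand in for Backshift's actual chunking and digest algorithms, which may differ:

```python
import hashlib
import lzma
import os

AVG_BLOCK = 2 * 1024 * 1024  # target average block size, ~2 MiB

def split_blocks(data, avg=AVG_BLOCK):
    """Very simplified content-defined chunking: cut wherever a weak
    rolling sum hits a boundary condition, so block boundaries follow
    content rather than fixed offsets (avg should be a power of two)."""
    mask = avg - 1
    blocks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        # Boundary fires with probability ~1/avg per byte; enforce a
        # minimum block size of avg/4 to avoid tiny fragments.
        if (rolling & mask) == mask and i + 1 - start >= avg // 4:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data) or not blocks:
        blocks.append(data[start:])
    return blocks

def store_block(repo_dir, block):
    """Save one block under its digest, compressed with xz (lzma),
    skipping blocks the repository already has -- the dedup step."""
    digest = hashlib.sha256(block).hexdigest()
    path = os.path.join(repo_dir, digest + ".xz")
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(lzma.compress(block))
    return digest
```

Because blocks are addressed by digest, a second backup of unchanged data finds every block already present and writes nothing new to the repository.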
Metadata is stored anew on each backup, so there is no need to sort directories.
Metadata is stored compressed: directories themselves are only partially compressed, but their content is.
Your first backup with backshift for a given filesystem will probably be a bit slow. Subsequent
backups should be pretty fast unless many files have changed.
You never need to do another fullsave after your first one, for a given set of files.
The author has done fullsaves over wifi (802.11g), and it worked well: between the xz compression and the
deduplication applied before the data hits the network, network use was relatively low.
Incremental behavior
- rsync --link-dest incrementals are normally done relative to the single most recent "similar"
backup, as selected by one's rsync wrapper
- Backshift's incrementals are done relative to up to three previous backups, simultaneously:
- The most recent backup found for the (hostname, subset) pair
- The most recent completed backup for the (hostname, subset) pair
- The backup with the most files in it, for the (hostname, subset) pair
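The three-way reference selection above can be sketched as follows. The record shape (dicts with hostname, subset, start_time, finished, file_count) is an assumption for illustration, not Backshift's actual data structures:

```python
def pick_reference_backups(backups, hostname, subset):
    """Pick up to three reference backups for a (hostname, subset) pair:
    the most recent, the most recent completed, and the one holding the
    most files. Duplicates collapse, so fewer than three may return."""
    relevant = [b for b in backups
                if b["hostname"] == hostname and b["subset"] == subset]
    if not relevant:
        return []
    refs = [max(relevant, key=lambda b: b["start_time"])]
    completed = [b for b in relevant if b["finished"]]
    if completed:
        refs.append(max(completed, key=lambda b: b["start_time"]))
    refs.append(max(relevant, key=lambda b: b["file_count"]))
    # The same backup may qualify under more than one criterion.
    unique, seen = [], set()
    for b in refs:
        if id(b) not in seen:
            seen.add(id(b))
            unique.append(b)
    return unique
```

Using several references at once means a file unchanged since any of them can be deduplicated, even if the single most recent backup was interrupted and missed it.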
Expiration will go through each of the following, removing any that are too old:
- Individual chunk files (based on file timestamps)
- Individual "files" files (again, based on timestamps)
- Individual saveset summary files, based on a time-of-completion timestamp stored within the file
(or a last-touch timestamp, in the case of a system crash or early backup termination)
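The timestamp-based portion of expiration can be sketched as below. This is a simplified stand-in for expiring chunk and "files" files by filesystem mtime; saveset summaries would instead parse the completion timestamp stored inside each file, which this sketch does not attempt:

```python
import os
import time

def expire_old_files(directory, max_age_seconds, now=None):
    """Remove regular files in `directory` whose modification time is
    more than `max_age_seconds` in the past; return the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```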