Features include:
- Ability to deduplicate data using variable-sized, content-based blocking (see the chunking
  sketch after this list)
- Just renaming a 1.5 gigabyte file doesn't cause a second copy of that same file to be
stored, unlike rsync-based schemes
- Storing 3 identical copies of your family JPEGs on 3 different computers results in only a
  single copy of the pictures being stored in the repository
- Changing one byte in the middle of a 6 gigabyte file doesn't result in a distinct copy of the whole
file in the repository - only the changed block is stored again
- Compresses deduplicated chunks with xz compression (falling back on bzip2 if necessary; see
  the compression sketch after this list)
- Compresses almost all metadata, again with xz compression (again falling back on bzip2 if necessary)
- Few to no arbitrary limits on how big files can be - even if you're backing up to a
file-size-limited filesystem.
- Ability to expire old data for the repo as a whole.
- Safe, concurrent operation over local or remote filesystems, including but not limited to
  NFS, CIFS and sshfs. The only operation that isn't (designed to be) concurrent-safe is
  expiring old files.
- No big file-list inhale is necessary at the outset, but if you allow one, you'll get a nice
  progress report as a result.
- Hybrid full saves/incrementals, much like what one gets with an rsync --link-dest backup
  script - so an interrupted backup can, in a significant sense, be resumed later
- Ability to not mess up your buffer cache during a backup (planned, not yet fully implemented)
- Far fewer directory entries than a year's worth of daily snapshots from an rsync-based
  backup script would create
- Copying a backup repository with 1 year of daily snapshots from one host to another is far more
practical with backshift than rsync --link-dest
- Input files are selected in a manner similar to cpio, using GNU find with -print0
- Output is created in tar format; a restore is a matter of piping tar output into a tar
  process for extraction (see the pipe sketch after this list). This means there's no restore
  application, other than tar itself, in which to worry about race conditions
- No temporary files are necessary on the client system for backups or restores; even a system
with (nearly?) full disks can be backed up (except on Cygwin, where a large number of
small temporary files are written and read, but there's only one on disk at a given time).
- Easy, no-temp-files (except on Cygwin) backup verification using a pipe to GNU tar's --diff
- Runs on a wide assortment of Python interpreters, including:
  - CPython 3.0 through 3.9 (with or without Cython)
- PyPy3
- The backup process is cautious about symlink races, at least if the Python interpreter has
an os.fstat (CPython and PyPy do have os.fstat)
- Backshift compresses data pretty hard through its deduplication and use of xz. E.g., on
  2014-10-05, I calculated that I have a few gigabytes over 2.3 terabytes of data in use that
  I'm backing up at home, and 1 year of backshift snapshots of that data came to only 2.4
  terabytes, including metadata.
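
Below is a minimal sketch of the variable-sized, content-based blocking idea, combined with a
content-addressed chunk store. The rolling hash, window size, boundary mask, and SHA-256
naming are illustrative assumptions, not backshift's actual algorithm or parameters:

    import hashlib
    import os

    def chunks(data, window=48, mask=0xFFF):
        """Yield variable-sized chunks, cutting wherever the low bits of a simple
        rolling sum over the last `window` bytes match a fixed pattern, so that
        boundaries depend on content rather than on file offsets."""
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            rolling += byte
            if i >= window:
                rolling -= data[i - window]
            if i - start >= window and (rolling & mask) == mask:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def store_chunk(repo_dir, chunk):
        """Save a chunk under its SHA-256 digest; identical chunks - whether from
        renamed files, duplicate files on other machines, or the unchanged parts
        of an edited file - are stored only once."""
        digest = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(repo_dir, digest)
        if not os.path.exists(path):  # deduplication: skip chunks we already have
            with open(path, 'wb') as chunk_file:
                chunk_file.write(chunk)
        return digest

Because boundaries are chosen by content, changing one byte in the middle of a large file only
perturbs the chunks around the edit; the rest of the file still hashes to chunks the
repository already has.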
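Here is a minimal sketch of the xz-with-bzip2-fallback idea: prefer the lzma module, and fall
back to bz2 on interpreters that lack it. The function names and the 4-byte tag are
assumptions for illustration, not backshift's on-disk format:

    try:
        import lzma

        def compress(data):
            return b'xz  ' + lzma.compress(data)
    except ImportError:
        import bz2

        def compress(data):
            return b'bz2 ' + bz2.compress(data)

    def decompress(blob):
        """Pick the decompressor that matches how the blob was written."""
        tag, payload = blob[:4], blob[4:]
        if tag == b'xz  ':
            import lzma
            return lzma.decompress(payload)
        if tag == b'bz2 ':
            import bz2
            return bz2.decompress(payload)
        raise ValueError('unrecognized compression tag: %r' % tag)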
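And a minimal sketch of the pipe-based restore and verification mentioned above. The
produce_cmd argument is a hypothetical stand-in for however you invoke backshift to emit a
tar stream on stdout - consult backshift's actual usage text - but the tar invocations are
ordinary GNU tar:

    import subprocess

    def restore(produce_cmd, dest_dir):
        """Pipe the tar stream straight into tar for extraction - no temporary
        files are needed on the client."""
        producer = subprocess.Popen(produce_cmd, stdout=subprocess.PIPE)
        subprocess.run(['tar', '-x', '-C', dest_dir, '-f', '-'],
                       stdin=producer.stdout, check=True)
        producer.stdout.close()
        producer.wait()

    def verify(produce_cmd):
        """Pipe the same tar stream into GNU tar's --diff, which compares archive
        members against what's currently on disk."""
        producer = subprocess.Popen(produce_cmd, stdout=subprocess.PIPE)
        result = subprocess.run(['tar', '--diff', '-f', '-'],
                                stdin=producer.stdout)
        producer.stdout.close()
        producer.wait()
        return result.returncode == 0
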
Misfeatures:
- There's currently no way for users to restore their own files without placing excessive
  trust in them; the administrator needs to get involved.
- During a backup, users can see each other's files; data is not saved in an encrypted format
  (but note that sshfs restricts who can see a mount)
- It could conceivably be nice to have host- or filesystem-level granularity for expires, but
  this would require saving quite a bit more metadata
- Disk-to-disk only - Disk-to-tape is not supported
- It's not super fast - especially the first time you back up a filesystem.
- Backshift is known not to work on IronPython due to IronPython's lack of a Python standard
  library.