• Automated tests
  • Specifying what files to back up - overview (see "Example use" further down this page):
  • Dependencies:
  • Features include:
    1. Ability to deduplicate data using variable-sized, content-based blocking
      • Just renaming a 1.5 gigabyte file doesn't cause a second copy of that same file to be stored, unlike rsync-based schemes
      • Storing 3 identical copies of your family JPEGs on 3 different computers results in a single copy of the pictures being stored in the repository
      • Changing one byte in the middle of a 6 gigabyte file doesn't result in a distinct copy of the whole file in the repository - only the changed block is stored again
    2. Compresses deduplicated chunks with xz compression
    3. Compresses almost all metadata, again with xz compression
    4. Few to no arbitrary limits on how big files can be - even if you're backing up to a file-size-limited filesystem.
    5. Ability to expire old data for the repo as a whole (planned, not yet implemented; granularity is the whole repo, not per host or per filesystem)
    6. Safe, concurrent operation over local or remote filesystems, including but not limited to NFS, CIFS and sshfs. The only operation that isn't (designed to be) concurrent-safe is expiring old files.
    7. No big filelist inhale at the outset, unless you request a progress report during a backup - similar to rsync in this regard
    8. Hybrid full saves/incrementals, much like what one gets with an rsync --link-dest backup script - so an interrupted backup can, in a significant sense, be resumed later
    9. Ability to not mess up your buffer cache during a backup (planned, not yet fully implemented)
    10. Far fewer directory entries than a year's worth of daily snapshots from an rsync-based backup script would create
    11. Copying a backup repository with 1 year of daily snapshots from one host to another is far more practical with backshift than with rsync --link-dest
    12. Input files are selected in a manner similar to cpio, using GNU find with -print0 or -printf (see the pipeline sketch after this list)
    13. Output is created in GNU tar format; a restore is a matter of piping backshift's tar output into a tar process for extraction. This means the only restore application whose race conditions you need to worry about is tar itself
    14. No temporary files are necessary on the client system for backups or restores; even a system with (nearly) full disks can be backed up
    15. Easy, no-temp-files (except on Cygwin) backup verification using a pipe to GNU tar's --diff (see the verification sketch after this list)
    16. Runs on a wide assortment of Python interpreters, including:
      • CPython 2.x (with or without Cython, with or without Psyco)
      • CPython 3.x (with or without Cython)
      • PyPy 1.4.x and 1.5
      • Jython 2.5.2 at -r 7288, but not the released Jython 2.5.2; in other words, you would need to check out Jython and build it yourself
    17. Backshift is known not to work on IronPython, due to IronPython's lack of a proper standard library.
    18. The backup process is cautious about symlink races, at least if the Python interpreter has an os.fstat (notably, Jython does not have os.fstat; CPython 2, CPython 3 and PyPy do)
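
    Items 12 and 13 suggest a pipeline like the following. This is a minimal sketch, not the authoritative usage: the backshift option names shown (--save-directory, --backup, --produce-tar) and the repo paths are assumptions for illustration; see "Example use" further down this page for the real invocations. The find and tar plumbing is standard GNU tooling.

      # NOTE: the backshift option names below are illustrative assumptions;
      # consult "Example use" for the actual flags.
      # Back up one filesystem: GNU find feeds NUL-terminated filenames to
      # backshift on stdin, cpio-style; -xdev keeps find on one filesystem.
      cd /home
      find . -xdev -print0 | backshift --save-directory /backups/repo --backup

      # Restore: backshift writes GNU tar format on stdout, so extraction is
      # just a pipe into tar - no dedicated restore application. (Selecting
      # which backup to read is omitted from this sketch.)
      mkdir -p /tmp/restored && cd /tmp/restored
      backshift --save-directory /backups/repo --produce-tar | tar -xf -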
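
    Item 15's verification follows the same pattern, under the same assumptions about backshift's option names: GNU tar's --diff compares the archive arriving on stdin against the live filesystem, so no temporary files are needed (except on Cygwin, as noted above).

      # Verify: compare the stored backup against the live files in place.
      cd /home
      backshift --save-directory /backups/repo --produce-tar | tar --diff -f -
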
  • Misfeatures:
    1. There's currently no way for users to restore their own files without being granted excessive trust; the administrator needs to get involved.
    2. During a backup, users can see each other's files; data is not saved in an encrypted format (but note that sshfs restricts who can see a mount)
    3. It could conceivably be nice to have host- or filesystem-granularity on expires, but this would require quite a bit more metadata to be saved
    4. Disk-to-disk only; disk-to-tape is not supported
  • The gist of how it works
  • Example use
  • Copying a backshift repo from one place to another (appropriate when your backup storage server needs to be upgraded). The -t option is important: mtimes must be preserved to get accurate expiration. The -l is superfluous:
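
    One plausible rsync invocation is sketched below; the repo paths and hostname are illustrative, not taken from the original:

      # -r recurses; -t preserves mtimes, which expiration relies on;
      # -l (recreate symlinks as symlinks) is superfluous for a backshift
      # repo but harmless.
      rsync -r -t -l /backups/repo/ newserver:/backups/repo/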