Automated tests
- Generally speaking, the automated tests are more stringent than what production use will
  require, so if you just want to do some backups and restores without first running the
  tests on your system, that's fine.
- It's not a bad idea to run the automated tests on your system before trusting
backshift for production use though.
- To run the automated tests, you need at least one python interpreter located in one of the
  spots that ./this-interpreter knows about - you can see where it looks by reading its
  (bash) source.
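- For example, to get a rough idea of which interpreter locations it probes (just a generic
  text search, assuming the script names its candidate interpreters literally):
  - egrep -i 'python|pypy|jython' ./this-interpreter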
Specifying what files to back up - overview (see "Example use", further down this page):
- For file-count-based progress, you need a find that supports -print0; this includes GNU find
  (which most Linuxes have) and the find command found natively on at least two of the BSDs.
- For file-size-based progress, you have two options:
- Use a "find" command that supports -printf; this includes GNU find but not FreeBSD's find. Use
such a find command with the full-prestat progress mode.
- Use a "find" command that supports -print0, in the full-poststat progress mode. This is a little
slower than full-prestat, but should be OK in most situations.
- Faking find -print0:
- You can kind of fake having -print0 with "find / -xdev -print | tr '\012' '\0'".
- Of course, this is going to make a mess of files that have newlines in their filenames - which
  should be rare, but isn't impossible; see the alternative below.
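- Alternatively, a find that supports "-exec utility {} +" (POSIX requires this) can produce
  NUL-terminated output that's safe even for filenames containing newlines, using only the
  standalone printf utility:
  - find / -xdev -exec printf '%s\0' {} +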
- Generally speaking, file-size-based progress is more accurate than file-count-based progress.
Dependencies:
- A suitable find command.
- Backshift is going to want an xz binary to do compression with, at least until there's a python
module that directly supports xz.
- Filesystems - for remote saves
- sshfs (see the example mount below)
- CIFS
- NFS
- ...but see the backshift notes for your OS, if any.
- For building / testing:
- You may want SVN (Subversion)
- m4 with which to create the pure python and/or cython rolling checksum modules
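- For instance, a backup to a repository reached via sshfs might look like this (the hostname
  and paths are hypothetical, and the first use of a save directory also needs --init-savedir;
  see "Example use" below):
  - sshfs backupserver:/tank/backups /mnt/backups
  - find / -xdev -print0 | backshift --save-directory /mnt/backups/save-directory --backup --subset slash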
Features include:
- Ability to deduplicate data using variable-sized, content-based blocking
- Just renaming a 1.5 gigabyte file doesn't cause a second copy of that same file to be
stored, unlike rsync-based schemes
- Storing 3 identical copies of your family jpegs on 3 different computers results in a single copy of the
  pictures being stored in the repository
- Changing one byte in the middle of a 6 gigabyte file doesn't result in a distinct copy of the whole
file in the repository - only the changed block is stored again
- Compresses deduplicated chunks with xz compression
- Compresses almost all metadata, again with xz compression
- Few to no arbitrary limits on how big files can be - even if you're backing up to a
  file-size-limited filesystem.
- Ability to expire old data for the repo as a whole (planned, not yet implemented - not down to a host
or filesystem granularity, just repo-granularity)
- Safe, concurrent operation over local or remote filesystems, including but not limited to: NFS, CIFS
  and sshfs. The only operation that isn't (designed to be) concurrent-safe is expiring old files.
- No big filelist inhale at the outset, unless you request a progress report during a backup - similar
to rsync in this regard
- Hybrid fullsaves/incrementals, much like what one gets with an rsync --link-dest backup script - so an
interrupted backup can in a significant sense be subsequently resumed
- Ability to not mess up your buffer cache during a backup (planned, not yet fully implemented)
- A far smaller number of directory entries than a year's worth of daily snapshots with an rsync-based
backup script would give
- Copying a backup repository with 1 year of daily snapshots from one host to another is far more
practical with backshift than rsync --link-dest
- Input files are selected in a manner similar to cpio, using GNU find with -print0 or -printf
- Output is created in GNU tar format; a restore is a matter of piping tar output into a tar process for
  extraction. This means there's no restore application, other than tar itself, to worry about
  race conditions in
- No temporary files are necessary on the client system for backups or restores; even a system
with (nearly?) full disks can be backed up
- Easy, no-temp-files (except on Cygwin) backup verification using a pipe to GNU tar's --diff
- Runs on a wide assortment of Python interpreters, including:
- CPython 2.x (with or without Cython, with or without Psyco)
- CPython 3.x (with or without Cython)
- PyPy 1.4.x and 1.5.
- Jython 2.5.2 at -r 7288, but not the Jython 2.5.2 release; IOW, you would need to check out
  Jython and build it yourself.
- Backshift is known not to work on IronPython, due to IronPython's lack of a proper standard
library.
- The backup process is cautious about symlink races, at least if the Python interpreter has
  an os.fstat (notably, Jython does not have an os.fstat; CPython 2 and 3, and PyPy, do have
  os.fstat)
Misfeatures:
- There's currently no way for users to restore their own files without being granted excessive
  trust; the administrator needs to get involved.
- During a backup, users can see each others' files; data is not saved in an encrypted format
(but note that sshfs restricts who can see a mount)
- It could conceivably be nice to have host- or filesystem-granularity on expires, but this would require
  quite a bit more metadata to be saved
- Disk-to-disk only - Disk-to-tape is not supported
The gist of how it works
- Backshift works a bit like an rsync-based backup script, but it's intended to be used solely
for backups.
- Selecting files
- The selection of files to backup is specified in a manner similar to using cpio: by using
the find command.
- See the example-finds directory for examples of find commands for various OSes
- It does not operate over ssh directly, but works well over network filesystems
like sshfs, CIFS or NFS.
- For each filename read from stdin, the program chops the file into variable-length blocks
  and compresses them individually, before writing them to a repository of backed up files
  (see the chunking sketch at the end of this section).
- Metadata is stored anew on each backup. For this reason, there is no need to sort directories.
- Your first backup with backshift for a given filesystem will probably be a bit slow. Subsequent
  backups should be pretty fast unless there have been a lot of file changes.
- You never need to do another fullsave after your first one, for a given set of files.
- The author has done fullsaves over wifi (802.11g) - it worked well. Between the xz compression and the
deduplication before the data hits the network, the network use was relatively low.
- Incremental behavior
- rsync --link-dest incrementals are normally done relative to the single most recent "similar"
  backup, as selected by one's rsync wrapper script
- Backshift's incrementals are done relative to up to three previous backups, simultaneously:
- The most recent backup found for the (hostname, subset) pair
- The most recent completed backup for the (hostname, subset) pair
- The backup with the most files in it, for the (hostname, subset) pair
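- A rough sketch of the variable-length, content-based chunking idea, in python - illustrative
  only: backshift's actual rolling checksum (the m4-generated pure python / cython modules)
  differs in detail, and every constant and name below is an assumption chosen for the demo:

      #!/usr/bin/env python
      # Sketch of content-based, variable-length chunking - NOT backshift's
      # real algorithm; constants are arbitrary demo values.
      import hashlib
      import os

      BASE = 257                     # polynomial base for the rolling hash
      MOD = 1 << 32                  # hash modulus
      WINDOW = 64                    # rolling-window width, in bytes
      MIN_CHUNK = 2048               # never cut a chunk smaller than this
      BOUNDARY_MASK = (1 << 13) - 1  # boundary odds ~1/8192: ~8 KiB average chunks
      POW = pow(BASE, WINDOW, MOD)   # multiplier for the byte leaving the window

      def chunk_offsets(data):
          # Yield (start, end) chunk offsets.  Boundaries depend only on a
          # small window of recent bytes, so editing one byte moves only
          # nearby boundaries; the rest of the file still deduplicates.
          data = bytearray(data)     # iterate as ints on CPython 2 and 3 alike
          rolling = 0
          start = 0
          for i in range(len(data)):
              rolling = (rolling * BASE + data[i]) % MOD
              if i >= WINDOW:
                  rolling = (rolling - data[i - WINDOW] * POW) % MOD
              if i + 1 - start >= MIN_CHUNK and (rolling & BOUNDARY_MASK) == BOUNDARY_MASK:
                  yield (start, i + 1)
                  start = i + 1
          if start < len(data):
              yield (start, len(data))

      def chunk_digests(buf):
          # Content address of each chunk; identical chunks hash identically,
          # so only never-before-seen chunks would need storing.
          buf = bytes(bytearray(buf))
          return set(hashlib.sha256(buf[s:e]).hexdigest()
                     for s, e in chunk_offsets(buf))

      if __name__ == '__main__':
          original = os.urandom(1 << 20)    # 1 MiB of random data
          edited = bytearray(original)
          edited[len(edited) // 2] ^= 1     # flip one byte in the middle
          before = chunk_digests(original)
          after = chunk_digests(edited)
          print('%d of %d chunks unchanged' % (len(before & after), len(before)))

- Run standalone, this typically reports that all but one or two of the roughly one hundred
  chunks survive the one-byte mid-file edit - the property that keeps incrementals and
  renames cheap.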
Example use
- Backing up
- Note that the first time you use a save directory (repository), you'll need --init-savedir.
- Back up your root filesystem (absent ZFS, which breaks -xdev), with file-count-based progress, creating
the repository if it does not yet exist:
- find / -xdev -print0 | backshift --save-directory /where/ever/save-directory --backup --subset slash --init-savedir
- To pull from an sshfs (which flattens filesystems into a single filesystem) to a local
  filesystem, without creating the savedir. Note that writing to a remote filesystem is faster
  than reading from one, and that backing up a ZFS could be done analogously:
- cd /ssh/fs/base
- find . -xdev \( \( \
- -path ./sys -o \
- -path ./dev -o \
- -path ./var/run -o \
- -path ./var/lock -o \
- -name .gvfs \) -prune -o -print0 \) | \
- backshift --save-directory /where/ever/save-directory --backup --subset fullsave
- To back up / with a more accurate progress report. This one is based on the lengths of files; the above two are just
based on file counts:
- find / -xdev -print0 | backshift --save-directory /where/ever/save-directory --backup --subset slash --init-savedir --progress-report full+poststat
- To back up / with a more accurate progress report (assumes your find
  supports -printf). This one is also based on the lengths of files, but it's a little faster than the previous example, especially on
  large collections of many small files:
- find / -xdev -printf '%s %p\0' | backshift --save-directory /where/ever/save-directory --backup --subset slash --init-savedir --progress-report full+prestat
- To back up / with a minimal progress report - this one does not do a big inhale of filenames at the beginning:
- find / -xdev -print0 | backshift --save-directory /where/ever/save-directory --backup --subset slash --init-savedir --progress-report minimal
- To back up / with no progress report - this one is usually best in cron jobs:
- find / -xdev -print0 | backshift --save-directory /where/ever/save-directory --backup --subset slash --init-savedir --progress-report none
- This one backs up /movie with a progress report, keeping the progress report pretty accurate despite a previous incomplete backup
by using randomize --preserve-directories. Note that this example splits one logical line into multiple physical lines in the manner
of POSIX shells and *csh, by using backslashes on all but the last line:
- find /movie -xdev -printf '%s %p\0' | \
- ~/src/home-svn/backshift/trunk/randomize -0 -v --preserve-directories --skip-size | \
- /usr/local/pypy-1.4.1/bin/pypy ~/src/home-svn/backshift/tags/0.94/backshift \
- --backup \
- --save-directory /mnt/backshift-incremental-test/save-directory \
- --subset movie \
- --progress-report full+prestat
- Restoring
- Overview of process
- First, locate what backups are available to restore from, using --list-backups, and select the best one, for some definition of "best" ^_^
- Second, locate the files you require within that backup, using --list-backup --backup-id
- Third, use "--produce-tar --starting-directory | tar xvfp -" to extract the files
- Strictly speaking, you can use --produce-tar with a pipe to "tar tvf -" in the second step
too, but it's much slower.
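- In generic form, with a placeholder save-directory, backup id and path (real commands with
  real values appear in the example below):
  - backshift --save-directory /where/ever/save-directory --list-backups
  - backshift --save-directory /where/ever/save-directory --list-backup --backup-id BACKUP-ID --starting-directory /some/dir
  - backshift --save-directory /where/ever/save-directory --produce-tar --backup-id BACKUP-ID --starting-directory /some/dir | tar xvfp -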
- Example restore:
- First we list all backups that finished (the last column is None for an unfinished backup). For the sake of discussion, assume the
  last backup id listed (the 1305609205.38_openindiana_export one) is the "best" one:
- # ~/src/home-svn/backshift/trunk/backshift --save-directory /mnt/backshift-incremental-test/save-directory --list-backups | awk ' { if ($4 != "None") print }' | sort
- 1305581966.39_openindiana_export_mon-may-16-14-39-26-2011_6244d94b726da6c6 Mon-May-16-14:39:26-2011 2 Mon-May-16-14:39:26-2011
- 1305583872.56_openindiana_export_mon-may-16-15-11-12-2011_8cfbd6e4f5d87142 Mon-May-16-15:11:12-2011 2 Mon-May-16-15:11:13-2011
- 1305609181.37_openindiana_slash_mon-may-16-22-13-01-2011_04be24c2e608ec32 Mon-May-16-22:13:01-2011 160326 Tue-May-17-13:18:34-2011
- 1305609205.38_openindiana_export_mon-may-16-22-13-25-2011_20abd67bf8d07db3 Mon-May-16-22:13:25-2011 17177 Tue-May-17-04:12:41-2011
- Next we identify what file we need; the directory portion of the path shown is what --starting-directory wants below:
- # ~/src/home-svn/backshift/trunk/backshift --save-directory /mnt/backshift-incremental-test/save-directory --list-backup --backup-id 1305609205.38_openindiana_export_mon-may-16-22-13-25-2011_20abd67bf8d07db3 2>&1 | egrep -i 'xz.*local-script'
- -rw-r--r-- strombrg/staff 249 2011-05-16 10:32 export/home/strombrg/src/xz/local-script
- Note that in the preceding step, if we had already known the directory but not the filename, we
  could have used the following, which is much faster:
- # ~/src/home-svn/backshift/trunk/backshift --save-directory /mnt/backshift-incremental-test/save-directory --list-backup --backup-id 1305609205.38_openindiana_export_mon-may-16-22-13-25-2011_20abd67bf8d07db3 --starting-directory /export/home/strombrg/src/xz 2>&1
- -rw-r--r-- strombrg/staff 216 2011-05-16 10:07 export/home/strombrg/src/xz/Notes
- -rw-r--r-- strombrg/staff 626 2011-05-16 10:41 export/home/strombrg/src/xz/last-archives
- -rw-r--r-- strombrg/staff 249 2011-05-16 10:32 export/home/strombrg/src/xz/local-script
- drwxr-xr-x strombrg/staff 0 2011-05-16 10:11 export/home/strombrg/src/xz/old/
- -rw-r--r-- strombrg/staff 1023720 2011-04-01 03:11 export/home/strombrg/src/xz/xz-5.0.2.tar.bz2
- -rw-r--r-- strombrg/staff 1270541 2011-04-12 03:49 export/home/strombrg/src/xz/old/xz-5.1.1alpha.tar.gz
- Finally, we extract the file we want:
- # ~/src/home-svn/backshift/trunk/backshift --save-directory /mnt/backshift-incremental-test/save-directory --backup-id 1305609205.38_openindiana_export_mon-may-16-22-13-25-2011_20abd67bf8d07db3 --starting-directory /export/home/strombrg/src/xz --produce-tar | tar xvf - export/home/strombrg/src/xz/local-script
- export/home/strombrg/src/xz/local-script
- Note that during the restore, backshift didn't write to your filesystem; tar did.
- Confirming to an extent that the software is working correctly - here we restore to a pipe and, without writing to disk, use GNU tar's --diff
  option to compare what comes out of backshift with what's on the system:
- ~/src/home-svn/backshift/trunk/backshift \
- --save-directory $(pwd) \
- --produce-tar \
- --backup-id 1307122264.51_benchbox_test_fri-jun--3-10-31-04-2011_8bb8b332b1b34ea4 | \
- (cd / && tar --diff)
Copying a backshift repo from one place to another (appropriate when your backup storage server needs to be upgraded).
The -t option is important, because mtimes need to be preserved to get accurate expiration (though -a
already implies -t); the -l is superfluous, since -a implies it too:
- rsync -avplt --delete /mnt/backshift-temporary/ /mnt/backshift-production/
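- Afterwards, an rsync dry run with checksums can confirm the copy completed faithfully (slow, since
  it rereads everything on both sides; -n makes it a dry run, -c compares checksums):
  - rsync -navc --delete /mnt/backshift-temporary/ /mnt/backshift-production/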