This is a backup application, written in Python.  It works a bit like an
rsync-based backup script, but it's intended to be used solely for backups.

Features include:

1) Ability to deduplicate data using variable-sized, content-based blocking
   (see the chunking sketch after this list)
   a) Just renaming a 1.5 gigabyte file doesn't cause a second copy of that
      same file to be stored, unlike rsync-based schemes
   b) Storing 3 copies of your family JPEGs on 3 different computers results
      in a single copy of the pictures being stored in the repository
   c) Changing one byte in the middle of a 6 gigabyte file doesn't result in
      a distinct copy of the whole file in the repository
2) Ability to compress deduplicated chunks with xz compression (planned, not
   yet fully implemented)
3) Ability to expire old data for the repo as a whole (planned, not yet
   implemented - not down to a host or filesystem granularity, just
   repo-granularity)
4) Safe, concurrent operation over local or remote filesystems, including
   but not limited to NFS, CIFS and sshfs (could use a little improvement
   with temp files and renames though).  The only operation that isn't
   (designed to be) concurrent-safe is expiring old files.
5) No big filelist inhale at the outset, unless you request a progress
   report during a backup - similar to rsync in this regard
6) Hybrid fullsaves/incrementals, much like what one gets with an rsync
   --link-dest backup script - so an interrupted backup can, in a
   significant sense, be resumed later
7) Ability to not mess up your buffer cache during a backup (planned, not
   yet fully implemented)
8) Far fewer directory entries than a year's worth of daily snapshots from
   an rsync-based backup script would produce
9) Copying a backup repository with 1 year of daily snapshots from one host
   to another is far more practical with backshift than with rsync
   --link-dest
10) Input files are selected in a manner similar to cpio, using GNU find
    with -print0
11) Output is created in GNU tar format; a restore is a matter of piping
    backshift's tar output into a tar process for extraction.  This means
    there's no restore application to worry about race conditions in, other
    than tar itself
12) No temporary files are necessary for backups or restores; even a system
    with (nearly?) full disks can be backed up
13) Runs on a wide assortment of Python interpreters, including CPython 2.x
    (with or without Cython, with or without Psyco), CPython 3.x (with or
    without Cython), and PyPy.  Of these, PyPy is by far the fastest, though
    with a little more tuning, it's possible the Cythonized versions will
    improve significantly.  Also runs on Jython Release_2_5maint -r 7288,
    but not on Jython 2.5.2 itself; IOW, you need to check out Jython and
    build it yourself.
14) The backup process is cautious about symlink races, at least if the
    Python interpreter has an os.fstat (notably, Jython does not have an
    os.fstat; CPython 2 and 3, and PyPy, do have os.fstat).  See the sketch
    after this list.
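To make feature 1 a little more concrete, here is a minimal sketch of
variable-sized, content-based blocking using a Rabin-Karp-style rolling
hash.  It is not backshift's actual chunking code; the window size, base,
boundary mask and the choice of SHA-256 as the deduplication key are all
assumptions made for this sketch.

    import hashlib

    BASE = 257            # rolling-hash base (arbitrary for this sketch)
    MOD = 1 << 32         # rolling-hash modulus (arbitrary for this sketch)
    WINDOW = 48           # bytes in the rolling window (arbitrary)
    MASK = (1 << 13) - 1  # chunk boundary when the low 13 bits are all ones

    def chunk(data):
        """Split data at positions chosen by the content itself, so inserting
        or changing a byte only disturbs the chunk boundaries near it."""
        data = bytearray(data)       # iteration yields ints on both 2.x and 3.x
        pow_w = pow(BASE, WINDOW, MOD)
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling = (rolling * BASE + byte) % MOD
            if i >= WINDOW:
                # Drop the byte that just left the window
                rolling = (rolling - data[i - WINDOW] * pow_w) % MOD
            if i + 1 - start >= WINDOW and (rolling & MASK) == MASK:
                chunks.append(bytes(data[start:i + 1]))
                start = i + 1
        if start < len(data):
            chunks.append(bytes(data[start:]))
        return chunks

    def chunk_digests(data):
        """Deduplication key per chunk: identical chunks hash identically, so
        a renamed 1.5 gigabyte file, or a third copy of the same photo, adds
        no new chunk data to the repository."""
        return [hashlib.sha256(c).hexdigest() for c in chunk(data)]

Because the boundaries depend only on the bytes inside the window, changing
one byte in the middle of a 6 gigabyte file re-chunks only the region around
the change; the remaining chunks, and their digests, are unchanged and
therefore already present in the repository.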
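Similarly, the symlink-race caution in feature 14 follows the usual
lstat-open-fstat pattern.  The following is only an illustrative sketch of
that pattern, not backshift's exact code:

    import os
    import stat

    def open_without_symlink_race(path):
        """Open path read-only, refusing to follow a symlink and verifying
        that the file we opened is the same inode we lstat'ed, so the path
        cannot be swapped for a symlink between the two system calls."""
        before = os.lstat(path)
        if stat.S_ISLNK(before.st_mode):
            raise OSError('refusing to follow symlink: %s' % path)
        fd = os.open(path, os.O_RDONLY)
        if hasattr(os, 'fstat'):   # Jython has no os.fstat; check skipped there
            after = os.fstat(fd)
            if (before.st_dev, before.st_ino) != (after.st_dev, after.st_ino):
                os.close(fd)
                raise OSError('%s changed between lstat and open '
                              '(possible symlink race)' % path)
        return fd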
Misfeatures:

1) There's currently no way for a user to restore their own files without
   requiring excessive trust in users; the administrator needs to get
   involved.
2) During a backup, users can see each other's files; data is not saved in
   an encrypted format
3) It could conceivably be nice to have host- or filesystem-granularity on
   expires, but this would require quite a bit more metadata to be saved
4) Disk-to-disk only - disk-to-tape is not supported

----------------------------------------------------------------------

About using backshift:

For a backup:

Example 1: To do a backup of a system's root filesystem, to a filesystem on
that same system, this should work:

    find / -xdev -print0 | backshift --save-directory /where/ever/save-directory --backup

Of course, you don't want /where/ever/save-directory to be in the root
filesystem!

Example 2: To pull from an sshfs, to a local filesystem (but writing to a
remote filesystem is faster than reading from one):

    cd /ssh/fs/base
    find . -xdev \( \( \
            -path ./sys -o \
            -path ./dev -o \
            -path ./var/run -o \
            -path ./var/lock -o \
            -name .gvfs \) -prune -o -print0 \) | \
            backshift --save-directory /where/ever/save-directory --backup --init-savedir --subset fullsave

One uses the -path's and -prune because sshfs doesn't distinguish between
the different filesystems of the machine you're pulling from, so they're all
one filesystem to find's -xdev.  The -name .gvfs is pruned because it causes
problems, so we avoid it.

If a backup takes forever to say it's inhaled 10,000 filenames, there's a
good chance you've used -print instead of -print0.

If you have a huge filesystem to back up, and inhaling the whole list of
files would overwhelm your VM system, use --no-stats.  This turns off the
progress report during the backup, but should take much less VM.

For a restore:

First, locate what backups are available to restore from, using
--list-backups.

Second, locate the files within that backup you require, using --list-backup
--backup-id.

Third, use --produce-tar --starting-directory | tar xvfp -

Strictly speaking, you can use --produce-tar with a pipe to "tar tvf -" in
the second step too, but it's much slower.

----------------------------------------------------------------------

BTW, about the statistics listed during a backup:

1) It's assumed that all files take the same amount of time to process, on
   average (see the sketch after this list).  Doing so isn't as accurate as
   something like considering the number of bytes each file uses, but this
   is simpler in more ways than one.
2) So if you have one directory with 500 movies about programming in Python,
   and another directory with 500 text files containing cooking recipes,
   then the statistics generated will be pretty far off.  If the movies are
   backed up first, then initially it's going to expect all the recipes to
   take the same amount of time the movies did.  By the end of the second
   500, it should have a pretty clear idea of the average duration per file.
3) One way of dealing with this is to use the "randomize" script that
   appears in this directory.  You can use it as a filter between your
   "find -print0" and your "backshift".  Make sure to give it the -0 option.
   In this way, you'll roughly alternate between movie files and recipes.
   Randomizing the order of the files can be expected to make a backup take
   a little longer (because various directory caches will miss a lot), but
   the statistics should be more accurate.  It could even take a lot longer
   if you have a lot of large directories.
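As a concrete illustration of point 1, a progress estimate built on the
equal-time-per-file assumption looks roughly like the following.  This is a
sketch, not backshift's actual statistics code, and the class and method
names are made up:

    import time

    class Progress(object):
        """Estimate time remaining by assuming every file takes the average
        time observed so far - simple, but skewed if all the big files
        (movies) come before all the small ones (recipes)."""

        def __init__(self, total_files):
            # total_files is known up front only because the filename list
            # was inhaled for the progress report
            self.total_files = total_files
            self.files_done = 0
            self.start_time = time.time()

        def tick(self):
            self.files_done += 1

        def eta_seconds(self):
            if not self.files_done:
                return None
            elapsed = time.time() - self.start_time
            seconds_per_file = elapsed / self.files_done
            return seconds_per_file * (self.total_files - self.files_done)

With the movies-first ordering described in point 2, seconds_per_file starts
out movie-sized, so the early estimates for the recipes are far too
pessimistic; randomizing the input order (point 3) evens out the average
much sooner.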
----------------------------------------------------------------------

About choice of Python runtime:

1) Backshift runs, unmodified, on:
   A) CPython 2.[567]
   B) CPython 3.[012]
   C) PyPy 1.4.x (much like CPython 2.5.x) and 1.5 (much like CPython 2.7.x)
   D) Jython Release_2_5maint -r 7288 (it is known to not work on Jython
      2.5.2, but a bugfix was checked into 2.5 maint shortly after the 2.5.2
      release that enables backshift on Jython)
2) Backshift has some issues on Jython 2_5maint -r 7288:
   A) Jython has no os.fstat, so the fstat verification is turned off when
      running on Jython.  This means symlink races are possible.  IOW,
      running backshift on Jython as root is not extremely secure.
   B) Jython has no os.major or os.minor, so backing up device files is
      impractical (short of spawning an "ls" subprocess or similar)
3) Backshift doesn't run at all on IronPython 2.6 Beta 2, at least not the
   Ubuntu 11.04 version; its idea of a standard library is to just put
   CPython's modules on its import path.  FePY attempts to include a
   standard library that works with IronPython, but I haven't tried that,
   having found no prepackaged version of FePY for Ubuntu.

This enables one to select the fastest runtime that one trusts for running
backshift (unless one is a big IronPython fan).

There is one advantage that appears to be unique to CPython 3.2 and PyPy 1.5
(so far): their os.stat().st_mtime gives microsecond resolution, not just
100ths of seconds like the others.  This means that incrementals can be more
precise - which would be nice in the rare event that a change is made to a
file, it's backed up, and then another change is made, all within the same
100th of a second.  However, the code is currently written to always check
mtimes to a precision of 0.01, to get the floats to compare correctly (see
the sketch below).
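Comparing mtimes at that fixed precision might look like the following
sketch (illustrative only, not backshift's exact code):

    def mtimes_match(saved_mtime, current_mtime, precision=0.01):
        # Round both timestamps to hundredths of a second before comparing,
        # so a runtime that reports microseconds (CPython 3.2, PyPy 1.5) and
        # one that reports hundredths of a second agree about whether a file
        # has changed.
        return round(saved_mtime / precision) == round(current_mtime / precision)

----------------------------------------------------------------------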