+ Interpreters
    + CPython 2.x
    + CPython 3.x
    + PyPy
    + Jython
    - FePy? Close to IronPython, but reportedly has a better standard library.
- comparison
    - check into:
        - BackupPC
            - Nice article about BackupPC behind the scenes, including dedup:
                - http://stanlarson.com/wordpress/?p=118
        - SONAS (IBM on GPFS)
        - OpenDedup
        - DataDomain
        - Veritas NetBackup
+ Documentation
    + Convert .ps to .pdf
- performance tuning
    - parallelism
        - dedup?
        - compression?
    - profile it again
    - rolling_checksum_pyx_mod could benefit from quite a bit more optimization, in light of the HTML report; CPython might even overtake PyPy in performance with some Cython tuning. In short, there's too much yellow in the report.
/ backups
    - I suspect that when a backup contains no hardlinks (windows), no hardlinks/backup-id directory is created, causing a restore to traceback with "No such file or directory"
    - filesystem size estimates! Right now, we always assume 1,000,000
    - We're getting "stat" calls for things that shouldn't be stat'ed?
    - traceback on a disappearing file:
        - data_file_handle = os.open(filename, os.O_RDONLY)
        - OSError: [Errno 2] No such file or directory
    - SIGHUP to flush to disk and exit?
    - SIGUSR1 to flush to disk?
    - Some sort of heartbeat might be nice during a backup of a large file, particularly in light of NFS's and sshfs's tendency to silently hang
/ exotic file attributes
    + setuid
    + setgid
    + sticky
    - POSIX ACL's
    - Linux capabilities (?)
    - Linux xattr metadata?
    - Windows streams?
    - MacOS resource forks?
    - MacOS xattr metadata?
+ A --compare mode would be nice, so that you can get a list of files that've changed since a particular backup.
    + Then again, ISTR gtar supports something like this, and we produce gtar output. If it works on a pipe, it might provide the benefit.
    + Yes, it turns out that GNU tar can do --diff from a pipe, so this is a freebie :)
+ on incrementals, don't check the mtime alone. Check the file size too.
+ Might try using db_mod.error for the exception in cacher_mod
+ ETA's are off in full progress mode (this isn't broken - some clocks were)
+ "fast" backup needs to touch chunks for accurate expiration!
+ temp file and rename for chunk files, to achieve greater concurrency safety on hashed files
+ Fix the chdir situation. cd once and stay there; don't hop all over.
+ Fix the list vs str path stuff. We should probably just use paths, with os.path.dirname as needed
+ Compression
    + Put compression info at the beginning of the .data files; leave the .time file alone.
        + The .time files should be single-purpose, because they are the only part that needs to change during an incremental.
    + use xz_mod for now; later use the standard library's xz module.
        + possibly use the bz2 module, because on windows exec'ing xz is going to be slow.
            + Not going to. No need.
    + write xz_mod
    + Compressing data chunks in the content hierarchy
    + Skipping compression of chunks that don't compress well (see the sketch after this section)
    + Uncompressing files and recompressing as chunks
        + Dealing with files that don't uncompress!
    + Compression of metadata too?
    + A compression shell script to run on the server asynchronously? That'd mean we'd only uncompress on the client during a backup, not recompress... I kind of like that.
        + decided against this
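
A minimal sketch of the "skip compression when it doesn't pay" idea above. It uses the standard library's lzma module for xz; compress_chunk_maybe and the plain size comparison are illustrative assumptions, not backshift's actual API:

    import lzma

    def compress_chunk_maybe(chunk_bytes):
        """Return (is_compressed, payload): keep the xz-compressed form only
        if it is actually smaller than the original chunk."""
        compressed = lzma.compress(chunk_bytes)
        if len(compressed) < len(chunk_bytes):
            return (True, compressed)
        # Already-compressed or encrypted data usually grows slightly under
        # xz; store such chunks raw, and record that they are uncompressed.
        return (False, chunk_bytes)

The flag would live in the compression info at the beginning of the .data file, so a restore knows whether to run the chunk back through the decompressor.
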
+ the directory entry (entries dohdbm) caching is a bit aggressive - with huge directories full of many files and few directory entries, it's possible to back up for days without writing any directory entries to speak of. It might be good to flush them to disk every now and then, before it's "needed". This is a cacher_mod thing.
    + Might divide the cache size by the number of relevant savesets; I originally increased it for incrementals, because the new saveset and the 3 old savesets share a single cache
        + did this
    + Might flush cached entries every n seconds, checked on lookups and/or changes in the cached directory list; a treap is nice for this
+ incrementals!
    + saveset_mod
        + saveset_mod -> saveset_summary_mod
        + chdir once, not all over
        + split out "files" operations into their own saveset_files_mod.py
    + for an incremental save, it might work well to pick three prior saves to examine for previous results within the specified subset: the most recent, the one with the most files in it, and the most recent complete save.
        + create Backshift_file.close_enough()
            + decided against this
        + create a backshift_file for the 4 savesets of relevance, and compare with a Backshift_file.close_enough()
        + Put the pieces together
        + test!
+ progress via find . -printf '%s %p\0' ? It seems this would eliminate the need for -cmin +0 too
+ constants_mod
+ summaries/summaries does not exist - rerun with --init-savedir option?
    + rename savesets to summaries
+ 66-rcm-perf needs a proportion for the size of the test and the threshold for "too long"
+ Get the b'' stuff out of the progress report on 3.x
+ Unix domain sockets are skipped correctly, but doing so does strange things to the tty stats output
+ saveset is not written until the 1000th file is processed? That could be pesky on backups with large files.
+ Improve TRY_FSTAT logic so we only hasattr once
+ Need to test some UTF-8 pathnames, and perhaps other encodings
+ revisit __exit__'s and make sure they deal with exceptions well
    + commented out
+ rather than files/dir-whatever/files, it probably should be files/dir-whatever/entries.
+ backup id's have colons in them...
    + No, backup id's don't, but timestamps in the --list-backups report do.
+ when writing a content .time, the preceding directories are chdir'd individually. The initial open already uses a full path, though.
+ Try stat'ing the content directory first - only mkdir if this fails
    + make sure we don't traceback if two things try to mkdir at the same time (see the sketch after this section)
+ if files, contents and savesets don't exist, ask before creating them - unless a magic option is given
+ Save the number of files actually in the saveset - as distinct from the number of files intended to go into the saveset
+ add a --subset for backup id's, in order to get a better idea of what to start from on a resumed save
+ Figure out why an interrupted save would have both a start and a finish time in savesets
+ if the hostname is localhost.localdomain, error out and tell the user to specify a true hostname somehow
+ if you get a permission denied when lstat'ing a file during a backup, don't traceback
+ file types
    + Directories
    + Symlinks
    + Character and block device files
    + Sockets - ignored
    + fifos
    + We won't really know how well this is working until we have restores and some automated tests therewith
+ Hostnames in backup id
+ Save username and group name, not just uid and gid
+ Completed save marker
    + last time of presence in the contents hierarchy, not a separate hierarchy, with -'d prefixes
+ Directory prefixes (to allow a directory named "files" or "files.db" or whatever)
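
A minimal sketch of the race-tolerant mkdir noted above: two backup processes can both find a content directory missing and both try to create it, and the loser must not traceback. ensure_directory is a hypothetical helper name, not backshift's actual code:

    import errno
    import os

    def ensure_directory(path):
        """Stat first (the cheap, common case), mkdir on failure, and
        tolerate losing a mkdir race with another process."""
        if os.path.isdir(path):
            return
        try:
            os.makedirs(path)
        except OSError as os_error:
            # Someone else created it between our check and our mkdir.
            if os_error.errno != errno.EEXIST:
                raise

On 3.2 and up, os.makedirs(path, exist_ok=True) covers the same race, but the errno check also works on the 2.x interpreters listed at the top of this file.
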
+ listing all available backups
    - --list-backups thinks it needs to create new backup id's?
    + basic implementation
    + unit test
    + sorting
        + If the user wants it sorted, they can sort it
+ Directory listing
+ listing the files in a backup
    + regular files
    + Directories
    + Symlinks
    + Character and block device files
    + Sockets - ignored
    + fifos
    + starting from an arbitrary directory within a backup
    + hardlinks
/ Restores!
    + from an arbitrary starting directory
    + tar output - renaming on the fly
    + file types
        + directories
        + regular files
            + plain
            + hardlinked
                + A Bloom Filter might be very nice for detecting hardlinks (see the sketch at the end of this file)
        + symlinks
        + fifo's
        + device files: character, block
- Expiration
    - of old data
    - of old metadata
        - might be nice to make these separate, since metadata takes much less room
- optional libodirect
    - support in CPython 2.x should be straightforward
    - Might need some tweaking for 3.x; libodirect was never tested there
    - PyPy
        - Does PyPy cooperate with SWIG?
        - Maybe a ctypes-based interface to libodirect for PyPy?
- Client/server operation
    - Authentication
        - Host
        - User
    - Concurrency
    - Encryption for transfers
    - Renaming a host
    - Users
+ misc internal
    + remove my_split.py
    + Split Repo out of backshift_file_mod into its own file
    + add treap.py to the documentation list
    + jython via ctypes fstat (or finding that java's open is fstat'ing for us)
        + Not going to do this, due to Jython's use of unicode in 2.x
        + Not going to do this until Jython has better ctypes support
    + gdbm via ctypes (for pypy, and maybe jython too)
        + Skipped for dohdbm: sort out why gdbm_ctypes is giving gibberish filenames in pypy but not cpython 2.x or 3.x
    + figure out why there's file content in files/1289715016.78-benchbox-Sat_Nov_13_22_10_16_2010-b1bb981f35a41bd0/usr/src/linux-headers-2.6.35-22/include/linux/sunrpc/files.db and fix
        + This was apparently dbm.py-related - with my gdbm.py module, it doesn't happen.
    + Get the b'' stuff out of the directory prefixes on 3.x
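
A toy sketch of the Bloom filter idea flagged under hardlinked regular files above: a Bloom filter answers "definitely not seen" or "possibly seen" for a (st_dev, st_ino) pair, so only the "possibly seen" paths need exact (and more expensive) hardlink bookkeeping. The sizes, the sha256-derived hashes, and the BloomFilter class itself are illustrative assumptions, not backshift's actual implementation:

    import hashlib
    import struct

    class BloomFilter(object):
        def __init__(self, num_bits=2 ** 20, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            # Derive num_hashes bit positions from one sha256 of the key.
            digest = hashlib.sha256(key).digest()
            for hashno in range(self.num_hashes):
                (value, ) = struct.unpack('>I', digest[hashno * 4:hashno * 4 + 4])
                yield value % self.num_bits

        def add(self, key):
            for position in self._positions(key):
                self.bits[position // 8] |= 1 << (position % 8)

        def __contains__(self, key):
            return all(self.bits[position // 8] & (1 << (position % 8))
                       for position in self._positions(key))

    seen = BloomFilter()
    key = b'2049:131075'    # made-up st_dev:st_ino pair
    if key in seen:
        pass                # possibly a hardlink: fall back to an exact check
    seen.add(key)

False positives just mean an occasional unnecessary exact lookup; false negatives can't happen, so no hardlink is ever missed.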