A modern backup system design :)

Should have:

1) A database of hash -> backend storage filename. In fact, we might even just name files by their hashes - but compressed with whatever compression algorithm works well. We might even be able to use "file" to decide which compression algorithm to use (see the layout sketch below).
2) A database mapping each backed-up machine's filenames to hashes. This one'll get pretty big. Or should it be one database per host?
3) On the machines to be backed up, we should go through the files and make sure their hashes haven't changed unless their mtimes have too - otherwise, we probably have disk corruption (a rough check is sketched below).
4) A means of using O_DIRECT, if the OS supports it, while doing the hashes on the client machines (also sketched below).

If we do use hashes as filenames, we'll need a directory hierarchy that's named by parts of the hash, say two, three or four hex digits per directory level.

Should there be a mapping from:

1) hostname to base255
2) base255 to hostname
3) hash to base255
4) base255 to hash

...to save space? Would it really save that much space? Yeah, it probably would.

In python, naturally. Is there any other choice? :)

We'll need to store somewhere what compression algorithms were used... We'll perhaps want some way of managing the tradeoff between hard compression and fast compression as it relates to bandwidth... E.g., rzip packs pretty hard, but takes forever... We should be able to compress with gzip, bzip2, rzip or cat (noop)... (there's a small sketch of this below too).
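A few rough sketches of the pieces above. First, the hash-named storage layout, assuming SHA-256 and two hex digits per directory level (store_path and file_digest are hypothetical names, nothing final):

    import hashlib
    import os

    def file_digest(path, chunk_size=1 << 20):
        """SHA-256 of a file's contents, read in chunks to bound memory use."""
        digest = hashlib.sha256()
        with open(path, 'rb') as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def store_path(store_root, digest, levels=2, digits=2):
        """Fan the hash out into a shallow directory tree, e.g.
        <root>/ab/cd/abcdef..., so no single directory gets huge."""
        parts = [digest[i * digits:(i + 1) * digits] for i in range(levels)]
        return os.path.join(store_root, *parts, digest)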
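And a rough version of the corruption check from item 3 - if the hash changed but the mtime didn't, something is probably wrong with the disk. check_file is a made-up name; recorded_mtime and recorded_digest would come from the per-host database:

    import os

    def check_file(path, recorded_mtime, recorded_digest):
        """Classify a file as unchanged, legitimately modified, or suspect."""
        current_mtime = os.stat(path).st_mtime
        current_digest = file_digest(path)  # from the sketch above
        if current_digest == recorded_digest:
            return 'unchanged'
        if current_mtime == recorded_mtime:
            # Content changed but the timestamp didn't - likely corruption.
            return 'possible corruption'
        return 'modified'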
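For item 4, something along these lines might work on Linux. O_DIRECT wants an aligned buffer, and an anonymous mmap is page-aligned; where os.O_DIRECT doesn't exist, the flag just drops out and we get a normal buffered read. A sketch only, not a tested implementation:

    import hashlib
    import mmap
    import os

    def hash_file_direct(path, block_size=1 << 20):
        """Hash a file while trying to bypass the page cache via O_DIRECT."""
        flags = os.O_RDONLY | getattr(os, 'O_DIRECT', 0)
        fd = os.open(path, flags)
        buf = mmap.mmap(-1, block_size)  # page-aligned, as O_DIRECT requires
        digest = hashlib.sha256()
        try:
            while True:
                count = os.readv(fd, [buf])
                if count <= 0:
                    break
                digest.update(buf[:count])
                if count < block_size:
                    break  # short read on a regular file means EOF
        finally:
            buf.close()
            os.close(fd)
        return digest.hexdigest()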
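Finally, a tiny sketch of the pluggable compression side - gzip and bzip2 straight from the stdlib, "cat" as the noop. rzip isn't in the table because it wants seekable files rather than streams, so it would have to be run via subprocess against a temporary file instead. The helper returns the algorithm name alongside the data so the choice can be recorded with the stored file:

    import bz2
    import gzip

    # In-memory compressors keyed by the name we'd record in the database.
    COMPRESSORS = {
        'gzip': gzip.compress,
        'bzip2': bz2.compress,
        'cat': lambda data: data,  # noop
    }

    def compress(data, algorithm):
        """Compress one chunk, returning (algorithm, bytes) so the choice
        of algorithm is stored next to the result."""
        return algorithm, COMPRESSORS[algorithm](data)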