rolling_checksum_mod is a set of three modules (two required) for chopping up files into chunks for deduplication.

The chunks are content-based and variable-length.

The point of the algorithm is that if you have, for example, a 4 gigabyte file and you insert one byte at a random place in it, most of the chunks will remain the same, even those after the inserted byte.
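To make that property concrete, here is a minimal sketch of content-based chunking in pure Python. It is not the algorithm or the API of rolling_checksum_mod: the function name split_chunks, the polynomial rolling hash, and the window, divisor, base and modulus values are all assumptions chosen for illustration. The idea it shows is that a chunk boundary depends only on the last few bytes seen, so an insert only disturbs the chunks near it.

```python
import hashlib
import random


def split_chunks(data, window=48, divisor=1024, base=257, modulus=(1 << 31) - 1):
    """Split data wherever a rolling hash of the last `window` bytes hits a target.

    Illustrative only: the hash and every parameter here are assumptions,
    not the ones rolling_checksum_mod uses.
    """
    top = pow(base, window - 1, modulus)  # weight of the oldest byte in the window
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the byte that just fell out of it.
            rolling = (rolling - data[i - window] * top) % modulus
        rolling = (rolling * base + byte) % modulus
        # Cut after this byte when the hash hits the target value;
        # on random data that happens about once per `divisor` bytes.
        if rolling % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def digests(chunks):
    """Return the set of SHA-256 digests of the chunks, as a dedup store might key them."""
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks}


# Build repeatable pseudo-random data, then insert one byte in the middle.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(100_000))
middle = len(original) // 2
modified = original[:middle] + b"!" + original[middle:]

before = digests(split_chunks(original))
after = digests(split_chunks(modified))
print(len(before), "chunks before;", len(before & after), "unchanged after the insert")
```

With fixed-size blocks, every block after the inserted byte would shift and hash differently. Here the boundaries after the insert land on the same content as before, so only the chunk or two around the insert should get new digests.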

The algorithm has been in use in production for years, and is considered stable. However, the packaging of the modules implementing it for PyPI is new (2021-04-23).

The three modules are:

I'd like to stress that rolling_checksum_pyx_mod is not needed for speed on PyPy3, and may actually make things slower.

To install rolling_checksum_mod and rolling_checksum_py_mod for pypy3:

To install rolling_checksum_mod, rolling_checksum_py_mod and rolling_checksum_pyx_mod for python3:

Please note that on some systems rolling_checksum_mod is faster on PyPy3, and on other systems it is faster on CPython+Cython. It's well worth comparing the two on your own system.
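One way to make that comparison is to run the same small timing script under each interpreter. The sketch below is only a harness: best_time and placeholder_chunker are hypothetical names, and placeholder_chunker is a stand-in workload that you would replace with a call into the chunking module you want to measure.

```python
import sys
import time


def best_time(func, data, repeats=3):
    """Return the fastest wall-clock time of func(data) over several runs."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - started)
    return min(timings)


def placeholder_chunker(data):
    # Stand-in workload (an assumption for illustration);
    # replace it with a call into the chunking module under test.
    total = 0
    for byte in data:
        total = (total * 31 + byte) & 0xFFFFFFFF
    return total


payload = bytes(range(256)) * 4096  # 1 MiB of repeatable test data
seconds = best_time(placeholder_chunker, payload)
print(sys.implementation.name, sys.version.split()[0], f"{seconds:.3f} s")
```

Run it once with pypy3 and once with python3, using the same payload, and compare the printed times. Taking the minimum of a few runs reduces noise from other processes, and on PyPy3 it also gives the JIT a chance to warm up.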

(Cython transpiles .pyx files to .c, which a C compiler can then compile into a C extension module for CPython to use.)


Timestamp: 2024-04-28 11:31:01 PDT
