rolling-checksum-mod

rolling_checksum_mod is a set of three modules (two required) for chopping up files into chunks for deduplication.

The chunks are content-based and variable-length.

The point of the algorithm is that if you have, for example, a 4 gigabyte file and you insert one byte somewhere in it, most of the blocks will remain the same, even those after the inserted byte.
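To make that property concrete, here is a minimal sketch of content-defined chunking in Python. It is not the algorithm rolling_checksum_mod actually implements: the plain byte-sum checksum, the WINDOW and MASK values, and the names chunk and count_changed_chunks are all assumptions chosen for illustration.

```python
import hashlib
import random

WINDOW = 16   # bytes in the rolling window (assumed value, for illustration)
MASK = 0xFF   # cut where the low 8 bits of the window sum are zero (assumed)


def chunk(data):
    """Split data into variable-length chunks at content-defined boundaries."""
    chunks = []
    start = 0
    window_sum = 0
    for i, byte in enumerate(data):
        window_sum += byte                  # newest byte enters the window
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]  # oldest byte leaves the window
        # The cut decision depends only on the last WINDOW bytes, never on
        # the absolute offset, so boundaries realign after an insertion.
        if i + 1 >= WINDOW and (window_sum & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # trailing partial chunk
    return chunks


def count_changed_chunks(old, new):
    """Count chunks of new whose SHA-256 digest does not occur among old's chunks."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    return sum(1 for c in chunk(new) if hashlib.sha256(c).digest() not in known)


# Demo: insert a single byte into the middle of 64 KiB of pseudo-random data.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(65536))
middle = len(original) // 2
modified = original[:middle] + b"!" + original[middle:]

print("chunks in original:", len(chunk(original)))
print("chunks changed by the insertion:", count_changed_chunks(original, modified))
```

Because each boundary depends only on the bytes inside the window, only the chunks near the inserted byte can differ. With fixed-size blocks, every block after the insertion would shift and change.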

The algorithm has been in use in production for years and is considered stable. However, the PyPI packaging of the modules that implement it is new (2021-04-23).

The three modules are:

Name of module	Required?	Suitable for PyPy3?	Suitable for CPython?	What does it do?
rolling_checksum_mod	Yes	Yes	Yes	Tries to import rolling_checksum_pyx_mod. If that fails, it imports rolling_checksum_py_mod
rolling_checksum_py_mod	Yes	Yes	Yes, but it's slow	Provides the blocking algorithm in pure Python
rolling_checksum_pyx_mod	No	No	Yes	Provides the blocking algorithm in Cython for speed
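The fallback described for rolling_checksum_mod is a common import pattern. The following is a sketch of that pattern, not the package's actual source; the helper name import_first_available is hypothetical.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that imports successfully."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed (or failed to load); try the next one
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(module_names)
    )


# Preference order taken from the table: Cython build first, pure Python second.
BACKEND_PREFERENCE = ("rolling_checksum_pyx_mod", "rolling_checksum_py_mod")
```

Calling import_first_available(BACKEND_PREFERENCE) would return the Cython module where it is installed and the pure Python module otherwise.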

I'd like to stress that rolling_checksum_pyx_mod is not needed for speed on PyPy3, and may actually make things slower.

To install rolling_checksum_mod and rolling_checksum_py_mod for pypy3:

pypy3 -m pip install rolling_checksum_py_mod # this includes rolling_checksum_mod too

To install rolling_checksum_mod, rolling_checksum_py_mod and rolling_checksum_pyx_mod for python3:

python3 -m pip install rolling_checksum_py_mod # this includes rolling_checksum_mod too
python3 -m pip install rolling_checksum_pyx_mod

Please note that on some systems rolling_checksum_mod is faster on PyPy3, and on other systems it is faster on CPython with Cython. It is worth benchmarking both on your own hardware.

(Cython transpiles .pyx files to .c files, which a C compiler then builds into a C extension module for CPython to use.)
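One way to compare the two setups is to run the same small timing script under both pypy3 and python3. The sketch below assumes nothing about the rolling_checksum_mod API: placeholder_chunker is a hypothetical stand-in that you would replace with whatever call you actually use from the module, and best_time is likewise an illustrative helper.

```python
import platform
import sys
import time


def best_time(func, data, repeats=3):
    """Return the fastest wall-clock time, in seconds, of func(data) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best


def placeholder_chunker(data):
    # Hypothetical stand-in (fixed 4096-byte slices); swap in the real
    # chunking call from rolling_checksum_mod before drawing conclusions.
    return [data[i:i + 4096] for i in range(0, len(data), 4096)]


if __name__ == "__main__":
    payload = bytes(1024 * 1024)  # 1 MiB of zero bytes as sample input
    seconds = best_time(placeholder_chunker, payload)
    print(platform.python_implementation(), sys.version.split()[0], seconds)
```

Running the script once per interpreter prints the implementation name next to its best time, which makes the two results easy to compare side by side.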

Timestamp: 2025-12-25 06:14:08 PST
