What language | Version of the program | The license under which it's distributed | Comments |
--- | --- | --- | --- |
Latest (Third) Python version | 3 | Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option | |
Second Python version | 2 | Copyright (C) The University of California, Irvine - it specifically is not available under any version of the GPL. | |
First Python version | 1 | Copyright (C) The University of California, Irvine - it specifically is not available under any version of the GPL. | One of my first Python programs - and it shows |
Perl version | 2 | Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option | |
Java version | 2 | Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option | |
C++ version | 2 | Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option | |
OCaml version (planned) | 3 | N/A | |
Standard ML version (planned) | 3 | N/A | |
Clojure version (planned) | 3 | N/A | |
Haskell version (in progress) | 3 | N/A | O(n*log2(n)), because Haskell's map/hash/dict type is O(log2(n)) rather than O(1) |
Scala version (planned) | 3 | N/A | |
Erlang version (planned) | 3 | N/A | |
Elixir version (planned) | 3 | N/A | |
Eiffel version (planned) | 3 | N/A | |
Golang version (planned) | 3 | N/A | |
C# (Mono) version (planned) | 3 | N/A | |
Object Pascal version (planned) | 3 | N/A | |
Rust version? | 3 | N/A | |
Swift version? (maybe after the Linux port happens) | 3 | N/A | |
But classify, for all its really nice, convenient options, is a bit slow on large collections of files, and (years ago) I wanted to look at a collection of "fortune cookie" messages and eliminate duplicates. First, I added a "-" option to classify, so it could read a list of files from stdin and get around the argv limit. But it was still pretty slow going for this (at the time) large collection of files on my poor little Linux machine. So to handle those cookie files, I wrote a similar program that I've named equivs. Like classify, equivs is O(c*n^2), but unlike classify, equivs uses a sort of incremental MD5 digesting algorithm to get a far smaller "c", frequently giving a much better run time at the expense of having far fewer convenient options.
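The incremental digesting idea, roughly, is to compare files a chunk at a time and stop at the first chunk whose digests differ, so most non-duplicates are told apart after reading only a short prefix. Here is a minimal Python sketch of that idea - it is not the actual equivs code, and the chunk size and function name are just illustrative:

```python
import hashlib

CHUNK = 64 * 1024  # illustrative chunk size, not necessarily what equivs uses


def probably_same(path_a, path_b):
    """Compare two files chunk by chunk via MD5, stopping at the first mismatch."""
    with open(path_a, 'rb') as file_a, open(path_b, 'rb') as file_b:
        while True:
            block_a = file_a.read(CHUNK)
            block_b = file_b.read(CHUNK)
            if hashlib.md5(block_a).digest() != hashlib.md5(block_b).digest():
                return False   # early out: the files already differ in this chunk
            if not block_a and not block_b:
                return True    # both files exhausted without a differing chunk
```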
But for years, it'd been in the back of my mind that I should redo my equivs program so it would have a running time of O(n*log(n)). And I finally did it. The new program is called equivs2.
The run time has dropped from O(n log n) to O(n + m log m), where n is the total number of files and m is the number of files in the longest hashbucket that suffers an MD5 collision - IOW, it's normally O(n), because MD5 collisions are rare.
For a collection of n duplicate files there will typically be about 2n full file reads (unless the files are hard linked, in which case things are faster), and unique files are typically read at most once (often they won't need to be read fully even once).
As with equivs2, MD5 is trusted to tell files apart, but files with the same MD5 hash are not assumed to be equal; they almost certainly are, but it remains possible that they are not.
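To illustrate the hashbucket approach, here is a minimal Python sketch - not equivs3 itself, and the function names are made up for the example. Duplicates can be grouped in roughly O(n) time by bucketing files by size and then by MD5 digest, so files with a unique size are never read at all and every other file is read once:

```python
import collections
import hashlib
import os


def md5_of(path, chunk=1024 * 1024):
    """Return the hex MD5 digest of a whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for block in iter(lambda: handle.read(chunk), b''):
            digest.update(block)
    return digest.hexdigest()


def duplicate_groups(paths):
    """Group paths into lists of probable duplicates: bucket by size, then by MD5."""
    by_size = collections.defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                # a file with a unique size has no duplicates
        by_hash = collections.defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        groups.extend(bucket for bucket in by_hash.values() if len(bucket) > 1)
    return groups
```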
Algorithmically speaking, this is the fastest version on this page despite (or you could even say "because of") the implementation language; its asymptotic running time is better.
This version deals better with files that disappear or change during a run than the equivs2 versions do, because it uses a merge sort specially tailored to handle disappearing files and duplicate files well.
equivs3e runs on Python 2.x and Python 3.x, at a modest performance penalty in 2.x. equivs3d only runs on 2.x.
This version (equivs3) outperforms the C++ version of equivs2. :)
Note that equivs3 is slower on PyPy than on CPython.
However, lacking the convenience options of classify doesn't always have to be a show stopper. Sometimes what I do is use find and tr to create a shadow directory that holds a modified copy of each file I'm interested in, with an identical directory hierarchy, and then run equivs or equivs2 on that shadow hierarchy. You then get a bunch of relative paths that apply to either directory hierarchy, with the net effect being the same as with classify - actually more flexible, but also more time consuming.
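Here is a minimal Python 3 sketch of that shadow-directory trick; the lowercasing, whitespace-collapsing transform is just an example stand-in for whatever find and tr would do, and the function name is made up:

```python
import os


def build_shadow(src_root, shadow_root):
    """Mirror src_root under shadow_root, writing a normalized copy of each file."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(shadow_root, rel), exist_ok=True)
        for name in filenames:
            with open(os.path.join(dirpath, name), 'rb') as src:
                data = src.read()
            # Example transform: lowercase and collapse whitespace runs.
            normalized = b' '.join(data.lower().split())
            with open(os.path.join(shadow_root, rel, name), 'wb') as dst:
                dst.write(normalized)
```

After building the shadow tree, you'd run equivs or equivs2 on shadow_root; the relative paths it reports apply equally to the original hierarchy.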
equivs2 usage is like:
$ equivs2 -h
Usage: /Dcs/seki/strombrg/bin/equivs2 [-v] [-s] [-0] [-h] [-f file1 file2 ... filen]
    -v says to operate verbosely
    -s says to get filenames from stdin instead of from the command line
    -0 says that when getting filenames from stdin, assume null termination, not newline termination
       Also changes the default output delimeter to a null byte, because some shells cannot handle that
    -h says to give this help message
    -f file1 file2 ... filen says to use the listed files, not files from stdin
    -p bytes says to cache this many bytes of each file for faster comparisons (may be 0)
    -d delim says to use "delim" as the output delimeter character within a line
    -c says to not do full comparisions - instead trust hash comparisons. Faster at the expense of some accuracy
seki-strombrg:~ i386-redhat-linux-gnu 18834 - above cmd done Sat Apr 22 10:24 PM
$
It has Makefiles for both OpenJDK and gcj. The OpenJDK Makefile will probably work fine with Sun (Oracle) Java.
Note that you may have to install an en_US.ISO-8859-1 locale on your system for the Java version to work. In Java (and perhaps other languages), en_US.ISO-8859-1 is a locale that is guaranteed to give good roundtrip filename conversions from 8 bit to 16 bit characters and back to 8 bit; the Java runtime reads 8 bit filenames from stdin, converts them to 16 bit internally, and then converts them back to 8 bit again for opening. You can install the appropriate locale with "apt-get install locales-all" on a Debian 7.0 system.
It was fun doing the java version, though the locale stuff was a larger hurdle than usual.
The Java version doesn't have the device/inode number optimization; Java doesn't appear to expose that information, probably because not all OSes have it.
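For reference, here is a minimal Python sketch of what that optimization does (not code from any of these programs): two paths with the same (device, inode) pair are hard links to the same file, so they can be treated as duplicates without reading any data.

```python
import os


def same_inode(path_a, path_b):
    """True when both paths are hard links to (or names for) the same file."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```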
2013-08-09: Passes FindBugs and Checkstyle.
Below you can see equivs3e outperforming fdupes on 2 out of 3 hierarchies. The two where equivs3e was faster held large files: movies (/movie) and music (/home/dstromberg/Sound). The one where fdupes was faster (/usr/local) held mostly small files: .py's, .pyc's and .pyo's.
You can e-mail the author with questions or comments: