|What language||The license under which it's distributed|
|Latest Python version||Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option|
|Original Python version||Copyright (C) The University of California, Irvine - it is specifically not available under any version of the GPL.|
|Perl version||Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option|
|Java version||Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option|
|C++ version||Copyright (C) Dan Stromberg, available under the GPL, v2, or any later version, at your option|
|Haskell version (planned)|
|Eiffel/Sather version (planned)|
But classify, for all its nice, convenient options, is a bit slow on large collections of files, and (years ago) I wanted to look at a collection of "fortune cookie" messages and eliminate duplicates. First, I added a "-" option to classify so it could read a list of files from stdin, to get around the argv limit. But it was still pretty slow going for this (at the time) large collection of files on my poor little Linux machine. So to handle those cookie files, I wrote a similar program I've named equivs. equivs, like classify, is O(c*n^2), but unlike classify, equivs uses a sort of incremental MD5 digesting algorithm to get a far smaller "c", frequently giving much better run time at the expense of having far fewer convenient options.
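The incremental digesting idea can be sketched roughly like this. This is a hypothetical illustration, not equivs's actual code: the function names `prefix_digests` and `differ_early` and the chunk size are my own choices. The point is that two differing files can usually be told apart after hashing only a short prefix, rather than digesting both files in full.

```python
import hashlib
from itertools import zip_longest


def prefix_digests(path, chunk_size=64 * 1024):
    """Yield the MD5 hex digest of each successive prefix of a file.

    The digest object is updated chunk by chunk, so each yielded value
    covers everything read so far - this is the "incremental" part.
    """
    digest = hashlib.md5()
    with open(path, "rb") as file_:
        while True:
            chunk = file_.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            yield digest.hexdigest()


def differ_early(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files prefix-digest by prefix-digest, stopping at the
    first mismatch, so differing files usually cost only a little I/O."""
    for dig_a, dig_b in zip_longest(prefix_digests(path_a, chunk_size),
                                    prefix_digests(path_b, chunk_size)):
        if dig_a != dig_b:  # also catches one file ending before the other
            return True
    return False  # every prefix digest matched; almost certainly equal
```

With files that share a long common prefix but differ near the end, the comparison still stops at the first chunk whose digests disagree.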
But for years, it'd been in the back of my mind that I should redo my equivs program so it would have a running time of O(n*log(n)). And I finally did it. The new program is called equivs2.
The run time has dropped from O(n*log(n)) to O(n + m*log(m)), where n is the total number of files and m is the number of files in the longest hash bucket that suffers an MD5 collision - IOW, it's normally O(n), because MD5 collisions are rare.
For a collection of n duplicate files, there will typically be about 2n full file reads (unless the files are hard linked, in which case things are faster), and unique files are typically read at most once (often they won't need to be read fully even once).
As with equivs2, MD5 is trusted to tell files apart, but when files have the same MD5 hash, they are not assumed to be equal; they almost certainly are, but it remains possible that they are not.
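The bucket-then-verify approach described above can be sketched as follows. This is my own minimal illustration, not the program's actual code (the names `md5_of` and `group_duplicates` are invented, and whole files are read into memory for brevity): hashing every file into buckets costs O(n), and a byte-for-byte check is needed only within buckets holding more than one file.

```python
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=64 * 1024):
    """Return the MD5 digest of a whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as file_:
        for chunk in iter(lambda: file_.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def group_duplicates(paths):
    """Bucket files by MD5, then confirm equality byte-for-byte within
    each bucket rather than trusting the hash alone.  Returns a list of
    groups; each group's files are byte-identical."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[md5_of(path)].append(path)

    groups = []
    for same_hash in buckets.values():
        remaining = list(same_hash)
        while remaining:
            leader = remaining.pop(0)
            with open(leader, "rb") as file_:
                leader_bytes = file_.read()  # fine for a sketch; real code would stream
            group = [leader]
            rest = []
            for path in remaining:
                with open(path, "rb") as file_:
                    if file_.read() == leader_bytes:
                        group.append(path)
                    else:
                        rest.append(path)  # an actual MD5 collision
            remaining = rest
            groups.append(group)
    return groups
```

In the common case every bucket holds byte-identical files, so the inner verification loop does one extra pass per group and no more.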
Algorithmically speaking, this is the fastest version on this page despite (or you could even say "because of") the implementation language; its asymptotic running time is better.
This version deals better with files that disappear or change during a run than the equivs2 versions do, because it uses a merge sort specially tailored to cope with disappearing files and duplicate files.
equivs3e runs on python 2.x and python 3.x, at a modest performance penalty in 2.x. equivs3d only runs on 2.x.
This version (equivs3) outperforms the C++ version of equivs2. :)
Note that equivs3 is slower on pypy than on cpython.
However, lacking the convenience options of classify doesn't always have to be a show stopper. Sometimes what I do is use find and tr to create a shadow directory that has a modified copy of each file I'm interested in, with an identical directory hierarchy, and then use equivs or equivs2 on that directory hierarchy. You then get a bunch of relative paths that apply to either directory hierarchy, with the same net effect as classify - actually more flexible, but also more time consuming.
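The shadow-directory trick could be sketched in Python like this (rather than with find and tr). This is a hypothetical helper - the name `make_shadow_tree` and the `transform` callback are my own - that mirrors a source tree under a destination root, writing a transformed copy of each file at the same relative path, so duplicate groups reported for the shadow tree apply equally to the original.

```python
import os


def make_shadow_tree(src_root, dst_root, transform):
    """Mirror src_root's directory hierarchy under dst_root, writing
    transform(contents) of each regular file at the same relative path.

    transform takes and returns bytes, e.g. bytes.lower for a
    case-insensitive duplicate hunt.
    """
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as file_:
                data = file_.read()
            with open(os.path.join(dst_root, rel, name), "wb") as file_:
                file_.write(transform(data))
```

One would then run equivs or equivs2 on dst_root; the relative paths it prints are valid in src_root too.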
equivs2 usage is like:
$ equivs2 -h
Usage: /Dcs/seki/strombrg/bin/equivs2 [-v] [-s] [-0] [-h] [-f file1 file2 ... filen]
-v says to operate verbosely
-s says to get filenames from stdin instead of from the command line
-0 says that when getting filenames from stdin, assume null termination, not newline termination.
   Also changes the default output delimiter to a null byte, because some shells cannot handle that
-h says to give this help message
-f file1 file2 ... filen says to use the listed files, not files from stdin
-p bytes says to cache this many bytes of each file for faster comparisons (may be 0)
-d delim says to use "delim" as the output delimiter character within a line
-c says to not do full comparisons - instead trust hash comparisons. Faster at the expense of some accuracy
$
It has Makefiles for both OpenJDK and gcj. The OpenJDK Makefile will probably work fine with Sun (Oracle) Java.
Note that you may have to install an en_US.ISO-8859-1 locale on your system for it to work; this is because in Java (and perhaps other languages), en_US.ISO-8859-1 is a locale that is guaranteed to give good roundtrip filename conversions from 8-bit to 16-bit characters and back to 8-bit, as the Java runtime reads 8-bit filenames from stdin, converts them to 16 bits internally, and then converts them back to 8 bits for opening. You can install the appropriate locale with "apt-get install locales-all" on a Debian 7.0 system.
It was fun doing the java version, though the locale stuff was a larger hurdle than usual.
The Java version doesn't have the device/inode number optimization; Java doesn't appear to expose that, probably because not all OSes have it.
2013-08-09: Passes findbugs and checkstyle.
You can e-mail the author with questions or comments: