set-arithmetic is a shell-callable Python script for doing set math. It's similar to comm, but I find it simpler and a little more powerful.

It treats files as sets, with one set element per line. Naturally, order doesn't matter.
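As a rough illustration of the "one set element per line" idea, here is a minimal sketch of loading a file into a Python set. The helper name read_set is hypothetical, and stripping only the trailing newline is an assumption; the real script may handle whitespace or encodings differently.

```python
def read_set(path):
    """Return the lines of the file at `path` as a set, one element per line.

    Hypothetical helper, not taken from set-arithmetic itself.
    """
    with open(path, encoding="utf-8") as handle:
        # Strip only the trailing newline, so other whitespace stays significant.
        # Duplicate lines collapse into one element, and line order is discarded.
        return {line.rstrip("\n") for line in handle}
```

Because the result is an ordinary set, a file containing "b", "a", "b" on three lines yields the same set as one containing "a", "b".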

For --union, --intersection, --difference, --symmetric-difference and --fuzzy-match, the result is written to stdout. For the others, the result is in the exit code.
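The split between "result on stdout" and "result in the exit code" could be sketched as below. The option names come from the usage text; the dispatch tables, the run function, the sorted output order, and the use of exit status 1 for "false" are all assumptions made for illustration, not the script's actual code. The only fixed convention is the shell's: exit status 0 means true.

```python
import sys

# Options whose result is a set, written to stdout (layout is an assumption).
OUTPUT_OPS = {
    "--union": lambda a, b: a | b,
    "--intersection": lambda a, b: a & b,
    "--difference": lambda a, b: a - b,
    "--symmetric-difference": lambda a, b: a ^ b,
}

# Options whose result is a yes/no answer, reported via the exit status.
PREDICATE_OPS = {
    "--is-subset": lambda a, b: a <= b,
    "--is-superset": lambda a, b: a >= b,
    "--is-proper-subset": lambda a, b: a < b,
    "--is-proper-superset": lambda a, b: a > b,
    "--is-equal": lambda a, b: a == b,
    "--is-unequal": lambda a, b: a != b,
}


def run(option, a, b, out=sys.stdout):
    """Apply `option` to sets a and b and return a shell exit status."""
    if option in OUTPUT_OPS:
        # Sorting is only for reproducible output; sets have no inherent order.
        for element in sorted(OUTPUT_OPS[option](a, b)):
            out.write(element + "\n")
        return 0
    if option in PREDICATE_OPS:
        # Shell convention: 0 means true. Using 1 for false is an assumption.
        return 0 if PREDICATE_OPS[option](a, b) else 1
    raise ValueError(f"unknown option: {option}")
```

A caller would pass the returned value to sys.exit, which is what lets the predicate options be used directly in a shell if statement.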

$ ./set-arithmetic
below cmd output started 2022 Fri Mar 18 07:53:57 PM PDT
Usage: ./set-arithmetic
	--union file1 file2                   write the union of the two files to stdout
	--intersection file1 file2            write the intersection of the two files to stdout
	--difference file1 file2              write the difference of the two files to stdout
	--symmetric-difference file1 file2    write the symmetric difference of the two files to stdout
	--is-subset file1 file2               exit true if file1 is a subset of file2
	--is-superset file1 file2             exit true if file1 is a superset of file2
	--is-proper-subset file1 file2        exit true if file1 is a proper subset of file2
	--is-proper-superset file1 file2      exit true if file1 is a proper superset of file2
	--is-equal file1 file2                exit true if file1 is equal to file2
	--is-unequal file1 file2              exit true if file1 is not equal to file2
	--fuzzy-match file1 file2             output 0.0 for no overlap, 1.0 for all overlap, 0 < n < 1.0 for partial overlap

This command treats files as sets, one element per line.
All output is to stdout or the exit status.
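The usage text gives the range of the --fuzzy-match score but not its formula. One common measure with exactly that range is the Jaccard index (size of the intersection divided by size of the union). The sketch below uses it purely as an assumption; the script may compute its score differently, and the function name and the handling of two empty sets are also made up for illustration.

```python
def fuzzy_match(a, b):
    """Return a similarity score between 0.0 and 1.0 for two sets.

    Uses the Jaccard index as an ASSUMED stand-in; the real formula
    behind --fuzzy-match is not stated in the usage text.
    """
    union = a | b
    if not union:
        # Two empty sets: treated as identical here (an arbitrary choice).
        return 1.0
    return len(a & b) / len(union)
```

Under this assumption, disjoint sets score 0.0, identical sets score 1.0, and {"a", "b"} against {"b", "c"} scores one third, since one element is shared out of three distinct elements.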

About running time and memory/disk requirements:

It's actually a benefit of comm+sort that it can consume extra disk space, because it means your data doesn't have to fit entirely in virtual memory. It's also worth keeping in mind that GNU sort is remarkably well optimized, even though sorting is O(n log n).
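To make the memory point concrete, here is a minimal sketch of why sorted input helps. Once both inputs are sorted and free of duplicates (as after sort -u), an intersection can be streamed with only one element from each side in memory at a time, which is roughly the idea behind comm. The function name is hypothetical, and this is an illustration of the technique, not code from comm or from set-arithmetic.

```python
def sorted_intersection(left, right):
    """Yield elements common to two sorted, duplicate-free iterables.

    Only the current element of each input is held in memory, so the
    inputs can be lazily read file objects of any size.
    """
    left_iter, right_iter = iter(left), iter(right)
    sentinel = object()  # marks an exhausted input
    a = next(left_iter, sentinel)
    b = next(right_iter, sentinel)
    while a is not sentinel and b is not sentinel:
        if a == b:
            yield a
            a = next(left_iter, sentinel)
            b = next(right_iter, sentinel)
        elif a < b:
            # a cannot appear later in `right`, so skip it.
            a = next(left_iter, sentinel)
        else:
            b = next(right_iter, sentinel)
```

Python compares strings by code point, so this sketch assumes the inputs were sorted the same way (for example with LC_ALL=C sort). An approach based on in-memory sets avoids the sorting step but needs both inputs to fit in memory.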


Timestamp: 2024-12-27 08:40:22 PST
