set-arithmetic is a shell-callable Python script for doing set math. It's similar to comm, but I find it simpler and a little more powerful.

It treats files as sets, with one set element per line. Naturally, order doesn't matter.

For --union, --intersection, --difference, --symmetric-difference and --fuzzy-match, the result is written to stdout. For the others, the result is in the exit code.
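As a rough illustration of the "files as sets" idea, here is a minimal Python sketch. It is not the script's actual source: the helper names (read_set, write_result, subset_exit_status) are hypothetical, and sorting the output is an assumption made only so the result is repeatable.

```python
import sys
import tempfile
from pathlib import Path


def read_set(path):
    """Load a file as a set: one element per line, trailing newline dropped."""
    with open(path, encoding="utf-8") as handle:
        return {line.rstrip("\n") for line in handle}


def write_result(elements, out=sys.stdout):
    """Write one element per line; sorted only to make the output repeatable."""
    for element in sorted(elements):
        out.write(element + "\n")


def subset_exit_status(left, right):
    """Shell convention: 0 means true (left is a subset of right), 1 means false."""
    return 0 if left <= right else 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        file1 = Path(tmp, "file1")
        file2 = Path(tmp, "file2")
        # Line order and repeated lines do not matter once the file is a set.
        file1.write_text("pear\napple\napple\n")
        file2.write_text("apple\nbanana\npear\n")
        a, b = read_set(file1), read_set(file2)
        write_result(a | b)  # union
        write_result(a & b)  # intersection
        write_result(b - a)  # difference (elements only in file2)
        write_result(a ^ b)  # symmetric difference
        print("subset exit status:", subset_exit_status(a, b))
```

The four stdout-style operations map directly onto Python's set operators, while a yes/no question such as "is file1 a subset of file2" becomes an exit status of 0 or 1.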

$ ./set-arithmetic
below cmd output started 2022 Fri Mar 18 07:53:57 PM PDT
Usage: ./set-arithmetic
    --union file1 file2                   write the union of the two files to stdout
    --intersection file1 file2            write the intersection of the two files to stdout
    --difference file1 file2              write the difference of the two files to stdout
    --symmetric-difference file1 file2    write the symmetric difference of the two files to stdout
    --is-subset file1 file2               exit true if file1 is a subset of file2
    --is-superset file1 file2             exit true if file1 is a superset of file2
    --is-proper-subset file1 file2        exit true if file1 is a proper subset of file2
    --is-proper-superset file1 file2      exit true if file1 is a proper superset of file2
    --is-equal file1 file2                exit true if file1 is equal to file2
    --is-unequal file1 file2              exit true if file1 is not equal to file2
    --fuzzy-match file1 file2             output 0.0 for no overlap, 1.0 for all overlap, 0 < n < 1.0 for partial overlap

This command treats files as sets, one element per line.
All output is to stdout or the exit status.

About running time and memory/disk requirements:

| Tool | Running time | Space (RAM) | Space (Disk) |
| comm alone - rarely used without sorting both inputs | O(n+m) | O(1) | O(n+m) - because the inputs come from files |
| comm + sort - a common combination | O(nlogn+mlogm) | O(n+m) - can spill over to disk for huge inputs | O(n+m) - because the inputs come from files, plus possible sort temporaries |
| set-arithmetic - does not need sorting | O(n+m) | O(n+m) | O(n+m) - because the inputs come from files |

It's actually a benefit of comm+sort that extra disk space can be consumed, because it means your data doesn't have to fit
entirely in virtual memory. Also, it's worth keeping in mind that GNU sort is pretty amazingly well optimized,
despite being O(nlogn).
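To make the tradeoff concrete, here is a small Python sketch of the two strategies, using intersection as the example. Both function names are hypothetical and neither is taken from comm or set-arithmetic; the merge version assumes its inputs are already sorted and free of duplicates.

```python
def intersect_hashed(lines_a, lines_b):
    """Hash-set approach: O(n+m) expected time, O(n+m) memory, no sorting needed."""
    return set(lines_a) & set(lines_b)


def intersect_sorted(sorted_a, sorted_b):
    """Merge approach in the style of comm: a single pass over two sorted,
    duplicate-free inputs, holding only the current element of each."""
    iter_a, iter_b = iter(sorted_a), iter(sorted_b)
    a, b = next(iter_a, None), next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            yield a
            a, b = next(iter_a, None), next(iter_b, None)
        elif a < b:
            a = next(iter_a, None)  # a is too small to match anything left in b
        else:
            b = next(iter_b, None)  # b is too small to match anything left in a


if __name__ == "__main__":
    first = ["pear", "apple", "fig"]
    second = ["fig", "kiwi", "apple"]
    # The merge version needs sorted input; the hashed version does not.
    assert intersect_hashed(first, second) == set(intersect_sorted(sorted(first), sorted(second)))
```

The hashed version must hold both inputs in memory at once, while the merge version can stream them, which is why the sorting cost buys the ability to handle data larger than RAM.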


Timestamp: 2024-09-11 20:54:49 PDT
