• This software is owned by The University of California, Irvine, and is not distributed under any version of the GPL. The GPL is a fine series of licenses, but the owners of the software need it to be distributed under these terms.
    highest is a program that finds the highest numbers in a sequence of lines containing numbers.
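As a rough illustration of the idea (not the actual source of highest), the sketch below keeps the m largest leading numbers seen so far in a small min-heap, so memory use stays proportional to m rather than to the length of the input. The function name `highest_lines`, the default of 10, and the choice to skip lines that do not begin with a number are all assumptions made for this example.

```python
import heapq
from typing import Iterable, List, Tuple


def highest_lines(lines: Iterable[str], m: int = 10) -> List[Tuple[float, str]]:
    """Return the m lines with the largest leading numbers, largest first.

    Hypothetical helper for illustration; not taken from highest itself.
    """
    if m <= 0:
        return []
    heap: List[Tuple[float, int, str]] = []  # min-heap holding at most m entries
    for index, line in enumerate(lines):
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        try:
            value = float(fields[0])
        except ValueError:
            continue  # skip lines that do not start with a number
        entry = (value, index, line.rstrip("\n"))
        if len(heap) < m:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the smallest one kept so far; swap it in.
            heapq.heapreplace(heap, entry)
    return [(value, text) for value, _, text in sorted(heap, reverse=True)]
```

Passing `sys.stdin` as `lines` would let a function like this sit at the end of a pipeline that emits one number per line, such as the output of `du -s`. Each input line costs at most O(log(m)) work, so the whole pass is O(n*log(m)).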

  • Usage:
  • A comparison of some different methods of finding the largest file in a directory hierarchy.
    Notation: n is the number of files, m is the number of files to keep (n is assumed to be much larger than m), and c is a (usually) large constant.

    1. find /mtpt -type f -print | xargs du -s | sort -nr | head -10
       Tools: traditional *ix find and xargs (newlines are a problem)
       Suited to: small to medium number of files
       Commentary: Very simple, but poor accuracy (filenames with newlines are troublesome) and poor scalability
       O notation: O(n*log(n)+m)
       Concurrency: Poor - sort is most likely going to read the whole list before starting to sort

    2. find /mtpt -type f -print0 | xargs -0 du -s | sort -nr | head -10
       Tools: GNU find and xargs (to work on files with newlines in them)
       Suited to: small to medium number of files
       Commentary: Pretty simple, good accuracy, but poor scalability
       O notation: O(n*log(n)+m)
       Concurrency: Poor - sort is most likely going to read the whole list before starting to sort

    3. find /mtpt -type f -size +10000c -print0 | xargs -0 du -s | sort -nr | head -10
       Tools: GNU find and xargs (to work on files with newlines in them)
       Suited to: small to medium number of files
       Commentary: Kind of complex, good accuracy, and a little better scalability if you are able to assume that all your numbers will be over 10000
       O notation: O((n/c)*log(n/c)+m)
       Concurrency: Poor - sort is most likely going to read the whole list before starting to sort

    4. find /mtpt -type f -print0 | xargs -0 du -s | highest
       Tools: GNU find and xargs (to work on files with newlines in them)
       Suited to: large number of files
       Commentary: More complex, but pretty good accuracy and scalability
       O notation: O(n+n*log(m))
       Concurrency: Good - highest should do its processing interleaved well with the processing done by the find and du

    5. find /mtpt -type f -size +10000c -print0 | xargs -0 du -s | highest
       Tools: GNU find and xargs (to work on files with newlines in them)
       Suited to: huge number of files
       Commentary: Pretty complex, but pretty good accuracy and excellent scalability, assuming you know all the largest numbers will be above 10,000
       O notation: O(n/c+n*log(m))
       Concurrency: Good - highest should do its processing interleaved well with the processing done by the find and du
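For comparison with the pipelines above, here is a minimal single-process sketch of the same task: walk a directory tree and keep only the m largest regular files. It is an illustration, not part of highest. The name `largest_files` is hypothetical, and sizes come from `st_size` in bytes, which will not match the block counts that `du -s` reports. Because paths never pass through a line-oriented pipe, filenames containing newlines need no special handling.

```python
import heapq
import os
import stat
from typing import Iterator, List, Tuple


def largest_files(top: str, m: int = 10) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, path) for the m largest regular files under top.

    Hypothetical helper for illustration only.
    """

    def sizes() -> Iterator[Tuple[int, str]]:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    info = os.lstat(path)  # lstat: do not follow symlinks
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if stat.S_ISREG(info.st_mode):
                    yield info.st_size, path

    # nlargest consumes the generator lazily and keeps at most m entries,
    # so the full list of files is never held in memory or sorted.
    return heapq.nlargest(m, sizes())
```

Calling `largest_files("/mtpt", 10)` would correspond roughly to the fourth method above, with the same O(n*log(m)) selection step.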
    Here's a graph comparing the performance of highest and GNU sort. As you can see, highest never beats GNU sort in the ranges compared, but:
    1. GNU sort stops working. This is because GNU sort requires disk space to do its work; highest doesn't.
    2. Looking at the graph, you can see a slight tendency for the GNU sort line to cross the highest line. I believe that before long, highest would be outperforming GNU sort significantly - but only for pretty large datasets and only if GNU sort has enough disk space to work.




