highest is a program that finds the highest
numbers in a sequence of lines containing numbers.
When a filesystem fills up, a System Administrator is likely
to want to get a report listing the 100 (or so) largest files in
the filesystem. This won't always pinpoint the source of the
problem, but commonly will. This is often done using a command
find /mntpt -type f -print0 | xargs -0 du -k | sort -nr | head -100
highest is a faster way of doing that same thing. EG:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-heap
Generally speaking, when a filesystem fills up, programs start
messing up some files by writing and not checking error returns, or
by not using temporary files prior to destroying the old copy of
a file. So when a filesystem fills up, it's good to fix it ASAP.
Also, on truly large filesystems like GFS or Lustre, you may have
so many files that GNU sort starts using temporary files to do the
sorting, slowing things down and possibly even compounding the
problem you're trying to fix; highest has no such issues.
Also, highest just seemed like a natural place to put my treap
module for testing the treap implementation I did in Python/Cython.
Although a heap seems like a more natural fit (a heap should use
a bit less memory), the treap actually
seems to outperform the heap. If you have no other use for treaps,
using a treap with highest may not give enough of a performance boost
to justify the time to install the module. EG:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-treap
$ ./highest --help
Usage: ./highest [-n numtokeep] [-f fieldno] [-a] [-r] [-h] [--filename fn]
-n n specifies the number of numbers to keep
-f n says to compare on the nth field (0 based)
-a says to sort in ASCII order, rather than numerically
-r says to sort in reverse
-s sent says if a field cannot be converted to a float,
to use sentinel value "sent" instead when sorting
-h says to give usage information (help)
--disallow-duplicates (or just -d) says to only keep one value of a given size
--filename fn says to read from file fn
--bufsize n says to read using a buffer size of n; only meaningful with --filename
Reads from stdin by default, but that's considerably slower than --filename /dev/stdin
A comparison of some different methods of finding the largest files; n is the number of files checked, and m is the number of files
desired in the report. Typically, n will be quite a bit larger than m.
GNU sort | tail
Fast for small lists. A remarkably good sort
highest with bisect.insort
This is pretty good. It outpaces heapq for a while, but then it suddenly skyrockets
highest with heapq
Pretty fast even for small lists, and doesn't require any nondefault modules. This is the default option.
Here is a pair of graphs (in PDF format) describing how the various
As a quick reminder, we're looking at n numbers, and
keeping the m best values (lines):
Portions of this software are owned by
The University of California, Irvine -
and are not under any version of the GPL. GPL is a fine series of licenses, but the partial owners of the software need
it to be distributed under these terms.