highest is a program that finds the lines with the highest
numbers in a sequence of input lines.
Motivation:
When a filesystem fills up, a system administrator is likely
to want a report listing the 100 (or so) largest files in
the filesystem. This won't always pinpoint the source of the
problem, but it often does. This is commonly done using a command
like:
find /mntpt -type f -print0 | xargs -0 du -k | sort -nr | head -100
highest is a faster way of doing the same thing, e.g.:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-heap
Generally speaking, when a filesystem fills up, programs start
corrupting files, either by writing without checking error returns or
by not writing to a temporary file before destroying the old copy of
a file. So it's good to fix a full filesystem ASAP.
Also, on truly large filesystems like GFS or Lustre, you may have
so many files that GNU sort starts using temporary files to do the
sorting, slowing things down and possibly even compounding the
problem you're trying to fix; highest has no such issues.
Also, highest just seemed like a natural place to put my treap
module to work, testing the treap implementation I did in Python/Cython.
Although a heap seems like a more natural fit (a heap should use
a bit less memory), the treap actually seems to outperform the heap.
If you have no other use for treaps, though, using a treap with highest
may not give enough of a performance boost to justify the time to
install the module, e.g.:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-treap
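To make the default method concrete, here is a minimal sketch of the idea
behind the heapq option (illustrative only, not the actual highest source;
the function name keep_highest is invented for the example). The trick is to
keep a min-heap of at most numtokeep (value, line) pairs, so each input line
costs at most O(log2(numtokeep)) to process:

    import heapq
    import sys

    def keep_highest(lines, numtokeep=100, fieldno=0):
        """Keep the numtokeep lines whose chosen field is largest.

        A min-heap of (value, line) pairs holds the current best lines; a
        new line only displaces the smallest kept entry when its value is
        larger.
        """
        heap = []  # heap[0] is always the smallest kept (value, line) pair
        for line in lines:
            try:
                value = float(line.split()[fieldno])
            except (IndexError, ValueError):
                continue  # the real program offers -s and -a for such lines
            if len(heap) < numtokeep:
                heapq.heappush(heap, (value, line))
            elif value > heap[0][0]:
                heapq.heappushpop(heap, (value, line))
        return sorted(heap, reverse=True)

    if __name__ == '__main__':
        for value, line in keep_highest(sys.stdin, numtokeep=100):
            sys.stdout.write(line)

Fed the output of the du -k pipelines above, something like this only ever
holds numtokeep lines in memory, which is why the storage requirement in the
comparison below stays O(m) no matter how many files find walks over.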
Usage:
$ ./highest --help
Usage: ./highest [-n numtokeep] [-f fieldno] [-a] [-r] [-h] [--filename fn]
-n n specifies the number of numbers to keep
-f n says to compare on the nth field (0 based)
-a says to sort in ASCII order, rather than numerically
-r says to sort in reverse
-s sent says if a field cannot be converted to a float,
to use sentinel value "sent" instead when sorting
-h says to give usage information (help)
--disallow-duplicates (or just -d) says to only keep one value of a given size
--filename fn says to read from file fn
--bufsize n says to read using a buffer size of n; only meaningful with --filename
--use-bisect
--use-treap
--use-heap
--use-rbtree
--use-fibonacci-heap
Reads from stdin by default, but that's considerably slower than --filename /dev/stdin
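For readers wondering how -f, -a, and -s interact, here is one plausible
reading of those options, sketched in Python (this is an assumption made for
illustration, not the actual highest source): pick out a single
whitespace-separated field, compare it as a string under -a, and otherwise
compare it as a float, falling back to the -s sentinel when the conversion
fails.

    def sort_key(line, fieldno=0, ascii_order=False, sentinel=None):
        """Build the comparison key for one input line (illustrative only)."""
        field = line.split()[fieldno]      # -f: nth whitespace-separated field
        if ascii_order:
            return field                   # -a: compare as a string
        try:
            return float(field)            # default: compare numerically
        except ValueError:
            if sentinel is None:
                raise
            return float(sentinel)         # -s: assumed to stand in as a float

The exact precedence between -a and -s in the real program may differ; the
point is just that each line is reduced to a single key before any of the
data structures compared below ever see it.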
A comparison of some different methods of finding the largest files; n is the number of files checked, and m is the number of files
desired in the report. Typically, n will be quite a bit larger than m.
Means: GNU sort | tail
  Running time:     O(n*log2(n)+m)
  Storage required: O(n+m)
  Comments:         Fast for small lists. A remarkably good sort.

Means: highest with bisect.insort
  Running time:     O(n*m)
  Storage required: O(m)
  Comments:         This is pretty good. It outpaces heapq for a while, but
                    then it suddenly skyrockets.

Means: highest with heapq
  Running time:     O(n*log2(m))
  Storage required: O(m)
  Comments:         Pretty fast even for small lists, and doesn't require any
                    nondefault modules. This is the default option.
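For contrast with the heap sketch earlier, here is a similar sketch of the
bisect.insort approach from the comparison above (again illustrative, not the
actual highest source). The binary search itself is cheap, but inserting into
a Python list shifts elements, so each insertion costs O(m) in the worst case;
that is where the O(n*m) running time comes from.

    import bisect

    def keep_highest_bisect(values, numtokeep=100):
        """Keep the numtokeep largest values in a sorted Python list."""
        best = []  # kept values, sorted ascending; best[0] is the smallest
        for value in values:
            if len(best) < numtokeep:
                bisect.insort(best, value)  # O(m): list insertion shifts items
            elif value > best[0]:
                bisect.insort(best, value)  # O(m): list insertion shifts items
                del best[0]                 # drop the old smallest value
        return best

The list version stays competitive while numtokeep is small, since Python list
operations have small constant factors, but the O(m) insertions eventually
dominate, which matches the "suddenly skyrockets" behavior noted above.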
Here is a pair of graphs (in PDF format) describing how the various
data structures perform. As a quick reminder, we're looking at n numbers, and
keeping the m best values (lines):
graph-n.pdf
graph-m.pdf
Portions of this software are owned by
The University of California, Irvine,
and are not under any version of the GPL. The GPL is a fine series of licenses,
but the partial owners of the software need it to be distributed under these terms.
See also recent, which is a more
practical way of doing this for recent files.