highest is a program that finds the lines with the highest
numbers in a sequence of input lines.
Motivation:
When a filesystem fills up, a system administrator is likely
to want a report listing the 100 (or so) largest files in
the filesystem. This won't always pinpoint the source of the
problem, but it often does. This is commonly done using a command
like:
find /mntpt -type f -print0 | xargs -0 du -k | sort -nr | head -100
highest is a faster way of doing the same thing, e.g.:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-heap
Generally speaking, when a filesystem fills up, programs start
corrupting files, either by writing without checking error returns or
by not writing to a temporary file before destroying the old copy of
a file. So it's good to fix a full filesystem ASAP.
Also, on truly large filesystems like GFS or Lustre, you may have
so many files that GNU sort starts using temporary files to do the
sorting, slowing things down and possibly even compounding the
problem you're trying to fix; highest has no such issues.
Also, highest just seemed like a natural place to put my treap
module to work, testing the treap implementation I did in Python/Cython.
Although a heap seems like a more natural fit (a heap should use
a bit less memory), the treap actually seems to outperform the heap.
If you have no other use for treaps, though, using a treap with highest
may not give enough of a performance boost to justify the time to
install the module, e.g.:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-treap
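To make the default method concrete, here is a minimal sketch of the idea
behind the heapq option (illustrative only, not the actual highest source;
the function name keep_highest is invented for the example). The trick is to
keep a min-heap of at most numtokeep (value, line) pairs, so each input line
costs at most O(log2(numtokeep)) to process:

    import heapq
    import sys

    def keep_highest(lines, numtokeep=100, fieldno=0):
        """Keep the numtokeep lines whose chosen field is largest.

        A min-heap of (value, line) pairs holds the current best lines; a
        new line only displaces the smallest kept entry when its value is
        larger.
        """
        heap = []  # heap[0] is always the smallest kept (value, line) pair
        for line in lines:
            try:
                value = float(line.split()[fieldno])
            except (IndexError, ValueError):
                continue  # the real program offers -s and -a for such lines
            if len(heap) < numtokeep:
                heapq.heappush(heap, (value, line))
            elif value > heap[0][0]:
                heapq.heappushpop(heap, (value, line))
        return sorted(heap, reverse=True)

    if __name__ == '__main__':
        for value, line in keep_highest(sys.stdin, numtokeep=100):
            sys.stdout.write(line)

Fed the output of the du -k pipelines above, something like this only ever
holds numtokeep lines in memory, which is why the storage requirement in the
comparison below stays O(m) no matter how many files find walks over.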
Usage:
$ ./highest --help
Usage: ./highest [-n numtokeep] [-f fieldno] [-a] [-r] [-h] [--filename fn]
-n n specifies the number of numbers to keep
-f n says to compare on the nth field (0 based)
-a says to sort in ASCII order, rather than numerically
-r says to sort in reverse
-s sent says if a field cannot be converted to a float,
to use sentinel value "sent" instead when sorting
-h says to give usage information (help)
--disallow-duplicates (or just -d) says to only keep one value of a given size
--filename fn says to read from file fn
--bufsize n says to read using a buffer size of n; only meaningful with --filename
--use-bisect
--use-treap
--use-heap
--use-rbtree
--use-fibonacci-heap
Reads from stdin by default, but that's considerably slower than --filename /dev/stdin
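For readers wondering how -f, -a, and -s interact, here is one plausible
reading of those options, sketched in Python (this is an assumption made for
illustration, not the actual highest source): pick out a single
whitespace-separated field, compare it as a string under -a, and otherwise
compare it as a float, falling back to the -s sentinel when the conversion
fails.

    def sort_key(line, fieldno=0, ascii_order=False, sentinel=None):
        """Build the comparison key for one input line (illustrative only)."""
        field = line.split()[fieldno]      # -f: nth whitespace-separated field
        if ascii_order:
            return field                   # -a: compare as a string
        try:
            return float(field)            # default: compare numerically
        except ValueError:
            if sentinel is None:
                raise
            return float(sentinel)         # -s: assumed to stand in as a float

The exact precedence between -a and -s in the real program may differ; the
point is just that each line is reduced to a single key before any of the
data structures compared below ever see it.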
A comparison of some different methods of finding the largest files; n is the number of files checked, and m is the number of files
desired in the report. Typically, n will be quite a bit larger than m.
Means: GNU sort | tail
  Running time:     O(n*log2(n)+m)
  Storage required: O(n+m)
  Comments:         Fast for small lists. A remarkably good sort.

Means: highest with bisect.insort
  Running time:     O(n*m)
  Storage required: O(m)
  Comments:         This is pretty good. It outpaces heapq for a while, but
                    then it suddenly skyrockets.

Means: highest with heapq
  Running time:     O(n*log2(m))
  Storage required: O(m)
  Comments:         Pretty fast even for small lists, and doesn't require any
                    nondefault modules. This is the default option.
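For contrast with the heap sketch earlier, here is a similar sketch of the
bisect.insort approach from the comparison above (again illustrative, not the
actual highest source). The binary search itself is cheap, but inserting into
a Python list shifts elements, so each insertion costs O(m) in the worst case;
that is where the O(n*m) running time comes from.

    import bisect

    def keep_highest_bisect(values, numtokeep=100):
        """Keep the numtokeep largest values in a sorted Python list."""
        best = []  # kept values, sorted ascending; best[0] is the smallest
        for value in values:
            if len(best) < numtokeep:
                bisect.insort(best, value)  # O(m): list insertion shifts items
            elif value > best[0]:
                bisect.insort(best, value)  # O(m): list insertion shifts items
                del best[0]                 # drop the old smallest value
        return best

The list version stays competitive while numtokeep is small, since Python list
operations have small constant factors, but the O(m) insertions eventually
dominate, which matches the "suddenly skyrockets" behavior noted above.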
Here is a pair of graphs (in PDF format) describing how the various
data structures perform. As a quick reminder, we're looking at n numbers, and
keeping the m best values (lines):
graph-n.pdf
graph-m.pdf
Portions of this software are owned by
The University of California, Irvine,
and are not under any version of the GPL. The GPL is a fine series of licenses,
but the partial owners of the software need it to be distributed under these terms.
See also recent, which is a more
practical way of doing this for recent files.