highest is a program that finds the highest
numbers in a sequence of lines containing numbers.
Motivation:
When a filesystem fills up, a System Administrator is likely
to want to get a report listing the 100 (or so) largest files in
the filesystem. This won't always pinpoint the source of the
problem, but commonly will. This is often done using a command
like:
find /mntpt -type f -print0 | xargs -0 du -k | sort -nr | head -100
highest is a faster way of doing that same thing. EG:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-heap
Generally speaking, when a filesystem fills up, programs start
messing up some files by writing and not checking error returns, or
by not using temporary files prior to destroying the old copy of
a file. So when a filesystem fills up, it's good to fix it ASAP.
Also, on truly large filesystems like GFS or Lustre, you may have
so many files that GNU sort starts using temporary files to do the
sorting, slowing things down and possibly even compounding the
problem you're trying to fix; highest has no such issues.
Also, highest just seemed like a natural place to put my treap
module for testing the treap implementation I did in Python/Cython.
Although a heap seems like a more natural fit (a heap should use
a bit less memory), the treap actually
seems to outperform the heap. If you have no other use for treaps,
using a treap with highest may not give enough of a performance boost
to justify the time to install the module. EG:
find /mntpt -type f -print0 | xargs -0 du -k | highest -n 100 --use-treap
Usage:
dstromberg@benchbox:~/src/svn/svn/highest/trunk$ ./highest --help
Usage: ./highest [-n numtokeep] [-e every] [-f fieldno] [-a] [-r] [-h] [--filename fn]
-n n specifies the number of numbers to keep
-e n says to only look at 1 in every n values
-f n says to compare on the nth field
-a says to sort in ASCII order, rather than numerically
-r says to sort in reverse
-s sent says if a field cannot be converted to a float,
to use sentinel value "sent" instead when sorting
-h says to give usage information (help)
--disallow-duplicates (or just -d) says to only keep one value of a given size
--filename fn says to read from file fn
--bufsize n says to read using a buffer size of n; only meaningful with --filename
--use-bisect
--use-treap
--use-heap
Reads from stdin by default, but that's considerably slower than --filename /dev/stdin
A comparison of some different methods of finding the largest files; n is the number of files checked, and m is the number of files
desired in the report. Typically, n will be quite a bit larger than m.
Means
Running time
Storage required
Comments
GNU sort | tail
O(n*log2(n)+m)
O(n+m)
Fast for pretty small lists
highest with bisect.insort
O(n*m)
O(m)
There's not really any reason to use this
highest with heapq
O(n*log2(m))
O(m)
Pretty fast on large lists, and doesn't require any nondefault modules
Quite fast on large lists, but requires the treap module
In the spirit of full disclosure, I should mention that part of the
reason the treap version is faster than the heap version, is that the
treap version is able to use a tuple without loss of reverse ordering,
while the heap version uses a class.
Here is a pair of graphs (in postscript format) describing how highest with
a treap compares to highest with a heap, to highest with bisect.insort,
and to GNU sort | tail. As a quick reminder, we're looking at n numbers, and
keeping the m best values (lines):
graph-n.ps graph-m.ps
Here they are again as png images (they'll probably look better as
postscript, but not everyone has a postscript previewer handy):
graph-n.png graph-m.png
Portions of this software are owned by The University of California, Irvine -
and are not under any version of the GPL. GPL is a fine series of licenses, but the partial owners of the software need it to be distributed under these terms.