• This software is owned by the University of California, Irvine, and is not distributed under any version of the GPL. The GPL is a fine series of licenses, but the owners of the software need it to be distributed under these terms.
  • pyindex can index over 80,000 mail messages in less than an hour and twenty minutes (better than 1,000 messages per minute) on a modern Linux system.

    seki-strombrg> pyindex -h
    malloc: using debugging hooks
    Usage: /Dcs/seki/strombrg/bin/pyindex -d databasename [-i] [-s keyword] [-D] [-v] [-a]
    -i says to index all files on stdin
    -s keyword says to list all files containing keyword
    -D says to dump the database contents to stdout
    -v says to operate verbosely (can be repeated)
    -a says to abbreviate (much less disk space - probably faster most of the time too)
    -u says not to abbreviate (much more disk space - probably slower most of the time too
    -A says to use a heuristic to skip lines that look like base64/uuencode/binhex or similar
    -p says to skip indexing pronouns
    -n says to skip indexing numbers (all decimal digits)
    -N says to skip indexing words containing any digits
    
    In any case -d is required
    You must specify exactly one of -i, -s or -D
    
    This program has two engines.  One is abbreviated (-a) and the other is unabbreviated (without -a).
            1) The "unabbreviated" storage mode is very straightforward, so if you need something bug-free that uses tons of space and CPU, use it.  It uses a single database.
            2) The "abbreviated" storage mode is much more complex, and goes to great lengths to reduce storage requirements and runtime
            The abbreviated mode is the default now.
    Tue Jan 31 11:25:49
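
The help text above describes a keyword-to-file index, but it does not show how pyindex builds or stores it. As a rough illustration only, the following Python sketch maps each word to the set of files that contain it, roughly in the spirit of the straightforward "unabbreviated" mode, with filters loosely modeled on -A, -p, -n and -N. Every name here (`build_index`, `search`, `tokens`, `looks_encoded`), the pronoun list, and the 60-character threshold are assumptions made for this example, not pyindex's actual code.

```python
import re
from collections import defaultdict

# Hypothetical pronoun list; the list pyindex really uses is not shown.
PRONOUNS = {"i", "me", "you", "he", "she", "it", "we", "they",
            "him", "her", "us", "them"}

WORD_RE = re.compile(r"[A-Za-z0-9]+")
# Assumed heuristic for -A: a long line made only of base64 characters.
ENCODED_LINE_RE = re.compile(r"^[A-Za-z0-9+/=]{60,}$")


def looks_encoded(line):
    """Guess whether a line is encoded payload rather than prose (cf. -A)."""
    return bool(ENCODED_LINE_RE.match(line.strip()))


def tokens(text, skip_encoded=False, skip_pronouns=False,
           skip_numbers=False, skip_digit_words=False):
    """Yield lowercased words from text, applying the requested filters."""
    for line in text.splitlines():
        if skip_encoded and looks_encoded(line):
            continue
        for word in WORD_RE.findall(line.lower()):
            if skip_pronouns and word in PRONOUNS:                      # cf. -p
                continue
            if skip_numbers and word.isdigit():                         # cf. -n
                continue
            if skip_digit_words and any(c.isdigit() for c in word):     # cf. -N
                continue
            yield word


def build_index(documents, **filters):
    """Map each keyword to the set of document names containing it (cf. -i)."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokens(text, **filters):
            index[word].add(name)
    return index


def search(index, keyword):
    """Return the sorted names of documents containing keyword (cf. -s)."""
    return sorted(index.get(keyword.lower(), ()))


docs = {
    "msg1": "She sent 42 copies of report2 to the list",
    "msg2": "The list archive holds the report",
}
index = build_index(docs, skip_pronouns=True, skip_numbers=True)
print(search(index, "list"))     # ['msg1', 'msg2']
print(search(index, "42"))       # []
print(search(index, "report2"))  # ['msg1']
```

An in-memory dictionary like this is only meant to show the lookup idea; the abbreviated mode described above evidently does considerably more work to cut storage and runtime, and that is not attempted here.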
    

    Download it here


    Timestamp: 2024-03-28 19:07:54 PDT


    You can e-mail the author with questions or comments: