Note: This web page was automatically created from a PalmOS "pedit32" memo.

Searching for lots of hostnames taken from a file in a large amount of syslog data


The commands:

network-stats-collector-1-strombrg syslog) mkdir /tmp/host-hits

network-stats-collector-1-strombrg syslog) pwd
/target1/syslog
Tue Feb 14 11:44:54

network-stats-collector-1-strombrg syslog) for host in $(cat /tmp/soe-hosts-to-possibly-eliminate-from-sendmail-hiding ); do \
    echo "egrep '$(echo $host | sed -e 's/^/\\</' -e 's/$/\\>/' -e 's/\./\\./g')' > /tmp/host-hits/$host"; \
done > /tmp/commands

network-stats-collector-1-strombrg syslog) cat messages* | \
    reblock -e $[$(du messages* | awk ' { print $1 }' | total)] 65536 300 | \
    mtee -f /tmp/commands

Notes on the above:

Yes, reblock will put some nulls at the end of the data to search, but
they aren't going to match our egrep patterns anyway :)  The reblock is
there to get some idea when the searching will be done.
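
If the total helper used in the size estimate isn't available, the same
sum can be had from awk. This is only a sketch, not the command that was
actually run; sum_first_column is a made-up name for illustration.

```shell
# Hypothetical stand-in for "total": add up the first
# whitespace-separated column of whatever arrives on stdin.
sum_first_column() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Intended use, mirroring the estimate passed to reblock -e:
#   du messages* | sum_first_column
```

The "+ 0" makes awk print 0 rather than an empty line when there is no
input at all.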

Also, you'd think that the machine would get really bogged down by
all the context switching between the large number of egreps, but in
practice, the load average on nsc-1 plateaued at only about 3.1 when
searching for 160 different hostnames concurrently, i.e. 160 concurrent
egreps. I guess Linux just context switches pretty well, despite running
on x86 hardware that doesn't :)

Be careful, as you cut and paste these commands, that any ^'s don't get lost.

Escaping the .'s means that ang.eng won't match ang@eng.

Using \< and \> means that ang.eng.uci.edu won't also match
yang.eng.uci.edu, for example.
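
To see what the sed expressions in the for loop build, and how the
resulting pattern behaves, here is a small sketch. make_pattern is a
made-up wrapper name around the same sed expressions; \< and \> are GNU
regex extensions, so this assumes GNU grep (grep -E being the current
spelling of egrep).

```shell
# Hypothetical wrapper around the sed expressions from the for loop:
# anchor the hostname at word boundaries and escape its dots.
make_pattern() {
    printf '%s\n' "$1" | sed -e 's/^/\\</' -e 's/$/\\>/' -e 's/\./\\./g'
}

pattern=$(make_pattern ang.eng.uci.edu)
echo "$pattern"        # the pattern that ends up in /tmp/commands

# Matches the hostname itself...
echo 'relay=ang.eng.uci.edu' | grep -E "$pattern"
# ...but not a longer hostname that merely ends with it.
echo 'relay=yang.eng.uci.edu' | grep -E "$pattern" || echo 'no match'
```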

This method made it through all 160 egrep's with the following performance:
(estimate: 99.9%  3s) Kbytes: 3593024.0  Mbits/s: 9.3  Gbytes/hr: 4.1
min: 50.0
Tue Feb 14 13:49:15
