Most people who have used Nagios for a while have encountered the
difficulty of a host or service that comes and goes frequently,
generating a mound of warnings to your e-mail, to your instant
messaging service, to your cell phone, and so on.
Nagios includes experimental "flap detection". That is one approach
to solving the problem. This page presents another approach: A script
called "rate-limit", which is suitable for calling from procmail.
Light theory:
First we add Nagios notices to a file in /tmp, via a call to this
program from procmail (or another facility for conditionally piping to programs).
Then, periodically, we batch those notices into one summary
message and e-mail it somewhere (like to a cell phone), again by
calling this program, but this time from cron.
The cron job also removes the summarized messages from the file in
/tmp to avoid resending them.
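The queue-then-drain idea above can be sketched in a few lines of Python. This is only an illustration, not the rate-limit script itself: the file name, the separator string, and the function names add_notice and drain_notices are all hypothetical, since the page only says that notices go into "a file in /tmp".

```python
import os
import tempfile

# Hypothetical queue location; the page only says "a file in /tmp".
QUEUE = os.path.join(tempfile.gettempdir(), "rate-limit-queue-example")
# Hypothetical delimiter written after each queued notice.
SEPARATOR = "\n--- end of notice ---\n"


def add_notice(text, path=QUEUE):
    """Append one notice to the queue file (roughly the -a step)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text.rstrip("\n") + SEPARATOR)


def drain_notices(path=QUEUE):
    """Return all queued notices and empty the queue (roughly the -r step)."""
    try:
        with open(path, "r+", encoding="utf-8") as f:
            data = f.read()
            f.seek(0)
            f.truncate()  # forget what we are about to summarize
    except FileNotFoundError:
        return []  # nothing has been queued yet
    return [n for n in data.split(SEPARATOR) if n.strip()]
```

Because procmail and cron can run at the same moment, a real implementation would probably also want some form of file locking around both steps; the sketch leaves that out for brevity.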
Usage is like the following:
dcs-strombrg> ./rate-limit
./rate-limit -a                # adds a message on stdin
./rate-limit -r                # emails a summary of all previously
                               added hosts since the previous -r
./rate-limit -h                # help
./rate-limit -f From_address
./rate-limit -t To_address
./rate-limit -s Subject
This program collects 1 or more nagios notices (fed
in with option -a and reading from stdin), and
later summarizes those notices with -r. -r
requires -f, -t and -s for the From address, To
address and Subject of the summary e-mail message.
The e-mail message is suitable for sending to your
own inbox, or for sending on to a cell phone or pager
Basically, you want to call rate-limit -a from your .procmailrc
with something like the following, to queue up notices:
:0 c
* ^From.*sender@host.com
* ^Subject:.*(Host DOWN alert|Host UP alert|PROBLEM alert|RECOVERY alert)
| $HOME/bin/rate-limit -a
The "c" on the first line tells procmail to hand a copy of the
message to this rule's action (the rate-limit -a pipe), while
passing another copy of the same message on to the procmail rules
below the current rule. In other words, the mail is handled in two
ways: once by this rule, and once by any relevant rules that
follow. If you drop the "c", then only this rule should see the
message.
And you want to call rate-limit -r from cron, to remove messages
from the queue and send an e-mail message summarizing them:
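The crontab line itself does not appear in this copy of the page. A hypothetical entry, pieced together from the options in the usage text, the $HOME/bin path used in the procmail rule, and the description that follows, might look like this. Treat it as a reconstruction under those assumptions, not as the original line:

```
# Hypothetical reconstruction: send a summary every 20 minutes
0,20,40 * * * * $HOME/bin/rate-limit -r -f Nagios -t pager@host.com -s Subject
```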
You can add such a line to cron with crontab -e. The
message will be set up to look like it comes from "Nagios",
will go to "pager@host.com", and will have the subject
"Subject". This cron job will send a summary, if any
notices have been queued, once every 20 minutes. Other intervals
may work better for you, and some may want to only send summaries
during the day, or during the workweek.
Sample output looks like:
Tue Sep 07 08:40
esmf.ess                        remote-df OK 2* CRIT 1
                                ping OK 2* CRIT 2
                                nfs OK 2* CRIT 1
                                ssh OK 1* CRIT 1
                                loadl OK 1* CRIT 1
tokyo.nac                       https CRIT 1*
gram.eng                        host UP 1* DWN 1
liter.eng                       host UP 1* DWN 1
meter.eng                       host UP 1* DWN 1
network-stats-collector-1.nacs  yup OK 1* CRIT 1
network-stats-collector-3.nacs  yup OK 2* CRIT 3
This summary describes 21 different Nagios notices. As
such it's too big to fit in a single message on my cell phone;
in fact, it's actually spread across 3 messages. Then again, 3
messages is still a lot better than 21.
The date is the time the message is put together, not the time of
any of the events.
If a host has experienced events related to more than one
service, then the first service will be on the same line as the
hostname, while the other services will be on following lines,
until the next hostname appears.
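The layout rule just described (hostname printed once, further services for the same host on following lines) could be produced along these lines. This is a minimal sketch, not the script's actual formatting code; the function name format_summary and the input shape are assumptions made for illustration.

```python
def format_summary(rows):
    """Lay out (host, service_text) pairs, which are assumed to be
    already grouped by host, so that the hostname is printed only on
    the first line of each group."""
    width = max((len(host) for host, _ in rows), default=0)
    lines = []
    previous = None
    for host, text in rows:
        # Blank out the hostname when it repeats the previous line's host.
        label = host if host != previous else ""
        lines.append(f"{label:<{width}} {text}")
        previous = host
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_summary([
        ("esmf.ess", "remote-df OK 2* CRIT 1"),
        ("esmf.ess", "ping OK 2* CRIT 2"),
        ("tokyo.nac", "https CRIT 1*"),
    ]))
```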
Each number is the count of times a particular state was
encountered for a given service on a given host.
The * indicates which state was encountered most
recently. Please note that this is computed using the Date:
header of the event e-mails.
As it happens, each service's states should always appear in
reverse chronological order of the most recent time each
state was seen. As a result, the * should always
be on the first state for a given service.
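The counting and ordering described above can be illustrated with a short Python sketch. It is not the script's real code: the function name summarize and the (date_header, host, service, state) input tuples are hypothetical, and it assumes every Date: header carries an explicit time zone so the parsed times can be compared.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime


def summarize(events):
    """events: iterable of (date_header, host, service, state) tuples.

    Returns {(host, service): "STATE n* STATE m ..."}, with the states
    ordered by the most recent time each was seen, newest first, and
    the newest one marked with a *."""
    counts = defaultdict(lambda: defaultdict(int))
    latest = defaultdict(dict)
    for date_header, host, service, state in events:
        when = parsedate_to_datetime(date_header)  # parse the Date: header
        key = (host, service)
        counts[key][state] += 1
        if state not in latest[key] or when > latest[key][state]:
            latest[key][state] = when
    summary = {}
    for key, seen in latest.items():
        ordered = sorted(seen, key=seen.get, reverse=True)
        parts = [f"{state} {counts[key][state]}" for state in ordered]
        parts[0] += "*"  # the first state is the most recently seen one
        summary[key] = " ".join(parts)
    return summary
```

Because the states are sorted newest first before the * is attached, the marker necessarily lands on the first state of each service, which matches the behaviour noted above.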