[Nagiosplug-help] How to setup "delayed" host down/unreachable notifications with e.g. check_icmp?

Andreas Ericsson ae at op5.se
Mon Jan 8 13:20:22 CET 2007


Ralph.Grothe at itdz-berlin.de wrote:
> Dear Nagios Users,
> 
> 
> As check_command in them I have used the check_host designation
> of the check_icmp plugin (i.e. hard or soft link).
> 
> In my host and service definitions also notifications_enabled is
> 1, all notification_options apart from flapping
> (whose detection I disabled globally in nagios.cfg) are set (i.e.
> d,u,r for hosts, and w,u,c,r for services),
> check as well as notification periods are 24x7, and
> max_check_attempts for both is 5.
> 
> Only for the contacts did I set host_notification_options to n.
> This was because otherwise there could be the peril of down or
> unreachable host notification
> floods in case a host was unpingable for a relatively short time,
> like during a quick reboot
> or network or router outage with respect to the route from the
> nagios server
> (or would this already account for flapping?).
> 

No, flapping is when something changes state more than (insert proper 
variable name for flapping-percentage here) percent of the time over the 
last 21 executions of its check.
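
As a rough illustration of that rule, here is a minimal Python sketch. It
is not Nagios code: the function names and the threshold value are made up
for the example, and it counts every state change equally, whereas real
Nagios weights recent changes more heavily than older ones.

```python
def percent_state_change(states):
    """Unweighted percent of state changes over the last 21 check results."""
    history = states[-21:]          # only the 21 most recent results count
    possible = len(history) - 1     # 21 results allow at most 20 changes
    if possible <= 0:
        return 0.0
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return 100.0 * changes / possible


def is_flapping(states, threshold_percent):
    """True if the state changed more often than threshold_percent allows."""
    return percent_state_change(states) > threshold_percent


# A single short outage (UP -> DOWN -> UP) is only two changes out of 20.
reboot = ["UP"] * 10 + ["DOWN"] * 3 + ["UP"] * 8
reboot_percent = percent_state_change(reboot)

# A host that alternates on every check changes state 20 times out of 20.
bouncing = ["UP" if i % 2 == 0 else "DOWN" for i in range(21)]
bouncing_percent = percent_state_change(bouncing)
```

So a single quick reboot produces only a small percentage here, which is
why it would not normally count as flapping.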

> 
> With  these in place this is what happens for example if I down a
> NIC on a host temporarily:
> 
> [1168094094] HOST ALERT: tiber;DOWN;SOFT;1;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094105] HOST ALERT: tiber;DOWN;SOFT;2;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094116] HOST ALERT: tiber;DOWN;SOFT;3;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094127] HOST ALERT: tiber;DOWN;SOFT;4;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094139] HOST ALERT: tiber;DOWN;HARD;5;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094139] SERVICE ALERT:
> tiber;icmp-host-alive;CRITICAL;HARD;1;CRITICAL - 123.123.123.123:
> rta nan, lost 100%
> 
> 
> From the docs I would have assumed that a service notification
> would be emitted
> because the icmp-host-alive service transited right into a hard
> critical state (i.e. hard state change).
> But this didn't happen.
> 

Service notifications are suppressed for hosts that are down. This is to 
prevent a flood of notifications when hosts go down.


> Admittedly, even such a service notification wouldn't alleviate
> anything as it would still come too early,

Precisely, and you wouldn't get just one, but several notifications (one 
for each service).

> 
> On the other hand, once a host was confirmed to be down (or
> unreachable) 
> I would assume that nagios wouldn't schedule the icmp-host-alive
> service for this host anymore
> but instead reattempt own (randomly?) scheduled host checks until
> one host_check packet returned OK
> and relapsed to a host HARD OK state, which in turn would
> reactivate regularly scheduled service checks.
> 
> While that host was down (until I upped the NIC again) also no
> other service checks
> were performed (which seems quite in order, because what sense
> would they make).
> The downside however was, that as well not a single notification
> about the sudden unavailability of
> any service related to this host was sent to configured contacts.
> So such an outage would at worst pass totally unnoticed by the
> responsible admins
> which defeats the whole purpose of monitoring.
> 

Errors passing unnoticed certainly defeats the purpose of monitoring, 
but you explicitly told Nagios not to notify you about host down events, 
so it's doing The Right Thing(tm).


> So how can one reconcile the seemingly contradicting requirements
> of delayed host down notifications
> and service critical notifications?
> 

Enable host down notifications for all contacts.

To prevent host notifications going out for temporary glitches (e.g. a 
reboot), use the  first_host_notification_delay patch coded by Mathias 
Sundman and sent in by me to the nagios-devel list. You'll find it in 
the archives somewhere and it has been incorporated into the Nagios 3 
codebase.

What it does, basically, is to add a new variable called 
first_host_notification_delay to host objects. When a host goes down, 
the first notification for that host is delayed until *at least* the 
configured time has passed. I say at least, because nagios doesn't even 
look at the value until it does another check of the same host and 
notices that it's still down. If, by then, first_host_notification_delay 
* interval_length seconds have passed, it will send a notification.
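
The timing rule can be sketched in a few lines of Python. This is only an
illustration of the logic described above, not the patch itself: the
function name is made up, I assume the delay is counted from the moment the
host was first seen down, and interval_length is taken to be 60 seconds.

```python
def first_notification_due(down_since, check_time,
                           first_host_notification_delay,
                           interval_length=60):
    """Decide, at the time of a host check that still finds the host down,
    whether the first notification may be sent.

    down_since and check_time are Unix timestamps; the delay is given in
    units of interval_length seconds.
    """
    elapsed = check_time - down_since
    return elapsed >= first_host_notification_delay * interval_length


# Timestamps taken from the log excerpt quoted earlier in this mail.
down_since = 1168094094          # first DOWN result for tiber
hard_down_check = 1168094139     # fifth check, 45 seconds later

# With a delay of 5 (5 * 60 = 300 seconds) nothing is sent yet.
too_early = first_notification_due(down_since, hard_down_check, 5)

# A later check, 300 seconds after the host went down, may notify.
later_check = down_since + 300
due = first_notification_due(down_since, later_check, 5)
```

Since the test only runs when a host check runs, the notification goes out
at the first check on or after the delay, never in between checks.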

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231



