[Nagiosplug-devel] Working on testcases

Ton Voon ton.voon at altinity.com
Thu Nov 10 16:51:55 CET 2005


On 9 Nov 2005, at 19:11, Ethan Galstad wrote:

> I'm a bit late into this thread, but here are some of my thoughts...
>
> At least one person should be getting notifications for UNKNOWN
> states, as they can be important.  The UNKNOWN state doesn't really
> have a clear definition, but here's what I think it should be used to
> signify...
>
> 1. Invalid command line args passed to the plugin (i.e. the plugin
> doesn't know what to do).
>
> 2. Internal failures in the plugin itself which prevent it from
> performing a check (e.g. malloc() failures, unexpected system call
> failures, or anything else that needs to be done - but fails - before
> a check can be performed).  As an example, the check_dhcp plugin
> returns an UNKNOWN state if it can't determine the local hardware
> address or bind to port 68.
>
> 3. Nagios will also assign an UNKNOWN state to any
> plugin/script/whatever that either doesn't exist on the filesystem or
> returns a code that is out-of-bounds in accordance with the plugin
> specs.
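
Point 3 can be modelled roughly as below. This is a simplified sketch, not Nagios's actual code: it assumes the plugin is launched directly with Python's `subprocess` module, and `run_plugin` is a hypothetical helper. It uses the plugin spec's return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and treats a missing plugin or any other exit code as UNKNOWN.

```python
import subprocess
import sys

# Return codes defined by the plugin spec.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv):
    """Run a plugin and map its exit status to a state name.

    Hypothetical helper modelling point 3; not Nagios's own implementation.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError:
        # The plugin does not exist or cannot be executed.
        return "UNKNOWN"
    # Anything outside 0-3 (including negative codes from signals) is
    # out of bounds.
    return STATE_NAMES.get(proc.returncode, "UNKNOWN")


if __name__ == "__main__":
    # Fake "plugins" built from the running interpreter.
    print(run_plugin([sys.executable, "-c", "raise SystemExit(2)"]))   # CRITICAL
    print(run_plugin([sys.executable, "-c", "raise SystemExit(42)"]))  # UNKNOWN
    print(run_plugin(["/nonexistent/check_foo"]))                       # UNKNOWN
```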

So the guidelines should be updated with:

"UNKNOWN is for invalid command args or any other failure before the
requested check can be performed, the only exception being hostname
lookups, which should return CRITICAL."
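
As a rough illustration of that wording, here is a sketch of a hypothetical plugin, `check_tcp HOST PORT` (the name, arguments and messages are made up for this example), showing where each state would come from: bad arguments and pre-check failures give UNKNOWN, a failed hostname lookup gives CRITICAL as the proposed exception, and a failed connection (the check itself) gives CRITICAL.

```python
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(args, timeout=10.0):
    """Hypothetical plugin 'check_tcp HOST PORT' following the proposed rule."""
    # Invalid command-line arguments: UNKNOWN.
    if len(args) != 2 or not args[1].isdecimal() or not 0 < int(args[1]) < 65536:
        return UNKNOWN, "UNKNOWN - usage: check_tcp HOST PORT"
    host, port = args[0], int(args[1])

    # Hostname lookup failure: the proposed exception, reported as CRITICAL.
    try:
        addr = socket.gethostbyname(host)
    except OSError:
        return CRITICAL, f"CRITICAL - cannot resolve {host}"

    # Any other failure before the check itself: UNKNOWN.
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot create socket: {exc}"

    # The check itself: a failed connection is CRITICAL.
    with sock:
        sock.settimeout(timeout)
        try:
            sock.connect((addr, port))
        except OSError as exc:
            return CRITICAL, f"CRITICAL - {host}:{port}: {exc}"
    return OK, f"OK - connected to {host}:{port}"


def main():
    # Print the status line and exit with the state as the return code.
    state, message = check_tcp(sys.argv[1:])
    print(message)
    sys.exit(state)
```

Calling `main()` from a script entry point prints the status line and exits with the state code. Keeping the classification in one function that returns `(state, message)` makes the mapping easy to test without touching the network.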

Some example changes based on the advice above:

(1) check_http -H webserver

This returns OK if it can connect to the webserver and returns data.

(2) check_http -H webserver -w 2

This returns OK if it can connect to the webserver and receives data
within 2 seconds. If it cannot connect, this should return UNKNOWN
(currently it returns CRITICAL), because connectivity is not the
metric that was requested to be checked.

(3) check_http -H webserver -r 'string_to_find'

This returns OK if it can reach the server and the returned data
contains the string. If it cannot connect to the server (currently
CRITICAL) or gets a 302 redirection (currently OK?), this should
return UNKNOWN.

(4) check_http -H webserver --pagesize=1000

Returns OK if it can reach the server and the web page size is >= 1000
bytes. If it cannot connect to the server (currently CRITICAL) or gets
a 302 redirection (currently OK), this should return UNKNOWN.

(5) check_http -H webserver --pagesize=1000 -w 2

Returns OK if it can reach the server, the web page size is >= 1000
bytes and the time taken is <= 2 seconds. If it cannot connect to the
server (currently CRITICAL) or gets a 302 redirection (currently OK),
this should return UNKNOWN.
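
The five cases can be condensed into one decision function. This is a sketch under stated assumptions, not check_http's code: `Options`, `Response` and `proposed_state` are hypothetical names, and the states for a missing string (CRITICAL) and a short page (WARNING) are assumptions, since the cases above only say what happens when the check cannot be made.

```python
import re
from dataclasses import dataclass
from typing import Optional

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class Options:
    warn_time: Optional[float] = None   # -w SECONDS
    min_pagesize: Optional[int] = None  # --pagesize=BYTES
    pattern: Optional[str] = None       # -r 'string_to_find'


@dataclass
class Response:
    connected: bool
    status: int = 200
    elapsed: float = 0.0  # seconds
    body: str = ""


def proposed_state(opts: Options, resp: Response) -> int:
    content_check = opts.min_pagesize is not None or opts.pattern is not None
    metric_requested = content_check or opts.warn_time is not None

    if not resp.connected:
        # (1) checks connectivity itself; (2)-(5) cannot measure their metric.
        return UNKNOWN if metric_requested else CRITICAL
    if resp.status == 302 and content_check:
        # (3)-(5): the page being asked about was never fetched.
        return UNKNOWN

    state = OK
    if opts.pattern is not None and not re.search(opts.pattern, resp.body):
        state = max(state, CRITICAL)  # assumed state for a missing string
    if opts.min_pagesize is not None and len(resp.body.encode()) < opts.min_pagesize:
        state = max(state, WARNING)   # assumed state for a short page
    if opts.warn_time is not None and resp.elapsed > opts.warn_time:
        state = max(state, WARNING)
    return state
```

For example, `proposed_state(Options(warn_time=2), Response(connected=False))` gives UNKNOWN for case (2), while the same failure with no options, case (1), gives CRITICAL.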

Is this right? I'm starting to think so. It is now clear to me what
state should be returned, given what is actually being asked to be
checked. A side effect is that it also clearly defines what perf data
should be returned: (2) should return time taken, (4) should return
page size, (5) should return both, whereas (1) and (3) shouldn't
return any.
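
Under that reading, the perf data follows directly from which options were given. A minimal sketch with hypothetical helper names, using the `label=value[UOM]` form from the plugin guidelines (thresholds and min/max fields are left out for brevity):

```python
from typing import Optional


def perfdata(warn_time: Optional[float], min_pagesize: Optional[int],
             elapsed: float, size: int) -> str:
    """Build perf data only for the metrics that were requested."""
    parts = []
    if warn_time is not None:      # cases (2) and (5)
        parts.append(f"time={elapsed:.6f}s")
    if min_pagesize is not None:   # cases (4) and (5)
        parts.append(f"size={size}B")
    return " ".join(parts)         # empty for cases (1) and (3)


def status_line(text: str, perf: str) -> str:
    """Append perf data after a '|' only when there is any."""
    return f"{text} | {perf}" if perf else text
```

So `status_line("HTTP OK", perfdata(2, 1000, 0.5, 1234))` yields `HTTP OK | time=0.500000s size=1234B`, and case (1) yields just `HTTP OK`.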

(There's an issue re: inconsistent arguments. I think it should
probably be something like:

check_http -H webserver --metric pagesize -w 1000 --metric time -w 2

But that's another story.)
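
For what it's worth, that style of arguments is simple to parse if each -w binds to the most recent --metric. The sketch below assumes exactly that; the `--metric` option does not exist in check_http, and `parse_metric_args` is a hypothetical helper. A ValueError here is what a plugin would report as UNKNOWN under the proposed guideline.

```python
def parse_metric_args(argv):
    """Parse the hypothetical '--metric NAME -w VALUE' syntax.

    Each -w applies to the most recent --metric. Raises ValueError on bad
    input, which a plugin would report as UNKNOWN.
    """
    if len(argv) % 2:
        raise ValueError("every option needs a value")
    host = None
    current = None
    thresholds = {}
    # All options take exactly one value, so walk the list in pairs.
    for flag, value in zip(argv[0::2], argv[1::2]):
        if flag == "-H":
            host = value
        elif flag == "--metric":
            current = value
        elif flag == "-w":
            if current is None:
                raise ValueError("-w given before any --metric")
            thresholds[current] = float(value)  # float() also raises ValueError
        else:
            raise ValueError(f"unknown option {flag}")
    if host is None:
        raise ValueError("-H is required")
    return host, thresholds
```

With the example command line, this returns `("webserver", {"pagesize": 1000.0, "time": 2.0})`.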

Back to UNKNOWN - should we do it?

(However, it still doesn't make sense to treat hostname lookups  
differently, but if that's the consensus, I'll go with it.)

Ton

(So much work, so little time ....)


http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
