[Nagiosplug-devel] RFC: New threshold syntax

Vonnahme, Nathan nathan.vonnahme at bannerhealth.com
Thu Mar 20 20:39:19 CET 2008


Even in 2008, Outlook badly quotes:
> -----Original Message-----
> From: nagiosplug-devel-bounces at lists.sourceforge.net
[mailto:nagiosplug-
> devel-bounces at lists.sourceforge.net] On Behalf Of Ton Voon
> Sent: Wednesday, March 19, 2008 9:29 AM
 
> So this problem can be broken down into:
>    - how to specify ranges
>    - how to specify thresholds


I like Ton's proposal (and the fact that he has written code to
implement it!), but I think it's still a little confusing.

When I first started working with plugins (especially trying to write my
own with N::P), I was confused about 2 things, which correspond to the
two parts of the problem:

First, many of the plugins assume the numbers you're giving them are
maximums (x), but secretly they are always translated into ranges
(-infinity .. x).  Ton's range syntax (min:max) does as well as any at
expressing the ranges, and it would be good to replace the implicit
ranges (x secretly means -infinity..x) with explicit ones to keep
readers and writers of plugin args straight.
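
To make the implicit-versus-explicit point concrete, here is a minimal Python sketch of a range parser. The function name `parse_range` is hypothetical (it is not part of N::P or any plugin), and the treatment of a bare number as -infinity..x simply follows the description above rather than any particular plugin's behaviour.

```python
import math


def parse_range(spec):
    """Parse a range argument into a (low, high) tuple of floats.

    Assumption: a bare number 'x' is the legacy implicit form and is read
    as -infinity..x. The explicit form is 'min:max', where either end may
    be 'inf' or '-inf'.
    """
    if ":" not in spec:
        # Legacy implicit range: the single number is the maximum.
        return (-math.inf, float(spec))
    low, high = spec.split(":", 1)
    if low == "" or high == "":
        # Insist that both sides are written out.
        raise ValueError("both ends of the range must be given: %r" % spec)
    low, high = float(low), float(high)
    if low > high:
        raise ValueError("range start is greater than range end: %r" % spec)
    return (low, high)


print(parse_range("10"))       # legacy implicit form
print(parse_range("-inf:10"))  # the same range written explicitly
print(parse_range("300:500"))
```

The first two calls yield the same tuple, which is the whole point: the explicit spelling says what the implicit one only implies.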

But the second, and most ambiguous, part of the old (and still the new)
syntax is whether the range you're defining constitutes "normal" or
"abnormal".  If you're looking at one of the above definitions for the
first time, is that clear?   When I first started messing with plugins,
the inside/outside and OK/warning/critical meanings of the ranges kept
flipflopping in my brain, and I ended up writing a bunch of tests in
Perl to figure it out.

Part of it is that it's natural to think sometimes in terms of the
normal range, and sometimes in terms of the abnormal range(s).

What you might mean, in English, is something like:

Get this URL.  Warn me if it's over 1 day old or too big.  But it's
critical if it is too small or produces an HTTP error code or the check
fails, or it takes over 5 seconds to get the response.

You could also express it like this:

Get this URL.  It's OK for it to
	return an HTTP 200 code
		(it's CRITICAL otherwise)
  and
	be less than 1 day old
		(just WARN me if it's over 1 day old; it's never CRITICAL)
  and
	be between 300 and 500 bytes
		(I mean, WARN me if it's over 500 but it's CRITICAL if
		it's under 300)
  and
	take less than 5 seconds to respond
		(otherwise it's CRITICAL)


How do we clearly communicate that to the plugin we're running?
It looks like, under Ton's RFC, in the interest of "convention over
configuration", the plugins assume:
	* you're giving it the abnormal range for each check ["alert is
	  raised if value is inside start and end range (inclusive of
	  endpoints)"]
	* but you can specify the OK range by negating the range with ^.
	  The ^ flips whether you're defining the normal or abnormal range.
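
The two assumptions above can be sketched in a few lines of Python. This is only my reading of the convention as quoted here, not Ton's actual code: the function name `alerts` is hypothetical, and treating an omitted end as infinite is an assumption of the sketch.

```python
def alerts(spec, value):
    """Return True if `value` should raise an alert for the range `spec`.

    Assumed reading of the convention: 'start:end' alerts when
    start <= value <= end (endpoints included); a leading '^' inverts
    that, so the range then describes the OK values instead.
    Assumption of this sketch: an omitted start or end means infinity.
    """
    inverted = spec.startswith("^")
    if inverted:
        spec = spec[1:]
    start, end = spec.split(":", 1)
    low = float(start) if start else float("-inf")
    high = float(end) if end else float("inf")
    inside = low <= value <= high
    return (not inside) if inverted else inside


print(alerts("500:", 600))      # abnormal range given: inside it, so alert
print(alerts("^300:500", 400))  # normal range given: inside it, so no alert
print(alerts("^300:500", 600))  # normal range given: outside it, so alert
```

The same value and the same numbers give opposite answers depending on one character, which is why the meaning is easy to lose track of.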

So if I've got Ton's RFC right, I could express the above this way:
	check_http -H $HOSTADDRESS$ --size=:300/500:b --age=/1:d --time=5:s
or alternatively,
	check_http -H $HOSTADDRESS$ --size=^300:500/500:b --age=/^0:1d --time=^0:5s

Maybe I'm warped by using the ok() style test functions in Perl, but the
first way seems backwards.  The second way (defining "normal") makes
more sense to me, and because the default is the opposite, I keep
forgetting when reading or writing the arguments whether I'm telling it
"normal" or "abnormal".

Also, it is usually better to enumerate the "good" results and flag
abnormal exceptions than to try to enumerate badness (See `perldoc
perlsec` or
http://www.ranum.com/security/computer_security/editorials/dumb/).

So I'd like to suggest less ambiguous option names and three more
conventions:
	1. we start by defining "normal" (OK), and assume everything else
	   is CRITICAL
	2. we have to explicitly differentiate between CRITICAL and "just"
	   WARNING
	3. we always explicitly write both sides of the range so it looks
	   like a range (x:y) instead of a number with weird punctuation
	   (x: or :y)

So a sample syntax would be:

	check_http -H $HOSTADDRESS$                 \
	  --size_ok=300:500b  --size_warn=500b:inf  \
	  --age_ok=0:1d       --age_warn=1d:inf     \
	  --time_ok=0:5s

(The upper bounds are a bare "inf" because "infb" and "infd" seem
problematic.)

That seems more readable to me; does anyone else think so?  The location
of "ok"/"warn" is negotiable; it would also work to do 
	--size=ok(300:500b),warn(500b:inf)
or maybe
	--size=ok(300:500),warn(500:inf),bytes
or
	--size=ok(300:500,bytes),warn(500:inf,bytes)
	--time=ok(0:5,seconds)

Which is getting quite languageish actually.  
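
To show how little "language" this really needs, here is a rough Python sketch that parses only the second variant above, `ok(300:500),warn(500:inf),bytes`. Everything in it is hypothetical: the function name `parse_threshold`, the returned dictionary layout, and the rule that a bare word is the unit are assumptions made for illustration.

```python
import re

# One clause such as ok(300:500) or warn(500:inf).
CLAUSE = re.compile(r"(ok|warn)\(([^:(),]+):([^:(),]+)\)")


def parse_threshold(arg):
    """Parse e.g. 'ok(300:500),warn(500:inf),bytes' into a dict.

    Assumed format: comma-separated ok(...) / warn(...) clauses, each
    with both range ends written out, plus at most one bare unit word.
    """
    parsed = {"unit": None}
    for part in arg.split(","):
        match = CLAUSE.fullmatch(part)
        if match:
            label, low, high = match.groups()
            parsed[label] = (float(low), float(high))
        elif part.isalpha() and parsed["unit"] is None:
            parsed["unit"] = part
        else:
            raise ValueError("cannot parse %r" % part)
    return parsed


print(parse_threshold("ok(300:500),warn(500:inf),bytes"))
print(parse_threshold("ok(0:5),seconds"))
```

The variants that put the unit inside the parentheses would need a slightly smarter split, since the commas then appear inside the clauses as well.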

Even though these suggestions would mean a few more characters to type,
I would be more likely to write it correctly the first time, which is
the real time saver :)  

I guess a side effect is that inversion (which could still use ^, or how
about "not" or "outside", e.g. --size=ok(not 3:5) ) would be rare.  The
critical threshold would never need to be explicitly defined: it would
be calculated by inverting the OK definition, with the warning threshold
defined independently.
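
As a rough Python sketch of those conventions: the function name `status` and the tuple representation of ranges are made up for illustration, and two details are assumptions of the sketch, namely that range endpoints are inclusive and that the OK range is checked before the WARNING range.

```python
import math

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def status(value, ok, warn=None):
    """Classify `value` given an OK range and an optional WARNING range.

    Ranges are (low, high) tuples with inclusive endpoints (an assumption).
    OK is checked first, so a value on a shared boundary counts as OK.
    Anything that is neither OK nor WARNING is CRITICAL; the critical
    range is never written down.
    """
    if ok[0] <= value <= ok[1]:
        return OK
    if warn is not None and warn[0] <= value <= warn[1]:
        return WARNING
    return CRITICAL


# --size_ok=300:500b --size_warn=500b:inf
size_ok, size_warn = (300, 500), (500, math.inf)
print(status(400, size_ok, size_warn))   # inside the OK range
print(status(9000, size_ok, size_warn))  # too big: only a warning
print(status(100, size_ok, size_warn))   # too small: critical by default

# --time_ok=0:5s, with no warning range at all
print(status(7, ok=(0, 5)))
```

Note that "too small" never had to be spelled out: it falls out of being neither OK nor WARNING.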

-n





