[Nagiosplug-devel] RFC: New threshold syntax

Vonnahme, Nathan nathan.vonnahme at bannerhealth.com
Tue Apr 1 00:55:57 CEST 2008


> From: nagiosplug-devel-bounces at lists.sourceforge.net
> [mailto:nagiosplug-devel-bounces at lists.sourceforge.net]
> On Behalf Of Matthias Eble

> > Ton could imagine some helper functions (cmdline, web pages, google
> > calculator) to verify complex thresholds
> 
> That could also be part of the library so every plugin could have a
> dryrun option to print which values would cause what. Based on the
> defined thresholds, (for example x:y) one could test/print what rc the
> values x,y,x+1,x-1,y+1,y-1 would cause.
> 

I *really* like that idea, Matthias!  It might be tricky for plugins
like check_procs or check_http that are checking multiple parts of
another program's output.
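
A rough sketch of what such a dry run could look like for a single ok
range x..y. Everything here is hypothetical: the function names are not
part of any existing plugin library, and inclusive bounds plus
"anything outside both ranges is critical" are assumptions.

```python
# Hypothetical sketch; none of these names exist in the plugin library.
OK, WARNING, CRITICAL = 0, 1, 2

def in_range(value, low, high):
    """True if low <= value <= high (bounds assumed inclusive)."""
    return low <= value <= high

def return_code(value, ok_range, warn_range):
    """OK inside ok_range, WARNING inside warn_range, otherwise CRITICAL."""
    if in_range(value, *ok_range):
        return OK
    if in_range(value, *warn_range):
        return WARNING
    return CRITICAL

def dry_run(ok_range, warn_range):
    """Return (value, rc) pairs for x-1, x, x+1, y-1, y, y+1 of the ok range."""
    x, y = ok_range
    probes = [x - 1, x, x + 1, y - 1, y, y + 1]
    return [(v, return_code(v, ok_range, warn_range)) for v in probes]

if __name__ == "__main__":
    # Thresholds borrowed from the check_procs --count example: ok=6..13,warn=4..18
    for value, rc in dry_run((6, 13), (4, 18)):
        print(f"{value:>4} -> rc {rc}")
```

This only covers a single metric; a plugin that checks several metrics
would need one such table per metric.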

> > - Is it necessary to allow multiple ranges per thresh warn=10:20,50:60?
> 
> The Performance data definition doesn't permit this up to now but I
> could imagine some people would like to see this.

I'd say "not necessary" -- there are workarounds (like using inversion,
or two checks) for the few cases where you'd want this, and supporting
it would be complex. 


> > - Should thresholds be defined ok/warn rather than warn/crit?
> 
> I like the approach but this means not only the syntax is changed.
> People need to start thinking when converting.

We don't need to abandon or break the warn/critical options, although at
some distant point it might be good to move away from the -W and -C
syntax.

Specifying "normal" and flagging exceptions instead of trying to
"enumerate badness" is a good practice in many areas (testing, security,
quality control).  

I think it's also how sysadmins think about their systems, right?  "On
this machine, the disk is normally 50-80% full."  If you only think in
terms of warn/critical, you might only think about the upper boundary,
and have alerts when usage goes over 80%.  But if you specify "OK", you
may get an unexpectedly valid alert one day when your disk is suddenly,
mysteriously, 1% full :)
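
To make the difference concrete, here is a small sketch comparing the
two ways of thinking. The function names and the 50-80% figures are
illustrative assumptions taken from the example above, not plugin
behaviour.

```python
def alert_on_upper_bound(used_pct, warn_above=80):
    """'Enumerate badness': alert only when usage exceeds the upper limit."""
    return used_pct > warn_above

def alert_outside_ok(used_pct, ok_low=50, ok_high=80):
    """'Specify normal': alert whenever usage leaves the ok range."""
    return not (ok_low <= used_pct <= ok_high)

if __name__ == "__main__":
    for used in (1, 65, 90):
        print(used, alert_on_upper_bound(used), alert_outside_ok(used))
```

Only the second function raises an alert for the disk that is
suddenly 1% full.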

(actually that's also why check_disk's "free space" (instead of used
space) approach has often confused me, though I can see several good
reasons for it)

> > - Should there be an explicit range limit (10:inf over 10:)
> 
> 10:inf or 10::inf looks cleaner to me.

I am always in favor of explicit (10:inf or 10..inf), because it
optimizes reading, which you do more often than writing, and because
newcomers read examples before they write.
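
As a sketch of what "explicit only" could mean for a parser: the
hypothetical parse_range below accepts 10..inf but rejects a range with
a missing bound. The name, the ".." separator choice and the error
behaviour are assumptions for illustration.

```python
def parse_range(text):
    """Parse 'LOW..HIGH' into a (low, high) tuple of floats.

    Both bounds must be written out; 'inf' and '-inf' are accepted
    because Python's float() understands them.
    """
    low, sep, high = text.partition("..")
    if not sep or not low or not high:
        raise ValueError(f"range needs two explicit bounds: {text!r}")
    return float(low), float(high)

if __name__ == "__main__":
    print(parse_range("10..inf"))
    print(parse_range("-inf..5"))
```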

> > - Is it favorable to have multiple range styles like
> >    1<x<10 *and* 1:10 *and* ... in parallel?
> 
> Not if you ask me.

Agreed!  And Thomas is right-- if you hate the supported syntax you can
always write a script or utility to run the plugins or generate the
options for you.  The extremely lazy typists out there can also probably
use various macro-like utilities to overcome any gratuitously explicit
characters :)

> Since it looks like the default alerting mechanism will be "inside",
> default range behaviour for plain numbers (X gets 0:X) should be
> reversed, too. So X will result in X:inf instead of 0:X
> Or should we drop those plain thresholds completely?

I'd like to see plain thresholds go away eventually, because in some
existing cases X means 0:X or -inf:X and in others it means X:inf.
Also, I think it's important to get users thinking in terms of ranges
rather than single numbers.

> What about mixing uom-prefix in one range? Might this be needed in the
> future?

-1


> At the moment, my favourite threshold/range definition is following:
>    --throughput ok=1..5/M,warn=1..300/M/B

Let's check the readability of some examples:

check_http https://foo.com                \
--time ok=0..5/s,warn=5..10/s             \
--size ok=3..5/kB                         \
--ssl-expiry ok=28..inf/d,warn=14..28/d


old:  check_procs -w 8096 -c 16182 -C httpd --metric VSZ
new:  check_procs -C httpd --vsize ok=0..8096,warn=8096..16182

old: check_procs -w 6:13 -c 4:18 -u mqm -a AKBLD
new: check_procs -u mqm -a AKBLD --count ok=6..13,warn=4..18
 (I'm not sure whether that overlapping warn range would work) 

old: check_procs -w 1:1 -c 1:1 -C tnslsnr
new: check_procs -C tnslsnr --count ok=1..1 
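
A minimal sketch of how such a spec string might be parsed and
evaluated. All names are hypothetical, bounds are assumed inclusive,
the unit suffix is kept but not interpreted, and "the least severe
matching range wins" is an assumption. Under that assumption the
overlapping warn range in the second check_procs example would work.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LEVELS = {"ok": OK, "warn": WARNING}

def parse_spec(spec):
    """Parse 'ok=6..13,warn=4..18' into [(level, low, high, unit), ...]."""
    rules = []
    for part in spec.split(","):
        name, _, body = part.partition("=")
        bounds, _, unit = body.partition("/")   # optional '/unit' suffix
        low, _, high = bounds.partition("..")
        rules.append((LEVELS[name], float(low), float(high), unit))
    return rules

def evaluate(value, rules):
    """Least severe matching range wins; no match at all means CRITICAL."""
    matches = [level for level, low, high, _ in rules if low <= value <= high]
    return min(matches, default=CRITICAL)

if __name__ == "__main__":
    rules = parse_spec("ok=6..13,warn=4..18")
    for count in (10, 5, 20):
        print(count, evaluate(count, rules))
```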







