[Nagiosplug-devel] RFC: New threshold syntax

Vonnahme, Nathan nathan.vonnahme at bannerhealth.com
Thu Apr 3 20:04:55 CEST 2008


The updated RFC looks good, Ton.  I especially like the clarification
and definition of the term 'level', and the published state calculation
rules.  I think the simple

I think you should add a section of other examples, because they
demonstrate the readability and consistency which you're trying to
achieve.  Here are some based on our Nagios conf (the first one is your
HTTP example).  I don't know if we need to include the old examples.
Also, I suppose the last one is how it would apply to check_load, right?


# check that the host's HTTP response time, size and age are within
# "normal" ranges (it will be CRITICAL otherwise)
	check_http -H $HOSTADDRESS$ \
	  --time=ok=0..5,uom=s      \
	  --size=ok=10..inf,uom=kb  \
	  --age=ok=0..1,uom=d

### BTW, I like that this makes you think, "hrm, maybe infinite size
### would be a Bad Thing", unlike the old -m 10000


# httpd processes are OK if the virtual size is no more than 8096 bytes.
# WARN until they reach 16182, but bigger than that is CRITICAL.
# old:
	check_procs -w 8096 -c 16182 -C httpd --metric VSZ
# new:
	check_procs -C httpd      \
	  --vsize ok=0..8096,warn=8097..16182


# there should always be one and only one 'tnslsnr' process.  Otherwise
# it's CRITICAL.
# old:
	check_procs -w 1:1 -c 1:1 -C tnslsnr
# new:
	check_procs -C tnslsnr --count ok=1..1


# load averages (1,5,15 minute) should be within reasonable ranges.
# old:
	check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0
# new:
	check_load                          \
	  --1min=ok=0..1.0,warn=1.0..1.5    \
	  --5min=ok=0..0.8,warn=0.8..1.3    \
	  --15min=ok=0..0.7,warn=0.7..1.0


That makes me think also about endpoints.  On that check_load example we
should test that a 1 minute load of 1.000 is OK but 1.01 means
WARNING... 

Do we maybe expect that a simple "OK" definition is inclusive of
endpoints, but the warn/critical is exclusive?  

Or is it that overlapping ranges should be evaluated from better to
worse, so that ok=3..5,warn=2..6 works as expected?  That is, given
this overlapping threshold definition:

	ok=3..5,warn=2..6

we expect these values to give these results:

	0:  CRITICAL
	1:  CRITICAL
	2:  WARN
	3:  OK
	4:  OK
	5:  OK
	6:  WARN
	7:  CRITICAL
	8:  CRITICAL

I think if you swap rules #3 and #4 it would evaluate that example as
expected, and it also solves the overlapping endpoints above.

I think you should also change rule #1 (no levels specified) to return
UNKNOWN.
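
To make the better-to-worse evaluation concrete, here is a minimal Python
sketch of it.  This is only an illustration, not code from the RFC or from
any plugin.  It assumes that ranges are inclusive at both ends, that levels
are tried in the order ok, warn, crit, that a value matching no level falls
through to CRITICAL, and that a threshold with no levels at all returns
UNKNOWN as suggested for rule #1.  The names parse_levels and evaluate, and
the level name "crit", are hypothetical; other keys such as uom are ignored.

```python
# Assumed level names, ordered from better to worse ("crit" is hypothetical).
LEVEL_ORDER = (("ok", "OK"), ("warn", "WARN"), ("crit", "CRITICAL"))


def parse_levels(spec):
    """Parse 'ok=3..5,warn=2..6' into {'ok': (3.0, 5.0), 'warn': (2.0, 6.0)}.

    Keys that are not level names (for example 'uom=s') are skipped.
    """
    known = {name for name, _ in LEVEL_ORDER}
    levels = {}
    for part in spec.split(","):
        if "=" not in part:
            continue
        name, rng = part.split("=", 1)
        if name not in known:
            continue
        low, high = rng.split("..", 1)
        # float() accepts 'inf', so 'ok=10..inf' needs no special case.
        levels[name] = (float(low), float(high))
    return levels


def evaluate(spec, value):
    """Return the state for value, trying levels from better to worse."""
    levels = parse_levels(spec)
    if not levels:
        # Assumption: no levels specified means UNKNOWN (the rule #1 change).
        return "UNKNOWN"
    for name, state in LEVEL_ORDER:
        if name in levels:
            low, high = levels[name]
            if low <= value <= high:  # inclusive endpoints (assumption)
                return state
    # Assumption: a value outside every listed range is the worst state.
    return "CRITICAL"


if __name__ == "__main__":
    for v in range(9):
        print(v, evaluate("ok=3..5,warn=2..6", v))
```

Under these assumptions ok=3..5,warn=2..6 gives the table above for the
values 0 through 8, and ok=0..1.0,warn=1.0..1.5 gives OK for a load of 1.0
and WARN for 1.01, because the ok range is checked first and so wins at the
shared endpoint.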


-n




