[Nagiosplug-devel] RFC: New threshold syntax

Matthias Eble matthias.eble at mailing.kaufland-informationssysteme.com
Fri Mar 28 16:19:18 CET 2008


Hi all,

After reading and thinking about this thread for hours, I want to 
summarize the discussion so far. I hope I captured the most important 
statements and the participants' intentions.

So here we go:

Max suggested a syntax like
-w '1min>15:5min>5' -c '15min>15:5min>10'

Which has downsides:
- thresholds need to be quoted properly (no problem for the people on 
this list, but annoying anyway)
- it's much harder to read than using longopts:  --load1=/15: 
--load5=10:/5: --load15=15:

One general aim is to make the threshold specification as flexible as 
possible, also to avoid having to run the same plugin multiple times to 
get one job done (like Ton's example of three check_procs services for 
testing one process).

Thomas posted links to
   http://physics.nist.gov/cuu/Units/prefixes.html
   http://physics.nist.gov/cuu/Units/binary.html
containing lists of SI and binary prefixes.
He argues that there should be a list of legal UOMs/prefixes and that 
allowing base8 units should be discussed. Maybe some gnulib code
can be used for conversion (Ton). Andreas later noted that 0.2GiB != 
200MiB, while 0.2GB == 200MB, which should be kept in mind.
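
To make Andreas's point concrete, here is a minimal Python sketch. The
helper name to_base_units and the two prefix tables are made up for
this illustration and are not part of any plugin; exact fractions are
used so that float rounding does not blur the comparison.

```python
from fractions import Fraction

# Decimal (SI) and binary prefix multipliers; only a few are listed here.
SI_PREFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}
BINARY_PREFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_base_units(value, prefix):
    """Convert a prefixed value (given as a string or int) to base units.

    Hypothetical helper for illustration only.
    """
    factor = {**SI_PREFIXES, **BINARY_PREFIXES}[prefix]
    return Fraction(value) * factor


# Decimal prefixes: 0.2 GB and 200 MB are the same number of bytes.
same_decimal = to_base_units("0.2", "G") == to_base_units("200", "M")

# Binary prefixes: 0.2 GiB is 204.8 MiB, not 200 MiB.
same_binary = to_base_units("0.2", "Gi") == to_base_units("200", "Mi")
gib_in_mib = to_base_units("0.2", "Gi") / BINARY_PREFIXES["Mi"]
```

So a threshold parser that accepts fractional values has to look up the
real factor of the prefix instead of shifting the decimal point.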

Ton and Thomas agree that perfdata should be in a fixed UOM and
not the one specified in the thresholds (at least for now):
   - changing the threshold UOM would destroy old graphs
   - defining a base unit should be up to the respective plugin, and 
the unit should be as small as possible (sec, bytes, ...)
   - thus the UOM is optional, even when no thresholds are defined 
(like --load1 to just graph load1)

Scientific notation is left out for now.

Ton and Thomas agree on dropping +-inf, since the colon implies them.
Nathan, however, thinks that ranges should always spell out both 
sides explicitly, meaning 10:inf rather than 10:
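
Both positions can coexist in a parser. The following Python sketch
assumes that an empty side of the colon means "unbounded" and that the
literal words inf and -inf are also accepted; parse_range is a
hypothetical name, and this is not the range grammar of the existing
plugins.

```python
import math


def parse_range(spec):
    """Parse 'low:high' into a (low, high) tuple of floats.

    Assumption for this sketch: an empty side is unbounded, so '10:'
    and '10:inf' are the same range.
    """
    low_text, sep, high_text = spec.partition(":")
    if not sep:
        raise ValueError("range needs a ':' separator: %r" % spec)
    low = float(low_text) if low_text else -math.inf
    high = float(high_text) if high_text else math.inf
    if low > high:
        raise ValueError("lower bound exceeds upper bound: %r" % spec)
    return low, high


implicit = parse_range("10:")
explicit = parse_range("10:inf")
open_low = parse_range(":5")
```

With this reading the explicit form costs nothing extra to support, so
requiring it would be a policy decision rather than a technical one.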

Andreas fears that command lines could become very complex and confuse 
users, but he has no better idea either.
Ton could imagine some helper functions (cmdline, web pages, google
calculator) to verify complex thresholds, and Andreas would like to see 
a way to shorten --freespace warn=inf:300KB
to --freespace w=inf:300KB.

Andreas also thinks that taking the simplicity away from the 
plugins/specs would remove one important advantage of nagios and that
Ton should be shot :D

Additionally, Ton (who should now fear the next nagios conference :) and 
Andreas state that compatibility should and will be retained, at least
for versions prior to 2.0.


Thomas suggested using getsubopt-style arguments like --metric 
min=2,uom_prefix=Ki,uom=b,.. which make it easier
to keep backward compatibility when introducing new values.
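
A rough Python sketch of how such a suboption string could be taken
apart. It only imitates the getsubopt style and is not the C
getsubopt() function; parse_subopts is a hypothetical name, and mapping
bare flags to True is an assumption of this sketch.

```python
def parse_subopts(arg):
    """Split 'k1=v1,k2=v2,flag' into a dict.

    Values stay strings; a bare word without '=' is stored as True.
    """
    opts = {}
    for item in arg.split(","):
        if not item:
            continue  # tolerate a trailing or doubled comma
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return opts


metric = parse_subopts("min=2,uom_prefix=Ki,uom=b")
```

Keys a plugin does not know yet can simply be ignored or rejected with a
clear message, which is where the backward compatibility comes from.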

Ton summarized that it all comes down to two things:
   - range definition
   - threshold definition

When it comes to ranges, there are two options: keeping the existing 
ranges using ':' or some math style like 1<=x<=3, which contains "quote 
me" characters (like Max proposed).
However, multiple styles *might* be possible and could be supported 
in parallel.

Thus the options for defining a threshold are (ignoring uom for the
moment):
    1) --threshold-time=crit_range/warn_range
    2) --threshold name=time,warn=range,crit=range
    3) --threshold=time -w range -c range


Thomas is thinking about something like
   --threshold name=cpu,type=warn,min=0,max=80,inside
which would require another separator if multiple ranges per metric 
should (possibly) be supported.

Andreas also noted that the warn/crit sanity check needs to differ 
depending on the plugin: sometimes w < c, sometimes w > c.
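
One way to express this in code is to let each plugin state the
direction of its metric. The function name and the higher_is_worse
parameter are invented for this Python sketch.

```python
def thresholds_consistent(warn, crit, higher_is_worse=True):
    """Return True if warn and crit are ordered sensibly for the metric.

    higher_is_worse=True  -> e.g. load: warning must come before critical.
    higher_is_worse=False -> e.g. free space: critical is the lower value.
    """
    if higher_is_worse:
        return warn <= crit
    return warn >= crit


load_ok = thresholds_consistent(5, 10)                  # w < c
freespace_ok = thresholds_consistent(300, 100, False)   # w > c
load_swapped = thresholds_consistent(10, 5)             # rejected
```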

Ton implemented a showcase of one possible approach in check_procs:
./check_procs -C cron --number=^1:1 --rss-threshold=0: \
    --vsz-threshold=0: --cpu-threshold=-1:
But currently the non-OK output doesn't state which threshold is 
actually exceeded.

Nathan pointed out that it is more intuitive to specify only ok and 
warning ranges. Everything outside them is critical, which Ton thinks 
is "brilliant". Something like:
   --size_ok=300:500b  --size_warn=500b:inf
or
   --size=ok(300:500b),warn(500b:inf)
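
Nathan's idea boils down to a very small decision function. The Python
sketch below assumes inclusive bounds and that the ok range wins where
ok and warn overlap (as at 500b in the example); evaluate is a
hypothetical name. The numeric values are the usual plugin return codes.

```python
import math

OK, WARNING, CRITICAL = 0, 1, 2  # usual plugin return codes


def evaluate(value, ok_range, warn_range):
    """Map a value to a state using only an ok and a warn range.

    Ranges are (low, high) tuples with inclusive bounds (assumption).
    Anything outside both ranges is critical.
    """
    ok_low, ok_high = ok_range
    warn_low, warn_high = warn_range
    if ok_low <= value <= ok_high:
        return OK
    if warn_low <= value <= warn_high:
        return WARNING
    return CRITICAL


# --size_ok=300:500b  --size_warn=500b:inf
size_ok = (300, 500)
size_warn = (500, math.inf)
states = [evaluate(v, size_ok, size_warn) for v in (400, 500, 600, 100)]
```

Here 400 and 500 are ok, 600 is a warning, and 100 falls outside both
ranges and is therefore critical.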


Nathan added that ':' could be replaced by '..' and that '/' could be 
used as a range separator:
   --time=ok/0..3/seconds
   --freespace=ok/300..inf/KB,warn/100..300/KB
   --load=ok/0..2,0..1.5,0..1.2/

--End of summary

So, to me, there are several open questions:

Key questions:
- Must the threshold specification argument be valid without quoting?
- Is it necessary to allow multiple ranges per threshold, e.g.
   warn=10:20,50:60?
- Should thresholds be defined ok/warn rather than warn/crit?
- Should plugins only print perfdata for explicitly selected metrics
   or should there be a base set?
- Should there be an explicit range limit (10:inf over 10:)?
- Is it favorable to have multiple range styles like
   1<x<10 *and* 1:10 *and* ... in parallel?

Further questions:
- Should perfdata inherit the threshold's uom/prefix?
- Should the range separator ':' be replaced with '..'?
- Which component is responsible for sanity checking of thresholds?
- Should base8 UOM-prefixes be allowed?


I'll post my thoughts later on.

Hope this is useful.

Matthias



