[Nagiosplug-devel] RFC: new style command arguments for thresholds

Gavin Carr gavin at openfusion.com.au
Tue Jan 16 03:53:12 CET 2007


Hi Ton,

On Fri, Jan 12, 2007 at 02:52:46PM +0000, Ton Voon wrote:
> There are three main problems:
> 
> 1) when you have a check that wants to check multiple "things", the
> syntax is confusing. For example, free disk space in check_disk is
> -w/-c (in units or percent), but inode checking is -W/-K. In
> check_http, -w/-c is for time taken, -m is for page size. This is
> inconsistent and not very readable.

Agreed.

> 2) the output and performance data are inconsistent with what is being
> checked. For instance, if I check my disks for inodes, I don't
> necessarily want perf data returned about disk free. This clogs up my
> graphs and muddies my output.
> 
> 3) I've started using common routines for threshold parsing and found  
> that the way that parsing occurs between plugins is inconsistent. For  
> instance, check_procs -c 1:1 means "critical if not 1 process".  
> However, check_disk -c 5% means "critical if between 0 and 5%".  
> Worse, the guidelines define ranges so that the default is to
> alert outside a range, which looks wrong.
> 
> I did this test to the audience at the Nagios Conference. Given a  
> command 'check_stuff -w 30:50 -c 10:30' where the result of "stuff"  
> is 15, what is the alert level raised?
> 
> Go on, have a guess!
> 
> The answer is Warning. I had two guesses of "Critical" by the crowd  
> and I think this is because you immediately assume an alert  
> **within** the range, not outside. I think this needs fixing.

I think you can argue it both ways. I agree it's confusing that
different plugins are inconsistent in how they treat ranges; OTOH it
seems to me that the most natural default does differ between
plugins: with something like 'check_procs', alerting outside the
range (exclusive) seems more useful than alerting inside it (inclusive).

I guess that just means people would need to get used to making that
explicit with a caret, but I'm a bit ambivalent on this one.
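
For anyone who wants to check the quiz answer above, here is a rough
Python sketch of the current guideline semantics (alert when the value
falls outside the range, or inside it with a leading '@'). The function
names are mine and purely illustrative, not taken from any plugin, and
I'm assuming inclusive endpoints:

```python
import math

def parse_legacy_range(spec):
    """Parse a current-guidelines range: 'max', 'min:', 'min:max', '~:max'.

    A leading '@' means "alert inside the range" instead of outside.
    Returns (lo, hi, alert_inside).
    """
    alert_inside = spec.startswith("@")
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        lo_s, hi_s = spec.split(":", 1)
    else:
        lo_s, hi_s = "0", spec          # bare 'N' means 0..N
    if lo_s == "~":
        lo = -math.inf                  # '~' stands for negative infinity
    elif lo_s == "":
        lo = 0.0
    else:
        lo = float(lo_s)
    hi = float(hi_s) if hi_s else math.inf
    return lo, hi, alert_inside

def legacy_alerts(value, spec):
    lo, hi, alert_inside = parse_legacy_range(spec)
    in_range = lo <= value <= hi
    return in_range if alert_inside else not in_range

def legacy_status(value, warn, crit):
    # Critical is tested first, then warning.
    if legacy_alerts(value, crit):
        return "CRITICAL"
    if legacy_alerts(value, warn):
        return "WARNING"
    return "OK"

# check_stuff -w 30:50 -c 10:30 with a result of 15
print(legacy_status(15, "30:50", "10:30"))  # prints WARNING
```

15 sits inside 10:30, so no critical; it sits outside 30:50, so warning.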

> PROPOSAL
> 
> So my proposal is to have a different, but complementary, method of  
> specifying thresholds:
> 
> --metric=crit/warn
> 
> The crit and warn ranges are defined as min:max (max is optional,  
> defaults to +infinity). Alert if the checked value is inside this  
> range. If you want to alert on the outside of this range, prefix the
> range with a caret sign (^).
> 
> Crit or warn can be blank, meaning no alert to be specified for that  
> alert level.
> 
> If the metric is specified, then output + perfdata will reflect it. Eg,
> check_http --page_size=60K/40K --document_age=5s/3s will give output
> of the document age and the page size, but not the certificate age or
> the time taken. If you want output and perfdata without checking the
> result, specify the metric without any values, eg
> check_http --certificate_age.
> 
> I think the metric name should be composed of alphanumerics and  
> underscore only, so it can map to RRD names. 

Yech. Please allow hyphens in the arg names, and map them to 
underscores in the perf data. I think --page_size looks silly compared
to --page-size - what do others think?
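
To make that concrete, here is a minimal sketch of how I'd picture the
parsing: hyphens accepted on the command line and mapped to underscores
for the perf data name, then crit/warn split and each range read as
min:max with an optional leading caret. All the names here are
hypothetical. I'm also assuming inclusive endpoints and plain numbers;
unit suffixes like 60K or 5s would need handling that the proposal
doesn't spell out yet:

```python
import math

def parse_new_range(spec):
    """Parse a proposed-style range: 'min', 'min:' or 'min:max', optional '^'.

    Returns None for a blank spec (no alert at that level),
    otherwise (lo, hi, alert_outside).
    """
    if spec == "":
        return None
    alert_outside = spec.startswith("^")
    if alert_outside:
        spec = spec[1:]
    lo_s, _, hi_s = spec.partition(":")
    lo = float(lo_s)
    hi = float(hi_s) if hi_s else math.inf   # max defaults to +infinity
    return lo, hi, alert_outside

def alerts(value, rng):
    if rng is None:
        return False
    lo, hi, alert_outside = rng
    inside = lo <= value <= hi
    return not inside if alert_outside else inside

def parse_metric_arg(arg):
    """Split '--some-metric=crit/warn' into (perf_name, crit, warn)."""
    name, _, spec = arg.lstrip("-").partition("=")
    crit_s, _, warn_s = spec.partition("/")
    perf_name = name.replace("-", "_")       # hyphens become underscores
    return perf_name, parse_new_range(crit_s), parse_new_range(warn_s)

def status(value, crit, warn):
    if alerts(value, crit):
        return "CRITICAL"
    if alerts(value, warn):
        return "WARNING"
    return "OK"

name, crit, warn = parse_metric_arg("--page-size=60000/40000")
print(name, status(50000, crit, warn))       # prints page_size WARNING
```

A bare --certificate-age comes back with both ranges as None, which
matches the "report it but don't check it" case in the proposal.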

> If there is a many-to-many
> mapping (eg, check_disk, looking at per mountpoint), use a key
> prefixed at the beginning with a separating colon, eg
> check_disk --disk_free=2GB --inode_used=/0:500 -p / -p /var
> would have perf output of:
> 
> /:disk_free=1.3GB;;2 /:inode_used=433;0:500;
> /var:disk_free=0.7GB;;2 /var:inode_used=700;0:500;
> 
> Whatever processes the perf data can decide how to use the prefix  
> (save to a separate RRD?).
> 
> 
> 
> COMPLICATIONS
> 
> As this is a new command syntax I can see this being acceptable, as  
> long as the old syntax still works correctly. However, the  
> performance data part will be a problem to current parsers since I'd  
> like to redefine the meaning of warn and crit.
> 
> One option is that the new perf data is output in XML format. This
> might help with structural changes in future. This also ties in with
> a request from Gerd Muller of Netways at NagConf where he wanted some
> metadata re: the plugin to be available (name=check_disk version=1.80).

+1 on structured rather than parsed. Though I'm also less than keen on 
XML.  Maybe YAML or JSON or something lighter instead?
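
As a purely illustrative sketch of what I mean, this takes the
check_disk perf output from your example and turns it into a JSON
structure keyed by mountpoint. The field names ("value", "warn",
"crit") and the value;warn;crit ordering are my assumptions, not a
proposed format, and splitting on whitespace would obviously break on
mountpoints containing spaces:

```python
import json

PERF = ("/:disk_free=1.3GB;;2 /:inode_used=433;0:500; "
        "/var:disk_free=0.7GB;;2 /var:inode_used=700;0:500;")

def perf_to_dict(perf):
    """Group 'key:metric=value;warn;crit' items by their key prefix."""
    out = {}
    for item in perf.split():
        label, _, data = item.partition("=")
        # The key (here a mountpoint) is everything before the last colon.
        key, _, metric = label.rpartition(":")
        # Pad so that missing trailing fields come out as empty strings.
        value, warn, crit = (data.split(";") + ["", ""])[:3]
        out.setdefault(key, {})[metric] = {
            "value": value,
            "warn": warn or None,
            "crit": crit or None,
        }
    return out

print(json.dumps(perf_to_dict(PERF), indent=2, sort_keys=True))
```

Something like this would also leave room for the plugin metadata
(name, version) as extra top-level keys.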

My 2c Aussie (from linux.conf.au 2007).

Cheers
Gavin




