[Nagiosplug-devel] RFC: new style command arguments for thresholds

Andreas Ericsson ae at op5.se
Tue Jan 16 14:34:55 CET 2007


Ton Voon wrote:
> Hi!
> 
> I'm canvassing opinions for this change to the developer guidelines re: 
> command arguments to thresholds. I first brought this up at the Nagios 
> Conference in Germany 
> (http://www.netways.de/de/nagios_konferenz/archiv_2006/programm/nagios_plugins/), 
> but want to make sure there is a consensus in this mailing list.
> 
> 
> BACKGROUND
> 
> There are three main problems:
> 
> 1) when you have a check that wants to check multiple "things", the 
> syntax is confusing. For example, free disk space in check_disk is -w/-c 
> (in units or percent), but inode checking is -W/-K. In check_http, -w/-c 
> is for time taken, -m is for page size. This is inconsistent and not 
> very readable
> 
> 2) the output and performance data are inconsistent with what is being 
> checked. For instance, if I check my disks for inodes, I don't 
> necessarily want perf data returned about disk free. This clogs up my 
> graphs and muddies my output
> 
> 3) I've started using common routines for threshold parsing and found 
> that the way that parsing occurs between plugins is inconsistent. For 
> instance, check_procs -c 1:1 means "critical if not 1 process". However, 
> check_disk -c 5% means "critical if between 0 and 5%". Worse, the 
> guidelines define ranges so that the default is to alert outside a 
> range, which looks wrong.
> 
> I tried this test on the audience at the Nagios Conference. Given a 
> command 'check_stuff -w 30:50 -c 10:30' where the result of "stuff" is 
> 15, what alert level is raised?
> 
> Go on, have a guess!
> 
> The answer is Warning. I had two guesses of "Critical" from the crowd, 
> and I think this is because you immediately assume an alert **within** 
> the range, not outside. I think this needs fixing.
> 

I disagree, but see below.
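
To make the quiz above concrete, here is a minimal Python sketch of the rule as the current guidelines define it (alert when the value falls outside the range). It is a simplification, not code from any plugin: the function names are made up for illustration, and the '@' and '~' forms from the guidelines are deliberately left out.

```python
import math

def parse_range(spec):
    """Parse 'min:max' into (lo, hi). Simplified: no '@' or '~' handling."""
    lo_text, sep, hi_text = spec.partition(":")
    if not sep:
        # A bare number N is shorthand for 0:N.
        return 0.0, float(lo_text)
    lo = float(lo_text) if lo_text else 0.0
    hi = float(hi_text) if hi_text else math.inf
    return lo, hi

def outside(value, spec):
    """True if value lies outside the (inclusive) range."""
    lo, hi = parse_range(spec)
    return value < lo or value > hi

def legacy_state(value, warn, crit):
    """Critical wins over warning; both alert OUTSIDE their range."""
    if outside(value, crit):
        return "CRITICAL"
    if outside(value, warn):
        return "WARNING"
    return "OK"

# check_stuff -w 30:50 -c 10:30 with a result of 15
print(legacy_state(15, "30:50", "10:30"))  # WARNING
```

15 is inside 10:30 (so not critical) but outside 30:50 (so warning), which matches the answer given in the quoted text.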

> 
> 
> PROPOSAL
> 
> So my proposal is to have a different, but complementary, method of 
> specifying thresholds:
> 
> --metric=crit/warn
> 


Using '/' for argument separation makes it look completely insane when 
dealing with numerical thresholds. I'd much rather just abuse the common 
comma instead, and put warning before critical on the basis that it 
feels more right if the lower value comes first (which it will most of 
the time, I suspect).

Otherwise it's a nice idea.
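
As a rough illustration of the two orderings under discussion, the following Python sketch splits a threshold argument either as crit/warn (the proposal) or as warn,crit (the comma variant suggested here). The helper name and its parameters are hypothetical, and blank parts are mapped to None on the assumption that a blank means "no threshold for that level".

```python
def split_thresholds(spec, sep="/", crit_first=True):
    """Split a threshold argument into (warn, crit).

    Blank or missing parts become None. With crit_first=True the
    argument is read as 'crit<sep>warn', otherwise as 'warn<sep>crit'.
    """
    first, _, second = spec.partition(sep)
    first = first or None
    second = second or None
    return (second, first) if crit_first else (first, second)

# Proposed form: --page_size=60K/40K  (crit first)
print(split_thresholds("60K/40K"))                              # ('40K', '60K')
# Comma variant: warning first
print(split_thresholds("40K,60K", sep=",", crit_first=False))   # ('40K', '60K')
```

Both calls yield the same (warn, crit) pair; only the spelling on the command line differs.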

> The crit and warn ranges are defined as min:max (max is optional, 
> defaults to +infinity). Alert if the checked value is inside this range. 
> If you want to alert on the outside of this range, prefix the range with 
> a caret sign (^).
> 

Swapping the meaning of ranges around is just pure daft, because whether 
you want the value to be inside or outside the range is all down to 
context. When checking process counts, temperature or humidity, you want 
the measured metric to be inside the ranges to be considered OK. I'm not 
really sure what kind of measured value you'd want to be outside the 
ranges to be considered OK, so this only holds true when you look at the 
argument without knowing what you're checking. It *might* be considered 
a bug that the warning range doesn't necessarily have to be "inside" the 
critical range, but that's just an implementation detail.

> Crit or warn can be blank, meaning no alert to be specified for that 
> alert level.
> 
> If the metric is specified, then output + perfdata will reflect it. Eg, 
> check_http --page_size=60K/40K --document_age=5s/3s will give output of 
> the document age and the page size, but not the certificate age or the 
> time taken. If you want output and perfdata without checking the result, 
> specify the metric without any values, eg check_http --certificate_age.
> 
> I think the metric name should be composed of alphanumerics and 
> underscore only, so it can map to RRD names. If there is a many-to-many 
> mapping (eg, check_disk, looking at per mountpoint), use a key prefixed 
> at the beginning with a separating colon, eg check_disk --disk_free=2GB 
> --inode_used=/0:500 -p / -p /var would have perf output of:
> 
> /:disk_free=1.3GB;;2 /:inode_used=433;0:500; /var:disk_free=0.7GB;;2 
> /var:inode_used=700;0:500;
> 
> Whatever processes the perf data can decide how to use the prefix (save 
> to a separate RRD?).
> 

Sounds sensible. Personally, I'd tack a massaged version of the 
disk-partition (or mountpoint) to the metric in the performance-data, so 
the above would be something like

/_disk_free=...  /_inode_used=....   /var_disk_free=...

This way it can still be used in a single RRD-file without the parser 
having to modify it (I'm assuming RRD can handle / here).
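
The two labelling styles can be put side by side with a small Python sketch. The function name and the joiner parameter are made up for illustration. The field order value;warn;crit follows the existing perfdata convention, and no further "massaging" of the mountpoint is attempted, since the example above keeps the / as it is.

```python
def perf_entry(key, metric, value, warn="", crit="", joiner=":"):
    """Format one perfdata entry as '<key><joiner><metric>=value;warn;crit'."""
    return f"{key}{joiner}{metric}={value};{warn};{crit}"

# Colon-prefixed key, as in the quoted proposal
print(perf_entry("/", "inode_used", 433, warn="0:500"))
# /:inode_used=433;0:500;

# Mountpoint folded into the metric name with an underscore
print(perf_entry("/var", "disk_free", "0.7GB", crit="2", joiner="_"))
# /var_disk_free=0.7GB;;2
```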

> 
> 
> COMPLICATIONS
> 
> As this is a new command syntax I can see this being acceptable, as long 
> as the old syntax still works correctly. However, the performance data 
> part will be a problem to current parsers since I'd like to redefine the 
> meaning of warn and crit.
> 

So long as things stay backwards compatible I have no objections. I'm 
not sure what you mean by "re-defining the meaning of warn and crit", as 
you haven't mentioned anything about it earlier.


> One option is that the new perf data is output in XML format. This 
> might help with structural changes in future. This also ties in with a 
> request from Gerd Muller of Netways at NagConf where he wanted some 
> metadata re: the plugin to be available (name=check_disk version=1.80).
> 
> 
> Any opinions?
> 

XML output is not a good idea. <> sequences passed to the shell will 
play merry hell with people's filesystems and result in seriously strange 
errors. Even though XML is supposed to be human-readable, it really 
isn't. There's no reason why plugin_name and plugin_version can't be in 
the perf-data output as it is today. Just reserve the plugin_ prefix for 
meta-data in performance output.
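
A perfdata consumer could separate such meta-data from real metrics along these lines. This Python sketch is only an assumption about how a parser might treat a reserved plugin_ prefix: the function name is invented, and labels containing spaces (which would need quoting) are not handled.

```python
def split_perfdata(perfdata, meta_prefix="plugin_"):
    """Split a perfdata string into (metrics, meta) dictionaries.

    Entries whose label starts with meta_prefix are treated as
    meta-data; the prefix is stripped from their keys.
    Assumes labels contain no spaces.
    """
    metrics, meta = {}, {}
    for token in perfdata.split():
        label, _, rest = token.partition("=")
        if label.startswith(meta_prefix):
            meta[label[len(meta_prefix):]] = rest
        else:
            metrics[label] = rest
    return metrics, meta

metrics, meta = split_perfdata(
    "plugin_name=check_disk plugin_version=1.80 /_disk_free=1.3GB;;2"
)
print(meta)     # {'name': 'check_disk', 'version': '1.80'}
print(metrics)  # {'/_disk_free': '1.3GB;;2'}
```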

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231



