From c6d1c24d8c06da644931851ba8b0ffbd50c02d16 Mon Sep 17 00:00:00 2001
From: Holger Weiss
Date: Mon, 28 Oct 2013 22:25:31 +0100
Subject: new-threshold-syntax.md: Various cosmetic changes

Fix a few small typos and apply various cosmetic changes.

diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md
index c3eb8b7..4cf8cf6 100644
--- a/web/input/doc/new-threshold-syntax.md
+++ b/web/input/doc/new-threshold-syntax.md
@@ -11,18 +11,18 @@ _Ton Voon, March 17, 2008_
 ## Overview
 
 The method for defining thresholds via the command line is inconsistent and
-difficult to interpret. This proposal suggests a different way of specifying
-thresholds, which will also changes the metrics of performance data returned.
+difficult to interpret. This proposal suggests a different way of specifying
+thresholds, which will also change the metrics of performance data returned.
 
 ## Problem
 
 The current method of specifying thresholds is confusing when there are
-different checks required. For instance, in check\_http, to check page size
-and time, you can specify -w {warn time}, -c {crit time}, -m
-{minpagesize}[:maxpagesize], -M {maxage of document}.
+different checks required. For instance, in `check_http`, to check page size
+and time, you can specify `-w {warn time}`, `-c {crit time}`,
+`-m {minpagesize}[:maxpagesize]`, `-M {maxage of document}`.
 
-Also, note the ways of defining the range are inconsistent. Some alert above
-the value (time, maxage), some alert below the value (pagesize). This is
+Also, note the ways of defining the range are inconsistent. Some alert above
+the value (time, maxage), some alert below the value (pagesize). This is
 inconsistent for the same plugin!
 
 So, to check that a web page is returned within 5 seconds, the minimum page
@@ -34,7 +34,7 @@ Furthermore, the current specification for ranges in the developer guidelines
 fails the “obviousness” test: a range of 3:5 will alert if the value is
 outside that range, rather than inside as you would expect.
 
-Also, the performance data returned by check\_http is always time and size.
+Also, the performance data returned by `check_http` is always time and size.
 Perhaps you want only time, or you want age as well.
 
 ## Proposal
@@ -52,42 +52,42 @@ The threshold definition is a subgetopt format of the form:
 
 Where:
 
-- ok, warn, crit are called “levels”
-- any of ok, warn, crit, unit or prefix are optional
-- if ok, warning and critical are not specified, then no alert is raised,
-  but the performance data will be returned
-- the unit can be specified with plugins that do not know about the type of
-  value returned (SNMP, Windows performance counters, etc.)
-- the prefix is used to multiply the input range and possibly for display
-  data. The prefixes allowed are defined by NIST:
-
-- ok, warning or critical can be repeated to define an additional range.
-  This allows non-continuous ranges to be defined
-- warning can be abbreviated to warn or w
-- critical can be abbreviated to crit or c
+- `ok`, `warn`, `crit` are called “levels”
+- any of `ok`, `warn`, `crit`, `unit` or `prefix` are optional
+- if `ok`, `warning` and `critical` are not specified, then no alert is
+  raised, but the performance data will be returned
+- the `unit` can be specified with plugins that do not know about the type of
+  value returned (SNMP, Windows performance counters, etc.)
+- the `prefix` is used to multiply the input range and possibly for display
+  data. The prefixes allowed are defined by NIST:
+
+- `ok`, `warning` or `critical` can be repeated to define an additional range.
+  This allows non-continuous ranges to be defined
+- `warning` can be abbreviated to `warn` or `w`
+- `critical` can be abbreviated to `crit` or `c`
 
 ### Simple Range
 
-The range values have two specifications: simple and complex. Simple ranges
+The range values have two specifications: simple and complex. Simple ranges
 are of the format:
 
     start..end
 
 Where:
 
-- start and end must be defined
-- start and end match the regular expression
-  /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”)
-- start ≤ end
-- if start = “inf”, this is negative infinity. This can also be written as
-  “-inf”
-- if end = “inf”, this is positive infinity
-- endpoints are inclusive of the range
-- alert is raised if value is inside start and end range
+- `start` and `end` must be defined
+- `start` and `end` match the regular expression
+  `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”)
+- `start ≤ end`
+- if `start` = `inf`, this is negative infinity. This can also be written as
+  `-inf`
+- if `end` = `inf`, this is positive infinity
+- endpoints are inclusive of the range
+- alert is raised if value is inside `start` and `end` range
 
 (Note: this may be extended in future for adding multiple ranges using a
-separator - I think this is catered for by repeating ok=,warn=,crit=.)
+separator - I think this is catered for by repeating `ok=,warn=,crit=`.)
 
 This simple range does not require quoting at the shell.
 
@@ -103,17 +103,17 @@ or
 
 Where:
 
-- start and end must be defined
-- start and end match the regular expression
-  /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”)
-- start ≤ end
-- if start = “inf”, this is negative infinity. This can also be written as
-  “-inf”
-- if end = “inf”, this is positive infinity
-- endpoints are excluded from the range if () are used, otherwise endpoints
-  are included in the range
-- alert is raised if value is within start and end range, unless \^ is used,
-  in which case alert is raised if outside the range
+- `start` and `end` must be defined
+- `start` and `end` match the regular expression
+  `/\^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”)
+- `start` ≤ `end`
+- if `start` = `inf`, this is negative infinity. This can also be written as
+  `-inf`
+- if `end` = `inf`, this is positive infinity
+- endpoints are excluded from the range if () are used, otherwise endpoints
+  are included in the range
+- alert is raised if value is within `start` and `end` range, unless `^` is
+  used, in which case alert is raised if outside the range
 
 Note that due to shell characters, quoting may be required.
 
@@ -122,17 +122,18 @@ Note that due to shell characters, quoting may be required.
 Given a numeric value, the state of the threshold is calculated from the
 following ordered rules:
 
-1. If no levels are specified, return OK
-2. If an ok level is specified and value is within range, return OK
-3. If a critical level is specified and value is within range, return
-   CRITICAL
-4. If a warning level is specified and value is within range, return WARNING
-5. If an ok level is specified, return CRITICAL
-6. Otherwise return OK
+1. If no levels are specified, return `OK`
+2. If an `ok` level is specified and value is within range, return `OK`
+3. If a `critical` level is specified and value is within range, return
+   `CRITICAL`
+4. If a `warning` level is specified and value is within range, return
+   `WARNING`
+5. If an `ok` level is specified, return `CRITICAL`
+6. Otherwise return `OK`
 
 ### Looking Back …
 
-So the check\_http example becomes:
+So the `check_http` example becomes:
 
     check_http -H $HOSTADDRESS$ \
         --th metric=time,ok=0..5 \
@@ -144,26 +145,26 @@ age) and more consistent (I’m alerting above 5, less than 10 and above 1,
 respectively).
 
 In addition, performance data will only be output if the metric has been
-specified. So only show time performance data if “--th metric=time” has been
-specified on the command line. Both warning\_range or critical\_range can be
+specified. So only show time performance data if `--th metric=time` has been
+specified on the command line. Both warning range or critical range can be
 unspecified - this effectively means “I am not going to alert on this value,
 but I’d like to be informed about it in the performance data”.
 
 Because the specification for a range has changed, the warning and critical
-parts of the performance data can no longer be guaranteed. There is an
+parts of the performance data can no longer be guaranteed. There is an
 additional piece of work required to fix a new format for performance data.
 However, the basic
 
     label=value[uom]
 
-Will still be valid.
+will still be valid.
 
 ### Examples
 
 Other examples.
 
-To check httpd processes are OK if the virtual size is under 8096 bytes. Warn
-until they reach 16182, but bigger than that is CRITICAL.
+To check httpd processes are `OK` if the virtual size is under 8096 bytes.
+Warn until they reach 16182, but bigger than that is `CRITICAL`.
 
     # old
     check_procs -w 8096 -c 16182 -C httpd --metric VSZ
@@ -171,8 +172,8 @@ until they reach 16182, but bigger than that is CRITICAL.
     # new
     check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182
 
-There should always be one and only one ‘tnslsnr’ process. Otherwise
-critical.
+There should always be one and only one ‘tnslsnr’ process. Otherwise
+`CRITICAL`.
 
     # old
     check_procs -w 1:1 -c 1:1 -C tnslsnr
@@ -192,33 +193,33 @@ Load averages (1,5,15 minute) should be within reasonable ranges.
 
 ## Plan
 
-I personally plan on updating check\_procs.
+I personally plan on updating `check_procs`.
 
 The basic syntax is:
 
     check_procs [filter options] [threshold options]
 
-Where filter options are the current -u {username}, -C {command}, etc. This
-reduces the set of processes that are to be calculated.
+Where filter options are the current `-u {username}`, `-C {command}`, etc.
+This reduces the set of processes that are to be calculated.
 
 The new threshold metrics will be:
 
-- number - alert on number of matching processes. Performance data returns
-  number of processes
-- rss-threshold - alert on rss size if any matching process is in range.
-  Perf data returns average rss
-- rss-max - Same as --rss, but perf data returns max rss
-- rss-sum - alert on the total rss of all matching processes. Perf data
-  returns rss\_sum
-- vsz-threshold - alert on vsz size if any matching process is in range.
-  Perf data returns average vsz
-- vsz-max - Same as --vsz, but perf data returns max rss
-- vsz-sum - alert on the total vsz of all matching processes. Perf data
-  returns vsz\_sum
-- cpu-threshold - alert on cpu % of all matching processes. Perf data
-  returns average cpu
-- cpu-max - Same as --cpu, but perf data returns max cpu
-- cpu-sum - alert on total cpu. Perf data returns cpu\_sum
+- number - alert on number of matching processes. Performance data returns
+  number of processes
+- rss-threshold - alert on rss size if any matching process is in range. Perf
+  data returns average rss
+- rss-max - Same as `--rss`, but perf data returns max rss
+- rss-sum - alert on the total rss of all matching processes. Perf data
+  returns rss\_sum
+- vsz-threshold - alert on vsz size if any matching process is in range. Perf
+  data returns average vsz
+- vsz-max - Same as `--vsz`, but perf data returns max rss
+- vsz-sum - alert on the total vsz of all matching processes. Perf data
+  returns vsz\_sum
+- cpu-threshold - alert on cpu % of all matching processes. Perf data returns
+  average cpu
+- cpu-max - Same as `--cpu`, but perf data returns max cpu
+- cpu-sum - alert on total cpu. Perf data returns cpu\_sum
 
 There will be C library routines for parsing the threshold values.
 
@@ -228,16 +229,16 @@ performance data.
 ## Terminology
 
 **metric**
-: Something that a check is going to be measured against. For example, for
-  disk checks, it could be used or free or inodes\_free; for http checks, it
-  could be time [taken] or size; for process checks, it could be cpu or
-  number [of processes] or vsz
+: Something that a check is going to be measured against. For example, for
+  disk checks, it could be used or free or inodes\_free; for HTTP checks, it
+  could be time taken or size; for process checks, it could be cpu or
+  number of processes or vsz
 
 **range**
 : This defines a continuous range of values when an alert would be raised
 
 **level**
-: This is an alert level within Nagios - OK, WARNING or CRITICAL
+: This is an alert level within Nagios - `OK`, `WARNING` or `CRITICAL`
 
 **threshold**
 : This consists of a level with a range
@@ -246,7 +247,7 @@ performance data.
 
 This assumes that you are always comparing numbers as the metric values.
 
-There maybe some limitations in the precision of values. All internal logic
+There maybe some limitations in the precision of values. All internal logic
 should use double precision.
 
 If there are multiple metrics, the alert will be on an OR basis, that is, any
-- 
cgit v0.10-9-g596f
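
Reviewer's note, not part of the commit: the simple `start..end` range format and the six ordered rules for calculating the threshold state, as described in the patched document, can be sketched in Python to sanity-check the examples. This is only an illustration under stated assumptions. The function names `parse_simple_range` and `evaluate` are hypothetical, only the simple (inclusive) range form is handled, and it is not the C library the proposal mentions.

```python
import math
import re

# Endpoint pattern quoted in the proposal: a number or the literal "inf".
ENDPOINT = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$|^inf$")


def _endpoint(token, is_start):
    """Convert one endpoint token to a float (hypothetical helper)."""
    # The proposal says a start of "inf" may also be written "-inf".
    if is_start and token == "-inf":
        return -math.inf
    if not ENDPOINT.match(token):
        raise ValueError(f"invalid range endpoint: {token!r}")
    if token == "inf":
        # "inf" means negative infinity as start, positive infinity as end.
        return -math.inf if is_start else math.inf
    return float(token)


def parse_simple_range(text):
    """Parse 'start..end' into an inclusive (start, end) tuple."""
    start_s, sep, end_s = text.partition("..")
    if not sep or not start_s or not end_s:
        raise ValueError("both start and end must be defined")
    start, end = _endpoint(start_s, True), _endpoint(end_s, False)
    if start > end:
        raise ValueError("start must be <= end")
    return (start, end)


def _in_any(value, ranges):
    """True if value lies inside any inclusive (start, end) range."""
    return any(lo <= value <= hi for lo, hi in ranges)


def evaluate(value, ok=(), warning=(), critical=()):
    """Apply the six ordered rules; each level is a list of ranges."""
    if not (ok or warning or critical):
        return "OK"                      # rule 1: no levels specified
    if ok and _in_any(value, ok):
        return "OK"                      # rule 2
    if critical and _in_any(value, critical):
        return "CRITICAL"                # rule 3
    if warning and _in_any(value, warning):
        return "WARNING"                 # rule 4
    if ok:
        return "CRITICAL"                # rule 5: ok given but not matched
    return "OK"                          # rule 6


# The check_procs example: ok=0..8096,warn=8097..16182
ok = [parse_simple_range("0..8096")]
warn = [parse_simple_range("8097..16182")]
print(evaluate(5000, ok=ok, warning=warn))
print(evaluate(10000, ok=ok, warning=warn))
print(evaluate(20000, ok=ok, warning=warn))
```

With the `check_procs` thresholds above, a value inside the `ok` range matches rule 2, a value inside the `warn` range matches rule 4, and a larger value falls through to rule 5, which is how the example gets `CRITICAL` without an explicit `crit` range.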