Diffstat (limited to 'web')
| -rw-r--r-- | web/input/doc/new-threshold-syntax.md | 256 |
1 files changed, 256 insertions, 0 deletions
diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md
new file mode 100644
index 0000000..c3eb8b7
--- /dev/null
+++ b/web/input/doc/new-threshold-syntax.md
| @@ -0,0 +1,256 @@ | |||
| 1 | title: New Threshold Syntax | ||
| 2 | parent: Documentation | ||
| 3 | --- | ||
| 4 | |||
| 5 | <!--% # Auto-imported from: http://nagiosplugins.org/rfc/new_threshold_syntax # %--> | ||
| 6 | |||
| 7 | # New Specification Method for Thresholds | ||
| 8 | |||
| 9 | _Ton Voon, March 17, 2008_ | ||
| 10 | |||
| 11 | ## Overview | ||
| 12 | |||
| 13 | The method for defining thresholds via the command line is inconsistent and | ||
| 14 | difficult to interpret. This proposal suggests a different way of specifying | ||
| 15 | thresholds, which will also change the metrics of performance data returned. | ||
| 16 | |||
| 17 | ## Problem | ||
| 18 | |||
| 19 | The current method of specifying thresholds is confusing when there are | ||
| 20 | different checks required. For instance, in check\_http, to check page size | ||
| 21 | and time, you can specify -w {warn time}, -c {crit time}, -m | ||
| 22 | {minpagesize}[:maxpagesize], -M {maxage of document}. | ||
| 23 | |||
| 24 | Also, note the ways of defining the range are inconsistent. Some alert above | ||
| 25 | the value (time, maxage), some alert below the value (pagesize). This is | ||
| 26 | inconsistent for the same plugin! | ||
| 27 | |||
| 28 | So, to check that a web page is returned within 5 seconds, the minimum page | ||
| 29 | size is 10K and the maximum age is 1 day, you would invoke: | ||
| 30 | |||
| 31 | check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d | ||
| 32 | |||
| 33 | Furthermore, the current specification for ranges in the developer guidelines | ||
| 34 | fails the “obviousness” test: a range of 3:5 will alert if the value is | ||
| 35 | outside that range, rather than inside as you would expect. | ||
| 36 | |||
| 37 | Also, the performance data returned by check\_http is always time and size. | ||
| 38 | Perhaps you want only time, or you want age as well. | ||
| 39 | |||
| 40 | ## Proposal | ||
| 41 | |||
| 42 | ### Thresholds | ||
| 43 | |||
| 44 | This document proposes that threshold arguments are specified like: | ||
| 45 | |||
| 46 | --threshold={threshold definition} | ||
| 47 | --th={threshold definition} | ||
| 48 | |||
| 49 | The threshold definition is a subgetopt format of the form: | ||
| 50 | |||
| 51 | metric={metric},ok={range},warn={range},crit={range},unit={unit},prefix={SI prefix} | ||
| 52 | |||
| 53 | Where: | ||
| 54 | |||
| 55 | - ok, warn, crit are called “levels” | ||
| 56 | - any of ok, warn, crit, unit or prefix are optional | ||
| 57 | - if ok, warning and critical are not specified, then no alert is raised, | ||
| 58 | but the performance data will be returned | ||
| 59 | - the unit can be specified with plugins that do not know about the type of | ||
| 60 | value returned (SNMP, Windows performance counters, etc.) | ||
| 61 | - the prefix is used to multiply the input range and possibly for display | ||
| 62 | data. The prefixes allowed are defined by NIST: | ||
| 63 | <http://physics.nist.gov/cuu/Units/prefixes.html> | ||
| 64 | <http://physics.nist.gov/cuu/Units/binary.html> | ||
| 65 | - ok, warning or critical can be repeated to define an additional range. | ||
| 66 | This allows non-continuous ranges to be defined | ||
| 67 | - warning can be abbreviated to warn or w | ||
| 68 | - critical can be abbreviated to crit or c | ||
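
The threshold definition described above could be split into its parts roughly as follows. This is only a sketch in Python, not the planned C routines: the function name `parse_threshold_definition`, the returned dictionary layout, and the choice to treat `metric` as mandatory are all assumptions made for illustration.

```python
# Hypothetical helper: split a --threshold definition into its parts.
# Level names follow the abbreviations listed above.
LEVEL_ALIASES = {
    "ok": "ok",
    "warning": "warning", "warn": "warning", "w": "warning",
    "critical": "critical", "crit": "critical", "c": "critical",
}


def parse_threshold_definition(definition):
    """Return metric, unit, prefix and a list of range strings per level."""
    parsed = {"metric": None, "unit": None, "prefix": None,
              "ok": [], "warning": [], "critical": []}
    for item in definition.split(","):
        key, sep, value = item.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got {item!r}")
        if key in LEVEL_ALIASES:
            # Levels may be repeated, so every range is kept in order.
            parsed[LEVEL_ALIASES[key]].append(value)
        elif key in ("metric", "unit", "prefix"):
            parsed[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    if parsed["metric"] is None:
        # Assumption: metric is the one key that cannot be omitted.
        raise ValueError("metric is required")
    return parsed
```

For example, `parse_threshold_definition("metric=size,ok=10..inf,prefix=Ki")` yields a dictionary whose `ok` entry is the one-element list `["10..inf"]`; the range strings themselves are left unparsed here.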
| 69 | |||
| 70 | ### Simple Range | ||
| 71 | |||
| 72 | The range values have two specifications: simple and complex. Simple ranges | ||
| 73 | are of the format: | ||
| 74 | |||
| 75 | start..end | ||
| 76 | |||
| 77 | Where: | ||
| 78 | |||
| 79 | - start and end must be defined | ||
| 80 | - start and end match the regular expression | ||
| 81 | /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”) | ||
| 82 | - start ≤ end | ||
| 83 | - if start = “inf”, this is negative infinity. This can also be written as | ||
| 84 | “-inf” | ||
| 85 | - if end = “inf”, this is positive infinity | ||
| 86 | - endpoints are inclusive of the range | ||
| 87 | - alert is raised if value is inside start and end range | ||
| 88 | |||
| 89 | (Note: this may be extended in future for adding multiple ranges using a | ||
| 90 | separator - I think this is catered for by repeating ok=,warn=,crit=.) | ||
| 91 | |||
| 92 | This simple range does not require quoting at the shell. | ||
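
A simple range could be parsed and tested along these lines. The sketch below is Python rather than the planned C routines, and the names `parse_simple_range` and `in_simple_range` are hypothetical. It accepts "-inf" as a start value because the text above allows it, even though the regular expression as written does not match it.

```python
import math
import re

# Numeric part of the pattern given above; "inf" and "-inf" are handled
# separately in _endpoint.
NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(token, is_start):
    """Convert one endpoint to a float, mapping inf by position."""
    if token == "inf":
        return -math.inf if is_start else math.inf
    if token == "-inf" and is_start:
        return -math.inf
    if not NUMBER.match(token):
        raise ValueError(f"bad endpoint {token!r}")
    return float(token)


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) pair of floats."""
    start_text, sep, end_text = text.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = _endpoint(start_text, True)
    end = _endpoint(end_text, False)
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return start, end


def in_simple_range(value, text):
    """True if value lies inside the range; endpoints are inclusive."""
    start, end = parse_simple_range(text)
    return start <= value <= end
```

With these definitions, `in_simple_range(5, "0..5")` is true because the endpoints are inclusive, while `parse_simple_range("5..1")` raises `ValueError`.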
| 93 | |||
| 94 | ### Complex Range | ||
| 95 | |||
| 96 | Complex ranges are defined as: | ||
| 97 | |||
| 98 | [^](start..end) | ||
| 99 | |||
| 100 | or | ||
| 101 | |||
| 102 | [^]start..end | ||
| 103 | |||
| 104 | Where: | ||
| 105 | |||
| 106 | - start and end must be defined | ||
| 107 | - start and end match the regular expression | ||
| 108 | /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”) | ||
| 109 | - start ≤ end | ||
| 110 | - if start = “inf”, this is negative infinity. This can also be written as | ||
| 111 | “-inf” | ||
| 112 | - if end = “inf”, this is positive infinity | ||
| 113 | - endpoints are excluded from the range if () are used, otherwise endpoints | ||
| 114 | are included in the range | ||
| 115 | - alert is raised if value is within start and end range, unless \^ is used, | ||
| 116 | in which case alert is raised if outside the range | ||
| 117 | |||
| 118 | Note that due to shell characters, quoting may be required. | ||
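
The two extra markers of a complex range, a leading `^` and surrounding parentheses, might be handled as in the following Python sketch. The function names `parse_complex_range` and `alerts` are hypothetical, and accepting "-inf" as a start value follows the text above rather than the regular expression.

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(token, is_start):
    """Convert one endpoint to a float, mapping inf by position."""
    if token == "inf":
        return -math.inf if is_start else math.inf
    if token == "-inf" and is_start:
        return -math.inf
    if not NUMBER.match(token):
        raise ValueError(f"bad endpoint {token!r}")
    return float(token)


def parse_complex_range(text):
    """Return (start, end, exclusive, negate) for a complex range."""
    negate = text.startswith("^")            # ^ inverts the test
    body = text[1:] if negate else text
    exclusive = body.startswith("(") and body.endswith(")")
    if exclusive:                            # () excludes the endpoints
        body = body[1:-1]
    start_text, sep, end_text = body.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = _endpoint(start_text, True)
    end = _endpoint(end_text, False)
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return start, end, exclusive, negate


def alerts(value, text):
    """True if value triggers the level that this range belongs to."""
    start, end, exclusive, negate = parse_complex_range(text)
    if exclusive:
        inside = start < value < end
    else:
        inside = start <= value <= end
    return not inside if negate else inside
```

Under this reading, `alerts(3, "(3..5)")` is false because the endpoint is excluded, and `alerts(6, "^3..5")` is true because `^` alerts outside the range, which matches the behaviour of the old `3:5` syntax.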
| 119 | |||
| 120 | ### Rules for Determining State | ||
| 121 | |||
| 122 | Given a numeric value, the state of the threshold is calculated from the | ||
| 123 | following ordered rules: | ||
| 124 | |||
| 125 | 1. If no levels are specified, return OK | ||
| 126 | 2. If an ok level is specified and value is within range, return OK | ||
| 127 | 3. If a critical level is specified and value is within range, return | ||
| 128 | CRITICAL | ||
| 129 | 4. If a warning level is specified and value is within range, return WARNING | ||
| 130 | 5. If an ok level is specified, return CRITICAL | ||
| 131 | 6. Otherwise return OK | ||
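
The ordered rules above translate almost line for line into code. The Python sketch below assumes each level is given as a list of inclusive `(start, end)` pairs; the function name `threshold_state` and that representation are illustrative choices, while the numeric values 0, 1 and 2 are the usual plugin exit codes for OK, WARNING and CRITICAL.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin exit codes


def _in_any(value, ranges):
    """True if value falls inside any inclusive (start, end) pair."""
    return any(start <= value <= end for start, end in ranges)


def threshold_state(value, ok=(), warning=(), critical=()):
    """Apply the six ordered rules to a single numeric value."""
    if not ok and not warning and not critical:
        return OK                  # rule 1: no levels specified
    if ok and _in_any(value, ok):
        return OK                  # rule 2: inside an ok range
    if critical and _in_any(value, critical):
        return CRITICAL            # rule 3: inside a critical range
    if warning and _in_any(value, warning):
        return WARNING             # rule 4: inside a warning range
    if ok:
        return CRITICAL            # rule 5: ok given, but value not in it
    return OK                      # rule 6: nothing matched
```

Note the effect of rule 5: with `ok=[(0, 8096)]` and `warning=[(8097, 16182)]`, a value of 20000 matches no range at all and is therefore CRITICAL, without a critical range ever being written down.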
| 132 | |||
| 133 | ### Looking Back … | ||
| 134 | |||
| 135 | So the check\_http example becomes: | ||
| 136 | |||
| 137 | check_http -H $HOSTADDRESS$ \ | ||
| 138 | --th metric=time,ok=0..5 \ | ||
| 139 | --th metric=size,ok=10..inf,prefix=Ki \ | ||
| 140 | --th metric=age,ok=0..1,unit=d | ||
| 141 | |||
| 142 | I believe this is more readable (I’m interested in the time, the size and the | ||
| 143 | age) and more consistent (I’m alerting above 5, less than 10 and above 1, | ||
| 144 | respectively). | ||
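
In the size threshold above, `prefix=Ki` multiplies the range, so `ok=10..inf` covers 10240 bytes and upwards. A minimal Python sketch of that scaling follows; the function name `scale_range` and the subset of prefixes in the table are assumptions, although the factors themselves are the standard SI and binary values.

```python
import math

# A subset of the SI and binary prefixes; extend as required.
PREFIX_FACTORS = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def scale_range(start, end, prefix=None):
    """Multiply both endpoints by the prefix factor (1 if no prefix)."""
    factor = 1 if prefix is None else PREFIX_FACTORS[prefix]
    # Infinite endpoints stay infinite after multiplication.
    return start * factor, end * factor
```

For instance, `scale_range(10, math.inf, "Ki")` gives `(10240, math.inf)`.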
| 145 | |||
| 146 | In addition, performance data will only be output if the metric has been | ||
| 147 | specified. So only show time performance data if “--th metric=time” has been | ||
| 148 | specified on the command line. Either or both of warning\_range and critical\_range can be | ||
| 149 | unspecified - this effectively means “I am not going to alert on this value, | ||
| 150 | but I’d like to be informed about it in the performance data”. | ||
| 151 | |||
| 152 | Because the specification for a range has changed, the warning and critical | ||
| 153 | parts of the performance data can no longer be guaranteed. There is an | ||
| 154 | additional piece of work required to fix a new format for performance data. | ||
| 155 | However, the basic | ||
| 156 | |||
| 157 | label=value[uom] | ||
| 158 | |||
| 159 | will still be valid. | ||
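
Emitting only the basic `label=value[uom]` form, and only for metrics that were named with `--th`, could look like the following Python sketch. The function name `format_perfdata` and the shape of its arguments are hypothetical; the space separator between items is the existing performance data convention.

```python
def format_perfdata(measurements, requested):
    """Build a perfdata string for the requested metrics only.

    measurements maps a metric name to a (value, uom) pair;
    requested lists the metrics named on the command line, in order.
    """
    parts = []
    for metric in requested:
        value, uom = measurements[metric]
        parts.append(f"{metric}={value}{uom}")
    return " ".join(parts)
```

Given measurements for time, size and age, calling the function with `requested=["time"]` returns just the time item, so unrequested metrics never reach the output.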
| 160 | |||
| 161 | ### Examples | ||
| 162 | |||
| 163 | Other examples. | ||
| 164 | |||
| 165 | Check that httpd processes are OK while their virtual size is at most 8096 | ||
| 166 | bytes. Warn up to 16182 bytes; anything bigger is CRITICAL. | ||
| 167 | |||
| 168 | # old | ||
| 169 | check_procs -w 8096 -c 16182 -C httpd --metric VSZ | ||
| 170 | |||
| 171 | # new | ||
| 172 | check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182 | ||
| 173 | |||
| 174 | There should always be one and only one ‘tnslsnr’ process. Otherwise | ||
| 175 | critical. | ||
| 176 | |||
| 177 | # old | ||
| 178 | check_procs -w 1:1 -c 1:1 -C tnslsnr | ||
| 179 | |||
| 180 | # new | ||
| 181 | check_procs -C tnslsnr --th metric=count,ok=1..1 | ||
| 182 | |||
| 183 | Load averages (1,5,15 minute) should be within reasonable ranges. | ||
| 184 | |||
| 185 | # old | ||
| 186 | check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0 | ||
| 187 | |||
| 188 | # new | ||
| 189 | check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 \ | ||
| 190 | --th metric=5min,ok=0..0.8,warn=0.8..1.3 \ | ||
| 191 | --th metric=15min,ok=0..0.7,warn=0.7..1.0 | ||
| 192 | |||
| 193 | ## Plan | ||
| 194 | |||
| 195 | I personally plan on updating check\_procs. | ||
| 196 | |||
| 197 | The basic syntax is: | ||
| 198 | |||
| 199 | check_procs [filter options] [threshold options] | ||
| 200 | |||
| 201 | Where filter options are the current -u {username}, -C {command}, etc. This | ||
| 202 | reduces the set of processes that are to be calculated. | ||
| 203 | |||
| 204 | The new threshold metrics will be: | ||
| 205 | |||
| 206 | - number - alert on number of matching processes. Performance data returns | ||
| 207 | number of processes | ||
| 208 | - rss-threshold - alert on rss size if any matching process is in range. | ||
| 209 | Perf data returns average rss | ||
| 210 | - rss-max - Same as --rss, but perf data returns max rss | ||
| 211 | - rss-sum - alert on the total rss of all matching processes. Perf data | ||
| 212 | returns rss\_sum | ||
| 213 | - vsz-threshold - alert on vsz size if any matching process is in range. | ||
| 214 | Perf data returns average vsz | ||
| 215 | - vsz-max - Same as --vsz, but perf data returns max vsz | ||
| 216 | - vsz-sum - alert on the total vsz of all matching processes. Perf data | ||
| 217 | returns vsz\_sum | ||
| 218 | - cpu-threshold - alert on cpu % of all matching processes. Perf data | ||
| 219 | returns average cpu | ||
| 220 | - cpu-max - Same as --cpu, but perf data returns max cpu | ||
| 221 | - cpu-sum - alert on total cpu. Perf data returns cpu\_sum | ||
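
The per-process metrics listed above all reduce to a count, a maximum, a sum or an average over the filtered processes. The Python sketch below shows one way to compute them; the process records, the key names in the result, and the functions `summarise` and `any_in_range` are hypothetical and only illustrate the aggregation, not the eventual check\_procs implementation.

```python
def summarise(processes, field):
    """Aggregate one numeric field over the matching processes."""
    values = [proc[field] for proc in processes]
    count = len(values)
    total = sum(values)
    return {
        "number": count,
        f"{field}-max": max(values, default=0),
        f"{field}-sum": total,
        # Average is reported as 0 when no process matched the filter.
        f"{field}-average": total / count if count else 0,
    }


def any_in_range(processes, field, start, end):
    """True if any process has the field inside the inclusive range."""
    return any(start <= proc[field] <= end for proc in processes)
```

A threshold such as rss-threshold would then alert when `any_in_range` is true for its range, while reporting the average from `summarise` as performance data.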
| 222 | |||
| 223 | There will be C library routines for parsing the threshold values. | ||
| 224 | |||
| 225 | There will be C library routines for the collection and output of the | ||
| 226 | performance data. | ||
| 227 | |||
| 228 | ## Terminology | ||
| 229 | |||
| 230 | **metric** | ||
| 231 | : Something that a check is going to be measured against. For example, for | ||
| 232 | disk checks, it could be used or free or inodes\_free; for http checks, it | ||
| 233 | could be time [taken] or size; for process checks, it could be cpu or | ||
| 234 | number [of processes] or vsz | ||
| 235 | |||
| 236 | **range** | ||
| 237 | : This defines a continuous range of values for which an alert would be raised | ||
| 238 | |||
| 239 | **level** | ||
| 240 | : This is an alert level within Nagios - OK, WARNING or CRITICAL | ||
| 241 | |||
| 242 | **threshold** | ||
| 243 | : This consists of a level with a range | ||
| 244 | |||
| 245 | ## Limitations | ||
| 246 | |||
| 247 | This assumes that you are always comparing numbers as the metric values. | ||
| 248 | |||
| 249 | There may be some limitations in the precision of values. All internal logic | ||
| 250 | should use double precision. | ||
| 251 | |||
| 252 | If there are multiple metrics, the alert will be on an OR basis, that is, any | ||
| 253 | single metric which breaches its threshold will cause the plugin to return a | ||
| 254 | failed state. | ||
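
One way to realise the OR basis is to take the worst state over all metrics. The Python sketch below assumes the usual exit codes, where a higher number is more severe; the function name `combine_states` is hypothetical, and returning the most severe state (rather than, say, the first failure) is an assumption that goes slightly beyond what is stated above.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin exit codes


def combine_states(states):
    """Return the most severe state; OK if there are no metrics."""
    return max(states, default=OK)
```

So a plugin whose three metrics evaluate to OK, WARNING and OK would exit with WARNING.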
| 255 | |||
| 256 | <!--% # vim:set filetype=markdown textwidth=78 joinspaces: # %--> | ||
