title: New Threshold Syntax
parent: Documentation
---

# New Specification Method for Thresholds

_Ton Voon, March 17, 2008_

## Overview

The method for defining thresholds via the command line is inconsistent and
difficult to interpret.  This proposal suggests a different way of specifying
thresholds, which will also change the metrics of performance data returned.

## Problem

The current method of specifying thresholds is confusing when there are
different checks required.  For instance, in `check_http`, to check page size
and time, you can specify `-w {warn time}`, `-c {crit time}`,
`-m {minpagesize}[:maxpagesize]`, `-M {maxage of document}`.

Note also that the ways of defining the range are inconsistent: some options
alert above the value (time, maxage), while others alert below it (pagesize).
This inconsistency occurs within a single plugin!

So, to check that a web page is returned within 5 seconds, the minimum page
size is 10K and the maximum age is 1 day, you would invoke:

    check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d

Furthermore, the current specification for ranges in the developer guidelines
fails the “obviousness” test: a range of 3:5 will alert if the value is
outside that range, rather than inside as you would expect.

Also, the performance data returned by `check_http` is always time and size.
Perhaps you want only time, or you want age as well.

## Proposal

### Thresholds

This document proposes that threshold arguments be specified as:

    --threshold={threshold definition}
    --th={threshold definition}

The threshold definition is a subgetopt format of the form:

    metric={metric},ok={range},warn={range},crit={range},unit={unit},prefix={SI prefix}

Where:

- `ok`, `warn`, `crit` are called “levels”
- `ok`, `warn`, `crit`, `unit` and `prefix` are all optional
- if none of `ok`, `warning` and `critical` is specified, then no alert is
  raised, but the performance data will still be returned
- the `unit` can be specified with plugins that do not know about the type of
  value returned (SNMP, Windows performance counters, etc.)
- the `prefix` is used to multiply the input range and possibly for display
  data.  The prefixes allowed are defined by NIST:  
  <http://physics.nist.gov/cuu/Units/prefixes.html>  
  <http://physics.nist.gov/cuu/Units/binary.html>
- `ok`, `warning` or `critical` can be repeated to define an additional range.
  This allows non-continuous ranges to be defined
- `warning` can be abbreviated to `warn` or `w`
- `critical` can be abbreviated to `crit` or `c`
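
As a rough illustration of the rules above, the following Python sketch splits a threshold definition into its parts. The names `parse_threshold`, `ALIASES` and `LEVELS` are hypothetical, and treating a missing `metric` as an error is an assumption; the proposal itself only says that C library routines will exist.

```python
# Illustrative sketch only: parse_threshold, ALIASES and LEVELS are
# hypothetical names, not part of any existing plugin library.
ALIASES = {"warning": "warn", "w": "warn", "critical": "crit", "c": "crit"}
LEVELS = ("ok", "warn", "crit")


def parse_threshold(definition):
    """Split 'metric=...,ok=...,warn=...' into a dictionary.

    Levels may be repeated, so each level maps to a list of range strings.
    """
    parsed = {"metric": None, "unit": None, "prefix": None,
              "ok": [], "warn": [], "crit": []}
    for part in definition.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        key = ALIASES.get(key, key)
        if key in LEVELS:
            parsed[key].append(value)
        elif key in ("metric", "unit", "prefix"):
            parsed[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    # Assumption: a definition without a metric is treated as an error.
    if parsed["metric"] is None:
        raise ValueError("metric is required")
    return parsed


example = parse_threshold("metric=size,ok=10..inf,w=5..9,warning=1..2,prefix=Ki")
print(example["warn"])    # ['5..9', '1..2']
print(example["prefix"])  # Ki
```

Keeping each level as a list is one simple way to support the repeated `ok`, `warning` or `critical` entries that define non-continuous ranges.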

### Simple Range

The range values have two specifications: simple and complex.  Simple ranges
are of the format:

    start..end

Where:

- `start` and `end` must be defined
- `start` and `end` match the regular expression
  `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (i.e., a number or “inf”)
- `start` ≤ `end`
- if `start` = `inf`, this is negative infinity.  This can also be written as
  `-inf`
- if `end` = `inf`, this is positive infinity
- endpoints are included in the range
- an alert is raised if the value is inside the `start` to `end` range

(Note: this may be extended in future to allow multiple ranges using a
separator, although I think this is already catered for by repeating `ok=`,
`warn=` or `crit=`.)

This simple range does not require quoting at the shell.
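
A minimal Python sketch of simple-range parsing, under stated assumptions: the function names are hypothetical, and the endpoint pattern is the one given above, widened slightly so that the `-inf` spelling mentioned in the list is accepted.

```python
import math
import re

# Endpoint pattern from the text, extended with an optional leading "-" on
# "inf" (an assumption, since the text allows "-inf" as a spelling).
ENDPOINT = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$|^-?inf$")


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float; a bare 'inf' as start is negative."""
    if not ENDPOINT.match(text):
        raise ValueError(f"invalid endpoint {text!r}")
    if text == "-inf" or (text == "inf" and is_start):
        return -math.inf
    if text == "inf":
        return math.inf
    return float(text)


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, sep, end_text = text.partition("..")
    if not sep:
        raise ValueError(f"expected start..end, got {text!r}")
    start = parse_endpoint(start_text, is_start=True)
    end = parse_endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError("start must not exceed end")
    return start, end


def in_simple_range(value, bounds):
    """Return True if value lies in the range; endpoints are inclusive."""
    start, end = bounds
    return start <= value <= end


print(in_simple_range(5, parse_simple_range("0..5")))       # True
print(in_simple_range(9.5, parse_simple_range("10..inf")))  # False
```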

### Complex Range

Complex ranges are defined as:

    [^](start..end)

or

    [^]start..end

Where:

- `start` and `end` must be defined
- `start` and `end` match the regular expression
  `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (i.e., a number or “inf”)
- `start` ≤ `end`
- if `start` = `inf`, this is negative infinity.  This can also be written as
  `-inf`
- if `end` = `inf`, this is positive infinity
- endpoints are excluded from the range if parentheses `()` are used;
  otherwise endpoints are included in the range
- an alert is raised if the value is within the `start` to `end` range,
  unless `^` is used, in which case an alert is raised if the value is
  outside the range

Note that due to shell characters, quoting may be required.
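
The complex form could be handled along the following lines. This is a sketch with hypothetical names (`parse_complex_range`, `alerts`), not the planned C routines; it accepts `-inf` as an assumption, and it treats an unmatched parenthesis as an invalid endpoint.

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(text, is_start):
    """Convert an endpoint; a bare 'inf' as the start means negative infinity."""
    if text == "-inf" or (text == "inf" and is_start):
        return -math.inf
    if text == "inf":
        return math.inf
    if not NUMBER.match(text):
        raise ValueError(f"invalid endpoint {text!r}")
    return float(text)


def parse_complex_range(text):
    """Parse '[^](start..end)' or '[^]start..end'.

    Returns (start, end, exclusive, negate).
    """
    negate = text.startswith("^")
    body = text[1:] if negate else text
    exclusive = body.startswith("(") and body.endswith(")")
    if exclusive:
        body = body[1:-1]
    start_text, sep, end_text = body.partition("..")
    if not sep:
        raise ValueError(f"expected start..end, got {text!r}")
    start = _endpoint(start_text, is_start=True)
    end = _endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError("start must not exceed end")
    return start, end, exclusive, negate


def alerts(value, parsed):
    """Return True if the value should raise an alert for this range."""
    start, end, exclusive, negate = parsed
    inside = start < value < end if exclusive else start <= value <= end
    return inside != negate  # '^' inverts the test


print(alerts(3, parse_complex_range("3..5")))    # True
print(alerts(3, parse_complex_range("(3..5)")))  # False
print(alerts(6, parse_complex_range("^3..5")))   # True
```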

### Rules for Determining State

Given a numeric value, the state of the threshold is calculated from the
following ordered rules:

1. If no levels are specified, return `OK`
2. If an `ok` level is specified and value is within range, return `OK`
3. If a `critical` level is specified and value is within range, return
   `CRITICAL`
4. If a `warning` level is specified and value is within range, return
   `WARNING`
5. If an `ok` level is specified, return `CRITICAL`
6. Otherwise return `OK`
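
The ordered rules translate almost directly into code. In this sketch, ranges are simplified to inclusive `(start, end)` tuples and `get_state` is a hypothetical name; the numeric values are the standard plugin return codes.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin return codes


def in_any(value, ranges):
    """True if value falls inside at least one inclusive (start, end) range."""
    return any(start <= value <= end for start, end in ranges)


def get_state(value, ok=(), warn=(), crit=()):
    """Apply the six ordered rules to a numeric value."""
    if not ok and not warn and not crit:
        return OK                      # rule 1: no levels specified
    if ok and in_any(value, ok):
        return OK                      # rule 2
    if crit and in_any(value, crit):
        return CRITICAL                # rule 3
    if warn and in_any(value, warn):
        return WARNING                 # rule 4
    if ok:
        return CRITICAL                # rule 5: ok given, value not in it
    return OK                          # rule 6


# Thresholds of the form ok=0..8096,warn=8097..16182
ok, warn = [(0, 8096)], [(8097, 16182)]
print(get_state(5000, ok=ok, warn=warn))   # 0
print(get_state(10000, ok=ok, warn=warn))  # 1
print(get_state(20000, ok=ok, warn=warn))  # 2
```

Rule 5 is what makes an `ok` range act as a whitelist: any value that matches neither `ok` nor an explicit `warn` or `crit` range falls through to `CRITICAL`.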

### Looking Back …

So the `check_http` example becomes:

    check_http -H $HOSTADDRESS$ \
               --th metric=time,ok=0..5 \
               --th metric=size,ok=10..inf,prefix=Ki \
               --th metric=age,ok=0..1,unit=d

I believe this is more readable (I’m interested in the time, the size and the
age) and more consistent (I’m alerting above 5, less than 10 and above 1,
respectively).
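
To make the effect of `prefix=Ki` concrete, here is a small sketch of how a prefix might multiply a range. The `PREFIXES` table covers only a few of the NIST prefixes, and `scale_range` is a hypothetical helper.

```python
import math

# Multipliers for a few of the SI and binary prefixes listed by NIST.
PREFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def scale_range(bounds, prefix=None):
    """Multiply both endpoints of a (start, end) range by the prefix factor."""
    start, end = bounds
    factor = PREFIXES[prefix] if prefix else 1
    return start * factor, end * factor


# ok=10..inf with prefix=Ki means "OK from 10240 upwards"
print(scale_range((10, math.inf), "Ki"))  # (10240, inf)
```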

In addition, performance data will only be output if the metric has been
specified.  For example, time performance data is shown only if
`--th metric=time` has been specified on the command line.  Both the warning
range and the critical range can be left unspecified; this effectively means
“I am not going to alert on this value, but I’d like to be informed about it
in the performance data”.

Because the specification for a range has changed, the warning and critical
parts of the performance data can no longer be guaranteed.  Additional work
is required to settle on a new format for performance data.  However, the
basic

    label=value[uom]

will still be valid.
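
As a sketch of the idea that only requested metrics appear in the output, the following hypothetical `format_perfdata` helper emits the basic `label=value[uom]` form for the metrics named on the command line. The sample values and units are invented for illustration.

```python
def format_perfdata(measured, requested):
    """Build a 'label=value[uom]' string for the requested metrics only.

    measured maps a label to a (value, uom) pair; requested is the list of
    metric names given via --th on the command line.
    """
    items = []
    for label in requested:
        value, uom = measured[label]
        # Labels containing spaces are wrapped in single quotes.
        shown = f"'{label}'" if " " in label else label
        items.append(f"{shown}={value}{uom}")
    return " ".join(items)


# Invented sample measurements; only time and size were requested.
measured = {"time": (0.5, "s"), "size": (10240, "B"), "age": (3600, "s")}
print(format_perfdata(measured, ["time", "size"]))  # time=0.5s size=10240B
```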

### Examples

Some further examples follow.

Check that httpd processes are `OK` while their virtual size is at most 8096
bytes, `WARNING` up to 16182 bytes, and `CRITICAL` above that.

    # old
    check_procs -w 8096 -c 16182 -C httpd --metric VSZ

    # new
    check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182

There should always be exactly one ‘tnslsnr’ process; otherwise the state is
`CRITICAL`.

    # old
    check_procs -w 1:1 -c 1:1 -C tnslsnr

    # new
    check_procs -C tnslsnr --th metric=count,ok=1..1

Load averages (1,5,15 minute) should be within reasonable ranges.

    # old
    check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0

    # new
    check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 \
               --th metric=5min,ok=0..0.8,warn=0.8..1.3 \
               --th metric=15min,ok=0..0.7,warn=0.7..1.0

## Plan

I personally plan on updating `check_procs`.

The basic syntax is:

    check_procs [filter options] [threshold options]

The filter options are the current `-u {username}`, `-C {command}`, etc.
They reduce the set of processes that the metrics are calculated over.

The new threshold metrics will be:

- number - alert on number of matching processes.  Performance data returns
  number of processes
- rss-threshold - alert on rss size if any matching process is in range.  Perf
  data returns average rss
- rss-max - Same as `--rss`, but perf data returns max rss
- rss-sum - alert on the total rss of all matching processes.  Perf data
  returns rss\_sum
- vsz-threshold - alert on vsz size if any matching process is in range.  Perf
  data returns average vsz
- vsz-max - Same as `--vsz`, but perf data returns max vsz
- vsz-sum - alert on the total vsz of all matching processes.  Perf data
  returns vsz\_sum
- cpu-threshold - alert on cpu % of all matching processes.  Perf data returns
  average cpu
- cpu-max - Same as `--cpu`, but perf data returns max cpu
- cpu-sum - alert on total cpu.  Perf data returns cpu\_sum
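
The per-process metrics above boil down to three aggregates (average, max, sum) plus an "any process in range" test. A hedged sketch, with hypothetical helper names and invented sample values, and assuming that an empty set of matching processes yields zeros:

```python
def summarise(values):
    """Return the average, max and sum of per-process values."""
    # Assumption: with no matching processes, every aggregate is zero.
    if not values:
        return {"average": 0, "max": 0, "sum": 0}
    total = sum(values)
    return {"average": total / len(values), "max": max(values), "sum": total}


def any_in_range(values, start, end):
    """True if any single process value lies in the inclusive range."""
    return any(start <= v <= end for v in values)


rss = [1200, 3400, 800]  # invented rss figures for three matching processes
stats = summarise(rss)
print(stats["average"], stats["max"], stats["sum"])  # 1800.0 3400 5400
print(any_in_range(rss, 3000, float("inf")))         # True
```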

There will be C library routines for parsing the threshold values.

There will be C library routines for the collection and output of the
performance data.

## Terminology

**metric**
:   Something that a check is going to be measured against.  For example, for
    disk checks, it could be used or free or inodes\_free; for HTTP checks, it
    could be time taken or size; for process checks, it could be cpu or
    number of processes or vsz

**range**
:   This defines a continuous range of values for which an alert would be
    raised

**level**
:   This is an alert level within Nagios - `OK`, `WARNING` or `CRITICAL`

**threshold**
:   This consists of a level with a range

## Limitations

This assumes that you are always comparing numbers as the metric values.

There may be some limitations in the precision of values.  All internal logic
should use double precision.

If there are multiple metrics, the alert will be on an OR basis, that is, any
single metric which passes its threshold will cause the plugin to return a
failed state.
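
One way to read the OR rule is "report the worst state seen across all metrics". The sketch below assumes the standard return codes, where a larger number is a more severe state; `combine_states` is a hypothetical name.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin return codes


def combine_states(states):
    """Return the most severe state; with no metrics at all, return OK."""
    return max(states, default=OK)


print(combine_states([OK, WARNING, OK]))        # 1
print(combine_states([OK, CRITICAL, WARNING]))  # 2
```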

<!--% # vim:set filetype=markdown textwidth=78 joinspaces expandtab: # %-->