[Nagiosplug-devel] RFC: Performance data guidelines

Voon, Ton Ton.Voon at egg.com
Fri Jul 11 07:08:01 CEST 2003


I'm starting to side with Kjell's and Karl's idea of keeping labels separate
from the units. I think that was the flaw in my original proposal: if we
can standardise on the units, then RRD generation should be fairly easy, and
labels can stay descriptive and be whatever you think is suitable for
a particular plugin.

So my amended proposal is:

- output of format 'label=value[UOM]' comma separated
- labels 1-19 characters long in class [a-zA-Z0-9_] (should spaces be
allowed?)
- special labels of warn, warnp, crit and critp (or just warn and crit with
different units?). These pass the threshold levels specified on the command
line. My idea on this is that you can then use RRD to draw yellow/red lines
to show where the warning levels are.
- values in class [-0-9.]. No spaces. Karl has a worry about returned values
from SNMP OIDs, but I think values should always be a number, so it can be
parsed to remove extraneous characters
- units one of:

no unit specified - assume a number (int or float) of things (users,
processes, load averages)
s - seconds (also, us, ms)
% - percentage
b - bytes (also kb, Mb, Tb)
c - a continuous counter (such as bytes transmitted on an interface) (Does
this interfere with a standard unit?)
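
To make the proposed format concrete, here is a rough sketch of how a consumer (for example a script feeding RRD) might parse it. The function name and the regular expressions are my own illustration, not part of any existing plugin code. It assumes that everything after the first '|' is performance data, that spaces are not allowed in labels, and that a value has at most one decimal point, which is stricter than the [-0-9.] class above.

```python
import re

# Hypothetical names; the rules mirror the amended proposal, not a final spec.
LABEL_RE = re.compile(r"^[a-zA-Z0-9_]{1,19}$")
# Assumption: optional leading minus and at most one decimal point.
VALUE_RE = re.compile(r"^-?[0-9]+(?:\.[0-9]+)?$")
# Units named in the proposal; "" means a plain number of things.
UNITS = {"", "s", "us", "ms", "%", "b", "kb", "Mb", "Tb", "c"}
# label '=' value, then whatever is left over is treated as the unit.
ITEM_RE = re.compile(r"^([^=]+)=(-?[0-9.]+)(.*)$")


def parse_perfdata(plugin_output):
    """Return a list of (label, value, unit) tuples from one output line."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return []  # no performance data on this line
    items = []
    for item in perf.strip().split(","):
        match = ITEM_RE.match(item)
        if match is None:
            raise ValueError("not in label=value[UOM] form: %r" % item)
        label, value, unit = match.groups()
        if not LABEL_RE.match(label):
            raise ValueError("bad label: %r" % label)
        if not VALUE_RE.match(value):
            raise ValueError("bad value: %r" % value)
        if unit not in UNITS:
            raise ValueError("unknown unit: %r" % unit)
        items.append((label, float(value), unit))
    return items
```

Because the unit is returned separately from the label, a grapher can choose the data source type from the unit alone and leave the label free-form.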

So some examples:

check_ping:
PING OK - Packet loss = 0%, RTA = 1.00 ms|packet_loss=0%,rta=1ms,warnp=10%,critp=20%

check_disk: 
DISK OK [1150211 kB (57%) free on /dev/dsk/c0t0d0s0]|free_percent=57%,free=1150Mb,warn=100Mb,warnp=10%
I still think that you do not need the total, used and used_percent because
these can be calculated from free and free_percent. I would also use free
rather than used because the lowest limit is 0 and the output shows free. I
think if you specify a set of disks, then data should be returned for the
total of the disks.
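
As a rough illustration of the arithmetic, the sketch below derives the other three figures from free and free_percent. The function name is hypothetical. Two caveats apply: the total cannot be recovered when free_percent is 0, and because the plugin rounds the percentage, the derived total is only approximate.

```python
def derive_usage(free, free_percent):
    """Derive (total, used, used_percent) from free and free_percent.

    'free' is in whatever unit the plugin reported (Mb in the example);
    the returned total and used are in that same unit.
    """
    if free_percent <= 0:
        # With 0% free, free is 0 too, so the total cannot be recovered.
        raise ValueError("total is not recoverable when free_percent is 0")
    total = free / (free_percent / 100.0)
    used = total - free
    used_percent = 100.0 - free_percent
    return total, used, used_percent


# Figures from the check_disk example: free=1150Mb, free_percent=57%
total, used, used_percent = derive_usage(1150, 57)
```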

check_swap:
CRITICAL - Swap used: 18% (778368 out of 4194272)|free_percent=82%,free=778Mb,warnp=5%

check_load:
OK - load average: 0.03, 0.04, 0.05|load1=0.03,warn=1,crit=2
I think we should only return performance data for 1 set of timings,
otherwise it gets very complicated (on a side issue, is it possible to have
a plugin return % values instead of load levels?)

check_procs:
OK - 5 processes running with command name /usr/local/apache/bin/httpd|processes=5,warn=10
Hmmm, this goes against my check_disk example of using 0 as the lower bound,
as check_procs can only be reported "upwards".

check_users:
USERS OK - 2 users currently logged in|users=2,warn=10,crit=20
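
On the plugin side, the examples above could be produced by a small helper along these lines. This is only a sketch: the function name is made up, the label check follows the 1-19 character rule proposed earlier, and printing at most two decimal places is my assumption rather than part of the proposal.

```python
import re

# Label rule from the proposal: 1-19 characters in [a-zA-Z0-9_].
LABEL_RE = re.compile(r"^[a-zA-Z0-9_]{1,19}$")


def format_output(status_text, metrics):
    """Join status text and (label, value, unit) metrics with a '|'."""
    parts = []
    for label, value, unit in metrics:
        if not LABEL_RE.match(label):
            raise ValueError("bad label: %r" % label)
        # Assumption: at most two decimals, trailing zeros dropped,
        # so 2 prints as "2" and 0.03 prints as "0.03".
        number = ("%.2f" % value).rstrip("0").rstrip(".")
        parts.append("%s=%s%s" % (label, number, unit))
    return status_text + "|" + ",".join(parts)


# Thresholds are passed as ordinary metrics using the special labels.
print(format_output(
    "USERS OK - 2 users currently logged in",
    [("users", 2, ""), ("warn", 10, ""), ("crit", 20, "")],
))
```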

Are we getting closer?

Ton

