[Nagiosplug-devel] RFC: Performance data guidelines

Hoogendijk, Peter Peter.Hoogendijk at atosorigin.com
Mon Jul 14 06:44:49 CEST 2003


Karl, Ton,

I have been thinking about this during the weekend. In my opinion there
are two types of plugins:

  1) Plugins that perform a specific (direct) check and return a
specific answer. In this case you (the author of the plugin) can make an
exact choice about both the plugin output and the performance data
format.

  2) Plugins that perform a lookup (indirect) check and return (an
interpretation of) the result. This is the case with plugins checking
SNMP or the Microsoft Windows Perfmon data.

This second type of plugin is causing the problems. Karl remarks that
'spaces in attributes seem avoidable', but looking at the results
returned by Microsoft Windows Perfmon, we see a lot of objects, counters
and results with spaces:

  '\System\System Up Time'='15693 sec'

We could decide to remove the spaces, or replace them with underscores,
but this makes the whole process less transparent. As a result, I prefer
a set of guidelines that allows for strings containing any characters.
To summarize the questions I came up with while defining the
output/perfdata format for a lookup (indirect) plugin:

- Do I use single quotes or double quotes?
- How do I escape this character if it exists in a string?
- Do I use spaces or commas to separate the data?

I myself prefer to use single quotes as used in MySQL queries: put
single quotes around the string and double any single quotes in the
string itself. For the separating character I have no preference: I just
used the character as proposed in the 'Performance Data' chapter of the
Nagios documentation.
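
A minimal sketch of the quoting scheme described above, in Python: wrap
each string in single quotes, double any single quote inside it, and
split comma-separated name=value pairs back apart. The function names
(quote, format_perfdata, parse_perfdata) are hypothetical and are not
taken from any existing plugin; the comma separator is an assumption.

```python
def quote(s):
    """Wrap s in single quotes, doubling any single quote inside it."""
    return "'" + s.replace("'", "''") + "'"


def format_perfdata(pairs, sep=","):
    """Join (name, value) pairs as quoted name=value items."""
    return sep.join(quote(n) + "=" + quote(v) for n, v in pairs)


def _read_token(text, i):
    """Read one quoted or bare token starting at index i.

    Returns (token, index of the first character after the token).
    """
    if i < len(text) and text[i] == "'":
        i += 1
        out = []
        while i < len(text):
            if text[i] == "'":
                if text[i + 1:i + 2] == "'":  # doubled quote: literal quote
                    out.append("'")
                    i += 2
                    continue
                return "".join(out), i + 1  # closing quote
            out.append(text[i])
            i += 1
        raise ValueError("unterminated quoted string")
    # Bare token: runs up to the next '=' or comma, spaces allowed inside.
    start = i
    while i < len(text) and text[i] not in "=,":
        i += 1
    return text[start:i].strip(), i


def parse_perfdata(text):
    """Parse comma-separated name=value pairs into a list of tuples."""
    pairs = []
    i = 0
    while i < len(text):
        name, i = _read_token(text, i)
        if i >= len(text) or text[i] != "=":
            raise ValueError("expected '=' after name")
        value, i = _read_token(text, i + 1)
        pairs.append((name, value))
        # Skip separators and the whitespace around them.
        while i < len(text) and text[i] in " ,":
            i += 1
    return pairs
```

With this sketch, a bare item such as pct=0 and a quoted item such as
'name4'='Peter''s PC' both parse, and unquoted strings may contain
spaces as long as they contain no '=' or comma.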

Peter.

P.S. If the strings themselves contain spaces, but don't contain '='
characters or separator characters, the quotes aren't even needed!


-----Original Message-----
From: Karl DeBisschop [mailto:karl at debisschop.net] 
Sent: Friday 11 July 2003 06:38
To: Voon, Ton
Cc: Hoogendijk, Peter; NagiosPlug Devel
Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines


On Thu, 2003-07-10 at 10:30, Voon, Ton wrote:

> I like the idea of quoting the attributes/values, but I don't think 
> they will be necessary if we get the standard attributes and their 
> values right.

I agree somewhat - spaces in attributes especially seem avoidable.

> I think perfdata should be space separated data (just to save 
> processing), but I'm happy to take a consensus. Comma separated may 
> make it a bit easier to parse visually. Any other opinions?

While spaces in attributes seem avoidable, I am less sure about spaces
in values. I could imagine a plugin where the perf data was a string
from a SNMP OID, where we would not really have control over what was in
that string.

> Based on my guidelines, an example output of check_ping would be:
> 
> PING OK - Packet loss = 0%, RTA = 1.96 ms|pct=0 time=1.96

Why do we not allow the plugin perf data to return units like:

  PING OK - Packet loss = 0%, RTA = 1.96 ms|loss=0%,time=1.96 ms

I only ask because there are implementations of ping that can return
'us' instead of 'ms' - I've always felt things are less likely to get
confused if you keep units explicit (just ask NASA and the Mars lander
team).
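
To illustrate the point about explicit units, here is a small Python
sketch that converts a value carrying a unit suffix to milliseconds, so
that 'us' and 'ms' results cannot be mixed up silently. The unit table
and the function name to_milliseconds are assumptions made for
illustration; they are not part of any agreed guideline.

```python
import re

# Hypothetical table of time units and their size in milliseconds.
_MS_PER_UNIT = {"us": 0.001, "ms": 1.0, "s": 1000.0}

# A number, optional whitespace, then a lowercase unit suffix.
_VALUE_RE = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([a-z]+)\s*")


def to_milliseconds(text):
    """Convert a string such as '1.96 ms' or '1960 us' to milliseconds."""
    m = _VALUE_RE.fullmatch(text)
    if m is None:
        # A bare number is rejected on purpose: the unit must be explicit.
        raise ValueError("expected '<number> <unit>', got %r" % text)
    number, unit = m.groups()
    if unit not in _MS_PER_UNIT:
        raise ValueError("unknown time unit: %r" % unit)
    return float(number) * _MS_PER_UNIT[unit]
```

A consumer feeding RRD could call to_milliseconds("1.96 ms") and
to_milliseconds("1960 us") and store comparable numbers either way.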

> Three things that spring to mind:
> - it's a bit shorter!

Short is good. But not so good that reliability, accuracy, or reasonable
clarity should be sacrificed.

> - time means something different from check_http, check_tcp, etc. 
> Those mean "time taken to do a check". For check_ping, it would mean 
> average time for a packet

Hence the idea of allowing units.

> - pct is at 0, which is a "good" result (0% packet loss). However - 
> according to my proposal - check_disk would return pct=5 for 5% free 
> on total disk, which, as it gets closer to 0%, would be "bad". Maybe 
> it should be reversed, so pct=100% to mean no packet loss - should 0% 
> always be considered the worst case? This may not be easy for "number"
> attributes.

If you allow units, check_disk could return either 

  DISK OK [6390 MB (42%) free on /]|free=42%

or

  DISK OK [6390 MB (42%) free on /]|used=58%

And I would suggest the latter.

> As you can see, it is hard to standardise on what the values actually 
> tell you. This is what I meant by "Why the returned values are bad is 
> then up to interpretation (and that is the key to any performance 
> analysis!)". However, what the guidelines will do is allow the RRD 
> generation to happen easier.
>
> > From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com]
> >
> > We are in the process of developing a plugin to check information 
> > collected by another datacollection system. Based on the 
> > 'Performance Data' chapter in the Nagios documentation, we decided 
> > on comma-separated 'name=value' pairs. As we want to be able to 
> > transparently support the names and values used by the other system,
> > both the name and the value part can optionally be quoted (with 
> > either single or double quotes). The result is:
> > 
> > 	Plugin Output|name1=value1, 'name 2'=value2, name3='11"', name4="Peter's PC"
> > 
> > To check our procedures for processing the performance data, I also 
> > modified the check_ping plugin. It now reports:
> > 
> > 	PING OK - Packet loss = 0%, RTA = 1.96 ms|"Packet loss"=0% RTA="1.96 ms"
> > 
> > The problem we are facing with this format is indeed the
> > interpretation by RRD (or in our case the script that's
> > feeding RRD), so we are open to suggestions. Your proposed 
> > guideline at least seems to help us find the right direction.
> >
> > > From: Voon, Ton [mailto:Ton.Voon at egg.com]
> > > 
> > > One of the features required for 1.4 is performance data. I would 
> > > like to write up the guidelines for this, but wanted confirmation 
> > > if this is the right way to go, so any comments would be 
> > > appreciated.

Ton - thanks for kicking this off - sorry I was unable to respond
immediately.

> > > I think perf data should have/be:
> > > 
> > > - short labels
> > > - generic and common labels across plugins if possible
> > > - comma separated, no spaces. Regex format: 
> > > [a-z0-9]+=[0-9]?\.?[0-9]+
> > > - redundant data removed (eg, if check_disk returns pct and number
> > > (free), can calculate used bytes)
> > > 
> > > My suggestion for labels are:
> > > 
> > > Name ; Units ; printf format ; Details
> > > time ; seconds ; %.3f ; time taken to do a specific check (eg DNS query, HTTP request, ping RTA)
> > > pct ; percent ; %.3f ; percentage (free rather than used if applicable) (eg total disk, total swap, ping percent loss)
> > > number ; must be bytes if applicable ; %d ; a given number of things (free rather than used if applicable) (eg processes, users, bytes used such as total disk or total swap)
> > > numberf ; float ; %.3f ; a given number of things that may be fractional (eg, load average, average bytes transmitted)
> > > counter ; a continuous counter (must be bytes if applicable) ; %d ; a continuous counter (eg bytes transmitted on an interface)
> > > load1 ; load ; %.2f ; load average over 1 min
> > > load5 ; load ; %.2f ; load average over 5 min
> > > load15 ; load ; %.2f ; load average over 15 min
> > > 
> > > Contentious points:
> > > - loadx. Not really keen on these, but they don't seem to fit into
> > > any other labels, unless we only return load5 and use numberf
> > > - taking free values rather than used. This is consistent with the
> > > output for check_disk and check_swap. Looking at graphs, I guess
> > > you want to see it nearer zero which is your definite limit, rather
> > > than continuously increasing
> > > - maybe numberf is not required, but we say that number could be
> > > fractional. I think this may be better as RRD doesn't care whether
> > > values are integers or not
> > > - too reductionist? Would you prefer labels that describe the
> > > measure? I think the labels should be generic and the plugin
> > > describes the context
> > > 
> > > As an example, the patches submitted on SF for check_ping had perf
> > > labels of rta and loss, but I think these should be time and pct 
> > > respectively. I think this makes it easier for something like RRD 
> > > to work out what type of value it is to draw the graphs. Why the 
> > > returned values are bad is then up to interpretation (and that is 
> > > the key to any performance analysis!).
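
The regex format quoted in Ton's proposal above can be tried against the
sample items from this thread with a short Python sketch. The pattern is
copied as quoted; the helper name matches_proposed_format and the choice
to require a full match of each item are assumptions for illustration.

```python
import re

# Pattern exactly as quoted in the proposal: label=value, no spaces.
PROPOSED = re.compile(r"[a-z0-9]+=[0-9]?\.?[0-9]+")


def matches_proposed_format(item):
    """Return True if one perfdata item fully matches the quoted pattern."""
    return PROPOSED.fullmatch(item) is not None


def check_perfdata(perfdata):
    """Split comma-separated perfdata and report each item's validity."""
    return [(item, matches_proposed_format(item))
            for item in perfdata.split(",")]
```

Under these assumptions, pct=0 and time=1.96 match, while loss=0% (unit
suffix), "Packet loss"=0% (quotes, space, capitals) and time=12.5 (the
pattern allows at most one digit before the decimal point) do not.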

--
Karl
