[Nagiosplug-devel] RFC: Performance data guidelines

Karl DeBisschop karl at debisschop.net
Thu Jul 10 21:40:05 CEST 2003


On Thu, 2003-07-10 at 10:30, Voon, Ton wrote:

> I like the idea of quoting the attributes/values, but I don't think they
> will be necessary if we get the standard attributes and their values right. 

I agree somewhat - spaces in attributes especially seem avoidable.

> I think perfdata should be space separated data (just to save processing),
> but I'm happy to take a consensus. Comma separated may make it a bit easier
> to parse visually. Any other opinions?

While spaces in attributes seem avoidable, I am less sure about spaces
in values. I could imagine a plugin where the perf data was a string
from an SNMP OID, where we would not really have control over what was
in that string.

> Based on my guidelines, an example output of check_ping would be:
> 
> PING OK - Packet loss = 0%, RTA = 1.96 ms|pct=0 time=1.96

Why do we not allow the plugin perf data to return units like:

  PING OK - Packet loss = 0%, RTA = 1.96 ms|loss=0%,time=1.96 ms

I only ask because there are implementations of ping that can return 'us'
instead of 'ms' - I've always felt things are less likely to get confused
if you keep units explicit (just ask NASA and the Mars lander team).
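
To illustrate what explicit units buy whoever consumes the data, here is
a minimal Python sketch. It is not taken from any plugin: the function
name to_seconds and the unit table are assumptions made for this example.

```python
import re

# Hypothetical unit table: multipliers that convert a time value to seconds.
TIME_UNITS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}

def to_seconds(value):
    """Convert a string such as '1.96 ms' or '1960us' to seconds."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*", value)
    if m is None:
        # A bare number is rejected rather than guessed at.
        raise ValueError("no unit given, cannot interpret %r" % value)
    number, unit = m.groups()
    if unit not in TIME_UNITS:
        raise ValueError("unknown time unit %r" % unit)
    return float(number) * TIME_UNITS[unit]
```

With the unit attached, '1.96 ms' and '1960 us' normalise to the same
number of seconds, while a bare '1.96' is refused instead of being
silently misread.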

> Three things that spring to mind:
> - it's a bit shorter!

Short is good. But not so good that reliability, accuracy, or reasonable
clarity should be sacrificed.

> - time means something different from check_http, check_tcp, etc. Those mean
> "time taken to do a check". For check_ping, it would mean average time for a
> packet

Hence the idea of allowing units.

> - pct is at 0, which is a "good" result (0% packet loss). However -
> according to my proposal - check_disk would return pct=5 for 5% free on
> total disk, which, as it gets closer to 0%, would be "bad". Maybe it should
> be reversed, so pct=100% to mean no packet loss - should 0% always be
> considered the worst case? This may not be easy for "number" attributes.

If you allow units, check_disk could return either 

  DISK OK [6390 MB (42%) free on /]|free=42%

or

  DISK OK [6390 MB (42%) free on /]|used=58%

And I would suggest the latter.

> As you can see, it is hard to standardise on what the values actually tell
> you. This is what I meant by "Why the returned values are bad is then up to
> interpretation (and that is the key to any performance analysis!)". However,
> what the guidelines will do is allow the RRD generation to happen easier.
>
> > From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com] 
> >
> > We are in the process of developing a plugin to check information
> > collected by another datacollection system. Based on the 'Performance
> > Data' chapter in the Nagios documentation, we decided on comma-separated
> > 'name=value' pairs. As we want to be able to transparently support the
> > names and values used by the other system, both the name and the value
> > part can optionally be quoted (with either single or double quotes). The
> > result is:
> > 
> > 	Plugin Output|name1=value1, 'name 2'=value2, name3='11"', name4="Peter's PC"
> > 
> > To check our procedures for processing the performance data, I also
> > modified the check_ping plugin. It now reports:
> > 
> > 	PING OK - Packet loss = 0%, RTA = 1.96 ms|"Packet loss"=0% RTA="1.96 ms"
> > 
> > The problem we are facing with this format is indeed the interpretation
> > by RRD (or in our case the script that's feeding RRD), so we are open
> > for suggestions. Your proposed guideline at least seems to help us find
> > the right direction.
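
For reference, the optionally quoted name=value format described above
can be read back with a short tokenizer. The following Python is only a
sketch under assumptions: the names parse_perfdata and _unquote are
hypothetical, pairs may be separated by a comma, whitespace, or both
(the two examples above differ on this), and quotes cannot be escaped
inside a quoted token.

```python
import re

# A token is 'single-quoted', "double-quoted", or a bare run of characters
# that contains no whitespace, comma, equals sign, or quote.
_TOKEN = r"""'[^']*'|"[^"]*"|[^\s,='"]+"""
_PAIR = re.compile(r"\s*(%s)=(%s)\s*,?" % (_TOKEN, _TOKEN))

def _unquote(token):
    """Strip one layer of surrounding quotes, if present."""
    if token[0] in "'\"":
        return token[1:-1]
    return token

def parse_perfdata(output):
    """Return (name, value) pairs from the text after the first '|'."""
    _, _, perf = output.partition("|")
    perf = perf.strip()
    pairs = []
    pos = 0
    while pos < len(perf):
        m = _PAIR.match(perf, pos)
        if m is None:
            raise ValueError("cannot parse perf data at offset %d" % pos)
        pairs.append((_unquote(m.group(1)), _unquote(m.group(2))))
        pos = m.end()
    return pairs
```

Values come back as strings, units included ('1.96 ms'), so the script
feeding RRD still has to decide how to turn them into numbers.
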
> >
> > > From: Voon, Ton [mailto:Ton.Voon at egg.com] 
> > > 
> > > One of the features required for 1.4 is performance data. I would like
> > > to write up the guidelines for this, but wanted confirmation if this is
> > > the right way to go, so any comments would be appreciated.

Ton - thanks for kicking this off - sorry I was unable to respond
immediately.

> > > I think perf data should have/be:
> > > 
> > > - short labels
> > > - generic and common labels across plugins if possible
> > > - comma separated, no spaces. Regex format: [a-z0-9]+=[0-9]?\.?[0-9]+
> > > - redundant data removed (eg, if check_disk returns pct and number
> > > (free), can calculate used bytes)
> > > 
> > > My suggestion for labels are:
> > > 
> > > Name ; Units ; printf format ; Details
> > > time ; seconds ; %.3f ; time taken to do a specific check (eg DNS
> > >   query, HTTP request, ping RTA)
> > > pct ; percent ; %.3f ; percentage (free rather than used if
> > >   applicable) (eg total disk, total swap, ping percent loss)
> > > number ; must be bytes if applicable ; %d ; a given number of things
> > >   (free rather than used if applicable) (eg processes, users, bytes
> > >   used such as total disk or total swap)
> > > numberf ; float ; %.3f ; a given number of things that may be
> > >   fractional (eg, load average, average bytes transmitted)
> > > counter ; a continuous counter (must be bytes if applicable) ; %d ;
> > >   a continuous counter (eg bytes transmitted on an interface)
> > > load1 ; load ; %.2f ; load average over 1 min
> > > load5 ; load ; %.2f ; load average over 5 min
> > > load15 ; load ; %.2f ; load average over 15 min
> > > 
> > > Contentious points:
> > > - loadx. Not really keen on these, but don't seem to fit into any other
> > > labels, unless we only return load5 and use numberf
> > > - taking free values rather than used. This is consistent with the
> > > output for check_disk and check_swap. Looking at graphs, I guess you
> > > want to see it nearer zero which is your definite limit, rather than
> > > continuously increasing
> > > - maybe numberf is not required, but we say that number could be
> > > fractional. I think this may be better as RRD doesn't care whether
> > > values are integers or not
> > > - too reductionist? Would you prefer labels that describe the measure?
> > > I think the labels should be generic and the plugin describes the
> > > context
> > > 
> > > As an example, the patches submitted on SF for check_ping had perf
> > > labels of rta and loss, but I think these should be time and pct
> > > respectively. I think this makes it easier for something like RRD to
> > > work out what type of value it is to draw the graphs. Why the returned
> > > values are bad is then up to interpretation (and that is the key to any
> > > performance analysis!).
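
One note on the regex in the proposal: read literally,
[a-z0-9]+=[0-9]?\.?[0-9]+ allows at most one digit before the decimal
point, so time=1.96 and number=123 pass but time=12.5 does not. The
Python sketch below checks a comma-separated string against it. The
function name conforms and the RELAXED pattern are assumptions made for
illustration, not part of the proposal.

```python
import re

# Pattern quoted from the proposal, applied to one name=value pair.
PROPOSED = re.compile(r"[a-z0-9]+=[0-9]?\.?[0-9]+")
# Hypothetical alternative that allows any number of integer digits.
RELAXED = re.compile(r"[a-z0-9]+=[0-9]+(\.[0-9]+)?")

def conforms(perfdata, pattern=PROPOSED):
    """True if every comma-separated pair fully matches the pattern."""
    return all(pattern.fullmatch(pair) for pair in perfdata.split(","))

print(conforms("pct=0,time=1.96"))     # matches the proposed pattern
print(conforms("time=12.5"))           # rejected: two digits before the point
print(conforms("time=12.5", RELAXED))  # accepted by the relaxed variant
```

Neither pattern leaves room for units or quoted values; they only cover
the bare numeric form.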

--
Karl




