[Nagiosplug-devel] RFC: Performance data guidelines

kjell.sundtjonn at elkem.no kjell.sundtjonn at elkem.no
Fri Jul 11 01:15:14 CEST 2003


As I understand it, the major reason for introducing performance data is to
be able to integrate Nagios with RRDtool.
(Performance data is an open architecture, but it seems that integration
with RRDtool is what everyone is talking about).

With that as a background I will propose a layout for the PING example as

   PING OK - Packet loss = 0%, RTA = 1.96 ms|Packet_loss=0%,RTA=1.96ms

More generally, performance data should be a comma-separated
'name=value[UOM]' list.
Name should be a valid and meaningful RRD data source name (1 to 19
characters long, drawn from the characters [a-zA-Z0-9_]).
UOM is an optional unit of measurement (%, MB, etc.; no whitespace allowed).
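
To illustrate how little parsing this layout needs, here is a minimal sketch
in Python. The function name parse_perfdata and the regular expression are
illustrative assumptions, not part of any existing plugin or of Nagios; the
sketch also assumes values are plain decimal numbers and that the UOM contains
no commas.

```python
import re

# name: 1 to 19 characters from [a-zA-Z0-9_]; value: a decimal number;
# UOM: optional, no whitespace or commas (an assumption of this sketch).
_PAIR = re.compile(r"^([a-zA-Z0-9_]{1,19})=(-?[0-9]*\.?[0-9]+)([^\s,]*)$")


def parse_perfdata(plugin_output):
    """Return a list of (name, value, uom) tuples from one plugin output line."""
    # Everything after the first '|' is treated as performance data.
    _, sep, perfdata = plugin_output.partition("|")
    perfdata = perfdata.strip()
    if not sep or not perfdata:
        return []
    result = []
    for item in perfdata.split(","):
        match = _PAIR.match(item.strip())
        if match is None:
            raise ValueError("malformed performance data item: %r" % item)
        name, value, uom = match.groups()
        result.append((name, float(value), uom))
    return result
```

For the PING line above this would give one tuple for Packet_loss with unit
'%' and one for RTA with unit 'ms'.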

This format is easy to parse, and RRDtool update statements are easy to
generate from it, because the RRD data source name is given directly in the
performance data string.
Given some reasonable assumptions about the consolidation structure of our
RRD databases, you should even be able to create new RRD databases on the fly
for any new service that starts to deliver performance data through Nagios.
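
As a rough sketch of that idea in Python: the two helpers below only build
argument lists for the rrdtool command line; they do not run anything. The
helper names, the file name in the usage note, the 300 second step, the GAUGE
data source type and the single AVERAGE archive are all assumptions standing
in for whatever consolidation structure is actually agreed on.

```python
def rrd_create_args(rrd_file, names, step=300):
    """Build arguments for `rrdtool create` with one GAUGE DS per name."""
    heartbeat = 2 * step  # assumed: tolerate one missed check
    args = ["rrdtool", "create", rrd_file, "--step", str(step)]
    # The data source names come straight from the performance data string.
    args += ["DS:%s:GAUGE:%d:U:U" % (name, heartbeat) for name in names]
    # Assumed consolidation: 600 rows of unconsolidated averages.
    args.append("RRA:AVERAGE:0.5:1:600")
    return args


def rrd_update_args(rrd_file, pairs, timestamp="N"):
    """Build arguments for `rrdtool update` from (name, value) pairs."""
    names = ":".join(name for name, _ in pairs)
    values = ":".join(str(value) for _, value in pairs)
    # --template tells rrdtool which data source each value belongs to;
    # a timestamp of "N" means "now".
    return ["rrdtool", "update", rrd_file,
            "--template", names,
            "%s:%s" % (timestamp, values)]
```

For example, rrd_update_args("ping.rrd", [("Packet_loss", 0.0), ("RTA", 1.96)])
yields an update with template Packet_loss:RTA and value string N:0.0:1.96.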



Kjell Sundtjønn



Karl DeBisschop <karl at debisschop.net>
Sent by: nagiosplug-devel-admin at lists.sourceforge.net
11.07.2003 06:37

To:       "Voon, Ton" <Ton.Voon at egg.com>
cc:       "'Hoogendijk, Peter'" <Peter.Hoogendijk at atosorigin.com>, NagiosPlug Devel <nagiosplug-devel at lists.sourceforge.net>
Subject:  RE: [Nagiosplug-devel] RFC: Performance data guidelines




On Thu, 2003-07-10 at 10:30, Voon, Ton wrote:

> I like the idea of quoting the attributes/values, but I don't think they
> will be necessary if we get the standard attributes and their values right.

I agree somewhat - spaces in attributes especially seem avoidable.

> I think perfdata should be space separated data (just to save processing),
> but I'm happy to take a consensus. Comma separated may make it a bit easier
> to parse visually. Any other opinions?

While spaces in attributes seem avoidable, I am less sure about spaces
in values. I could imagine a plugin where the perf data was a string
from a SNMP OID, where we would not really have control over what was in
that string.

> Based on my guidelines, an example output of check_ping would be:
>
> PING OK - Packet loss = 0%, RTA = 1.96 ms|pct=0 time=1.96

Why do we not allow the plugin perf data to return units like:

  PING OK - Packet loss = 0%, RTA = 1.96 ms|loss=0%,time=1.96 ms

I only ask because there are implementations of ping that can return 'us'
instead of 'ms' - I've always felt things are less likely to get confused
if you keep units explicit (just ask NASA and the Mars lander team).
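
A small Python sketch of why explicit units help whatever consumes the data:
with the unit attached, a reading in 'us' and a reading in 'ms' can be brought
to one scale before they are stored. The function name to_seconds and the
table of units are assumptions made for this illustration.

```python
# Conversion factors to seconds for the time units a ping might report.
_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def to_seconds(value, uom):
    """Convert a time value with an explicit unit to seconds."""
    try:
        return value * _TO_SECONDS[uom]
    except KeyError:
        # Refuse to guess: an unknown unit is an error, not a default.
        raise ValueError("unknown time unit: %r" % uom)
```

Without the unit, 1.96 and 1960 would be indistinguishable from two very
different round trip times.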

> Three things that spring to mind:
> - it's a bit shorter!

Short is good. But not so good that reliability, accuracy, or reasonable
clarity should be sacrificed.

> - time means something different from check_http, check_tcp, etc. Those mean
> "time taken to do a check". For check_ping, it would mean average time for a
> packet

Hence the idea of allowing units.

> - pct is at 0, which is a "good" result (0% packet loss). However -
> according to my proposal - check_disk would return pct=5 for 5% free on
> total disk, which, as it gets closer to 0%, would be "bad". Maybe it should
> be reversed, so pct=100% to mean no packet loss - should 0% always be
> considered the worst case? This may not be easy for "number" attributes.

If you allow units, check_disk could return either

  DISK OK [6390 MB (42%) free on /]|free=42%

or

  DISK OK [6390 MB (42%) free on /]|used=58%

And I would suggest the latter.

> As you can see, it is hard to standardise on what the values actually tell
> you. This is what I meant by "Why the returned values are bad is then up to
> interpretation (and that is the key to any performance analysis!)". However,
> what the guidelines will do is allow the RRD generation to happen easier.
>
> > From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com]
> >
> > We are in the process of developing a plugin to check information
> > collected by another datacollection system. Based on the 'Performance
> > Data' chapter in the Nagios documentation, we decided on
> > comma-separated
> > 'name=value' pairs. As we want to be able to transparently support the
> > names and values used by the other system, both the name and the value
> > part can optionally be quoted (with either single or double
> > quotes). The
> > result is:
> >
> >          Plugin Output|name1=value1, 'name 2'=value2, name3='11"',
> > name4="Peter's PC"
> >
> > To check our procedures for processing the performance data, I also
> > modified the check_ping plugin. It now reports:
> >
> >          PING OK - Packet loss = 0%, RTA = 1.96 ms|"Packet loss"=0%
> > RTA="1.96 ms"
> >
> > The problem we are facing with this format is indeed the
> > interpretation by RRD (or in our case the script that's
> > feeding RRD), so we are open for suggestions. Your proposed
> > guideline at least seems to help us find the right direction.
> >
> > > From: Voon, Ton [mailto:Ton.Voon at egg.com]
> > >
> > > One of the features required for 1.4 is performance data. I would like
> > > to write up the guidelines for this, but wanted confirmation
> > > if this is the right way to go, so any comments would be appreciated.

Ton - thanks for kicking this off - sorry I was unable to respond
immediately.

> > > I think perf data should have/be:
> > >
> > > - short labels
> > > - generic and common labels across plugins if possible
> > > - comma separated, no spaces. Regex format: [a-z0-9]+=[0-9]?\.?[0-9]+
> > > - redundant data removed (eg, if check_disk returns pct and number
> > > (free), can calculate used bytes)
> > >
> > > My suggestion for labels are:
> > >
> > > Name ; Units ; printf format ; Details
> > > time ; seconds ; %.3f ; time taken to do a specific check (eg DNS
> > >   query, HTTP request, ping RTA)
> > > pct ; percent ; %.3f ; percentage (free rather than used if
> > >   applicable) (eg total disk, total swap, ping percent loss)
> > > number ; must be bytes if applicable ; %d ; a given number of things
> > >   (free rather than used if applicable) (eg processes, users, bytes
> > >   used such as total disk or total swap)
> > > numberf ; float ; %.3f ; a given number of things that may be
> > >   fractional (eg, load average, average bytes transmitted)
> > > counter ; a continuous counter (must be bytes if applicable) ; %d ;
> > >   a continuous counter (eg bytes transmitted on an interface)
> > > load1 ; load ; %.2f ; load average over 1 min
> > > load5 ; load ; %.2f ; load average over 5 min
> > > load15 ; load ; %.2f ; load average over 15 min
> > >
> > > Contentious points:
> > > - loadx. Not really keen on these, but they don't seem to fit into
> > > any other labels, unless we only return load5 and use numberf
> > > - taking free values rather than used. This is consistent with the
> > > output for check_disk and check_swap. Looking at graphs, I guess you
> > > want to see it nearer zero which is your definite limit, rather than
> > > continuously increasing
> > > - maybe numberf is not required, but we say that number could be
> > > fractional. I think this may be better as RRD doesn't care whether
> > > values are integers or not
> > > - too reductionist? Would you prefer labels that describe the measure?
> > > I think the labels should be generic and the plugin describes the
> > > context
> > >
> > > As an example, the patches submitted on SF for check_ping had perf
> > > labels of rta and loss, but I think these should be time and pct
> > > respectively. I think this makes it easier for something like RRD to
> > > work out what type of value it is to draw the graphs. Why the returned
> > > values are bad is then up to interpretation (and that is the key to any
> > > performance analysis!).

--
Karl


