[Nagiosplug-devel] RFC: Performance data guidelines

Hoogendijk, Peter Peter.Hoogendijk at atosorigin.com
Wed Jul 16 00:29:32 CEST 2003


Ton,

This certainly makes sense. I was thinking along the same lines and
concluded that I need two extra (optional) plugin options:

  1) An option to set the label: -L label (--label)
  2) An option to specify the format of the data: -P printf (--printf)

This solves the problem of the RRD labels. It also proves you are right
with your proposal to do the translations at the plugin, as this is also
the place where I have to configure the perfmon counter to be checked
(for this discussion I'll stick to the Microsoft Windows Perfmon
example). As a result, the perfmon plugin would take the following
options:

  -f filename (--filename)
  -C counter (--counter)
  -S scanf (--scanf)
  -L label (--label)
  -P printf (--printf)
  -w warning threshold (--warning)
  -c critical threshold (--critical)

The resulting command to perform the check would be:

  ./check_perfmon -f /var/log/perfmon/hostname \
    -C "\System\System Up Time" -S "%l" -L "SystemUpTime" -P "%ls"

The filename, as specified with the -f option, points to the file that
contains a list of Microsoft Windows Perfmon counters and their values
for the host being checked. This file is generated using a third-party
product, running as a service on the Microsoft Windows servers.
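
As a rough illustration of how the -C, -S, -L and -P options could fit together, here is a minimal Python sketch. It is not the actual check_perfmon code: the function name format_perfdata, the in-memory dictionary standing in for the counter file, the regular expression standing in for the scanf pattern, and the "%ds" format string are all assumptions made for the example.

```python
import re


def format_perfdata(counters, counter, label, printf_fmt):
    """Render one looked-up counter as a label=value perfdata item."""
    raw = counters[counter]                # raw text as collected, e.g. "15693 sec"
    match = re.match(r"\s*(-?\d+)", raw)   # stand-in for a scanf-style integer scan (-S)
    if match is None:
        raise ValueError("no integer found in %r" % raw)
    value = int(match.group(1))
    return "%s=%s" % (label, printf_fmt % value)   # -L label, -P printf format


# Hypothetical in-memory stand-in for the counter file named by -f.
counters = {r"\System\System Up Time": "15693 sec"}
print(format_perfdata(counters, r"\System\System Up Time", "SystemUpTime", "%ds"))
```

With the sample data this prints SystemUpTime=15693s. The layout of the real counter file is not described in this message, so reading it is left out of the sketch.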

The option names I used are open to discussion, but the principle
solves the problems being discussed. It also leaves the format of the
perfdata free to be adapted to the program that will process this data.

This leaves me with the specification of the thresholds. The developer
guidelines are mostly clear, but just to make sure: how do I specify a
warning below 10 and a critical above 45? For counters having a known
range, this is clear, but what do I do with a signed counter value, when
I don't know the possible minimum and maximum values?

Peter.

-----Original Message-----
From: Voon, Ton [mailto:Ton.Voon at egg.com] 
Sent: Tuesday, July 15, 2003 15:53
To: Hoogendijk, Peter; Karl DeBisschop
Cc: NagiosPlug Devel
Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines


Peter,

Firstly, I just want to say thank you for your contribution. This is a
fascinating thread. I would much rather have this discussion now than
have it raised as a design problem afterwards!

Good point about the two different types of plugins. I think we are
starting to nail down "homegrown" plugins, so I think that will be
finalised soon.

Regarding plugins through indirect checks, I think there has to be a
level of translation - it's just trying to work out where.

To get RRD graphs (for homegrown or indirect plugins), I think there are
4 generic steps:
1) Perf data is returned by the plugin
2) The data is stored (in a db or file)
3) The perf data is extracted into an RRD
4) The graph is drawn

Given that indirect plugins return their performance metrics in
different formats, there needs to be a translation at some point. If the
plugins just return whatever the lookup returns, then the
translation needs to happen at step 2 or 3. The advantage is that the
code is held only once (instead of in both check_snmp and check_nt). The
disadvantage is that you will not get useful data such as what the
thresholds were.

I propose that the translation happen at the plugin - step 1. So, from
your example, 

'\System\System Up Time'='15693 sec'

is returned as 

'\System\System Up Time'=15693s

(Or whatever we decide the format of the performance data will
eventually be.)
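
The translation proposed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name translate and the UNIT_MAP table are hypothetical, and the only unit mapping taken from this thread is "sec" to "s".

```python
import re

# Assumed unit spellings; only "sec" -> "s" comes from the example in this thread.
UNIT_MAP = {"sec": "s"}


def translate(pair):
    """Turn 'label'='value unit' into 'label'=value followed by a short unit."""
    match = re.fullmatch(r"('(?:[^']|'')*')='(-?\d+(?:\.\d+)?)\s*(\w*)'", pair)
    if match is None:
        raise ValueError("unrecognised pair: %r" % pair)
    label, number, unit = match.groups()
    return "%s=%s%s" % (label, number, UNIT_MAP.get(unit, unit))


print(translate(r"'\System\System Up Time'='15693 sec'"))
```

For the example pair this prints '\System\System Up Time'=15693s, leaving the label untouched and unquoting only the value.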

Re the labels, looking at RRD's manual, it says labels must be between
1-19 chars in the class [a-zA-Z0-9_], which is going to make your
example very difficult. I say RRD has the limitation, so keep it like the
example and let step 3 handle the conversion for RRD (if a different
graphing program was used, there may not be the same limitation).
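
One way step 3 could do that conversion is sketched below in Python. The function name rrd_name and the choice to replace each run of disallowed characters with a single underscore are assumptions; only the 1-19 character limit and the [a-zA-Z0-9_] class come from the paragraph above.

```python
import re


def rrd_name(label, max_len=19):
    """Map an arbitrary perfdata label onto a 1-19 char [a-zA-Z0-9_] name."""
    # Collapse every run of disallowed characters into one underscore.
    name = re.sub(r"[^a-zA-Z0-9_]+", "_", label).strip("_")
    name = name[:max_len]
    if not name:
        raise ValueError("label %r has no usable characters" % label)
    return name


print(rrd_name(r"\System\System Up Time"))
```

For the Perfmon example this prints System_System_Up_Ti. Two different labels can collapse to the same name after truncation, so a real converter would probably need some way to keep names unique.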

Does this make sense? Any further comments?

Ton

> -----Original Message-----
> From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com]
> Sent: Monday, July 14, 2003 2:44 PM
> To: Karl DeBisschop; Voon, Ton
> Cc: NagiosPlug Devel
> Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines
> 
> 
> Karl, Ton,
> 
> I have been thinking about this during the weekend. In my opinion
> there are two types of plugins:
> 
>   1) Plugins that perform a specific (direct) check and return a 
> specific answer. In this case you (the author of the plugin) can make 
> an exact choice about both the plugin output and the performance data
> format.
> 
>   2) Plugins that perform a lookup (indirect) check and return (an
> interpretation of) the result. This is the case with plugins checking 
> SNMP or the Microsoft Windows Perfmon data.
> 
> This second type of plugin is causing the problems. Karl remarks that 
> 'spaces in attributes seem avoidable', but looking at the results 
> returned by Microsoft Windows Perfmon, we see a lot of objects,
> counters and results with spaces:
> 
>   '\System\System Up Time'='15693 sec'
> 
> We could decide to remove the spaces, or replace them by underscores, 
> but this makes the whole process less transparent. As a result, I 
> prefer a set of guidelines that allows for strings containing any 
> characters. To summarize the questions I came up with while defining 
> the output/perfdata format for a lookup (indirect) plugin:
> 
> - Do I use single quotes or double quotes?
> - How do I escape this character if it exists in a string?
> - Do I use spaces or commas to separate the data?
> 
> I myself prefer to use single quotes as used in MySQL queries: put
> single quotes around the string and double any single quotes in the
> string itself. For the separating character I have no preference: I
> just used the character as proposed in the 'Performance Data' chapter
> of the Nagios documentation.
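
The quoting rule described above (wrap in single quotes, double any embedded single quote) can be sketched in Python as follows. The helper names quote, unquote and split_items are hypothetical, and splitting on spaces is just one of the separators under discussion.

```python
def quote(text):
    """Wrap text in single quotes, doubling any single quote inside it."""
    return "'" + text.replace("'", "''") + "'"


def unquote(text):
    """Reverse quote(); text without surrounding quotes is returned unchanged."""
    if len(text) >= 2 and text[0] == "'" and text[-1] == "'":
        return text[1:-1].replace("''", "'")
    return text


def split_items(perfdata):
    """Split on spaces that fall outside single-quoted strings."""
    items, current, in_quote = [], "", False
    for ch in perfdata:
        if ch == "'":
            in_quote = not in_quote   # a doubled quote toggles twice, so state is kept
        if ch == " " and not in_quote:
            if current:
                items.append(current)
            current = ""
        else:
            current += ch
    if current:
        items.append(current)
    return items


print(quote("Peter's PC"))
print(split_items("'name 2'=value2 name4='Peter''s PC'"))
```

Because an embedded quote is always doubled, the splitter never needs to look ahead: it only tracks whether it is currently inside a quoted string.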
> 
> Peter.
> 
> P.S. If the strings themselves contain spaces, but don't contain '=' 
> characters or separator characters, the quotes aren't even needed!
> 
> 
> -----Original Message-----
> From: Karl DeBisschop [mailto:karl at debisschop.net]
> Sent: Friday, July 11, 2003 06:38
> To: Voon, Ton
> Cc: Hoogendijk, Peter; NagiosPlug Devel
> Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines
> 
> 
> On Thu, 2003-07-10 at 10:30, Voon, Ton wrote:
> 
> > I like the idea of quoting the attributes/values, but I don't think
> > they will be necessary if we get the standard attributes and their 
> > values right.
> 
> I agree somewhat - spaces in attributes especially seem avoidable.
> 
> > I think perfdata should be space separated data (just to save
> > processing), but I'm happy to take a consensus. Comma separated may 
> > make it a bit easier to parse visually. Any other opinions?
> 
> While spaces in attributes seem avoidable, I am less sure about spaces
> in values. I could imagine a plugin where the perf data was a string
> from an SNMP OID, where we would not really have control over what was
> in that string.
> 
> > Based on my guidelines, an example output of check_ping would be:
> > 
> > PING OK - Packet loss = 0%, RTA = 1.96 ms|pct=0 time=1.96
> 
> Why do we not allow the plugin perf data to return units like:
> 
>   PING OK - Packet loss = 0%, RTA = 1.96 ms|loss=0%,time=1.96 ms
> 
> I only ask because there are implementations of ping that can return
> 'us' instead of 'ms' - I've always felt things are less likely to get
> confused if you keep units explicit (just ask NASA and the Mars lander
> team).
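
To illustrate why explicit units help, here is a small Python sketch that normalises a time value to seconds whichever unit the ping implementation reported. The function name to_seconds and the set of accepted unit spellings are assumptions for the example, not part of any agreed format.

```python
import re

# Seconds per unit; the set of accepted spellings is an assumption.
SECONDS_PER_UNIT = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def to_seconds(value):
    """Convert a value such as '1.96 ms' or '1960us' to seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*", value)
    if match is None:
        raise ValueError("no unit in %r" % value)
    number, unit = match.groups()
    if unit not in SECONDS_PER_UNIT:
        raise ValueError("unknown unit %r" % unit)
    return float(number) * SECONDS_PER_UNIT[unit]


print(to_seconds("1.96 ms"))
print(to_seconds("1960 us"))
```

Both calls yield approximately the same number of seconds, which a consumer could not work out from a bare time=1.96 or time=1960.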
> 
> > Three things that spring to mind:
> > - it's a bit shorter!
> 
> Short is good. But not so good that reliability, accuracy, or
> reasonable clarity should be sacrificed.
> 
> > - time means something different from check_http, check_tcp, etc.
> > Those mean "time taken to do a check". For check_ping, it would mean
> > average time for a packet
> 
> Hence the idea of allowing units
> 
> > - pct is at 0, which is a "good" result (0% packet loss). However -
> > according to my proposal - check_disk would return pct=5 for 5% free
> > on total disk, which, as it gets closer to 0%, would be "bad". Maybe
> > it should be reversed, so pct=100% to mean no packet loss - should 0%
> > always be considered the worst case? This may not be easy for "number"
> > attributes.
> 
> If you allow units, check_disk could return either
> 
>   DISK OK [6390 MB (42%) free on /]|free=42%
> 
> or
> 
>   DISK OK [6390 MB (42%) free on /]|used=58%
> 
> And I would suggest the latter.
> 
> > As you can see, it is hard to standardise on what the values actually
> > tell you. This is what I meant by "Why the returned values are bad is
> > then up to interpretation (and that is the key to any performance
> > analysis!)". However, what the guidelines will do is allow the RRD
> > generation to happen more easily.
> >
> > > From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com]
> > >
> > > We are in the process of developing a plugin to check information
> > > collected by another data collection system. Based on the
> > > 'Performance Data' chapter in the Nagios documentation, we decided
> > > on comma-separated 'name=value' pairs. As we want to be able to
> > > transparently support the names and values used by the other system,
> > > both the name and the value part can optionally be quoted (with
> > > either single or double quotes). The result is:
> > > 
> > > 	Plugin Output|name1=value1, 'name 2'=value2, name3='11"', name4="Peter's PC"
> > > 
> > > To check our procedures for processing the performance data, I also
> > > modified the check_ping plugin. It now reports:
> > > 
> > > 	PING OK - Packet loss = 0%, RTA = 1.96 ms|"Packet loss"=0% RTA="1.96 ms"
> > > 
> > > The problem we are facing with this format is indeed the 
> > > interpretation by RRD (or in our case the script that's feeding 
> > > RRD), so we are open to suggestions. Your proposed guideline at 
> > > least seems to help us find the right direction.
> > >
> > > > From: Voon, Ton [mailto:Ton.Voon at egg.com]
> > > > 
> > > > One of the features required for 1.4 is performance data. I would
> > > > like to write up the guidelines for this, but wanted confirmation
> > > > if this is the right way to go, so any comments would be
> > > > appreciated.
> 
> Ton - thanks for kicking this off - sorry I was unable to respond 
> immediately.
> 
> > > > I think perf data should have/be:
> > > > 
> > > > - short labels
> > > > - generic and common labels across plugins if possible
> > > > - comma separated, no spaces. Regex format:
> > > > [a-z0-9]+=[0-9]?\.?[0-9]+
> > > > - redundant data removed (eg, if check_disk returns pct and number
> > > > (free), can calculate used bytes)
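
A quick way to try the proposed regex is to apply it to each comma-separated item, as in the Python sketch below. The function name is_valid and the decision to require a full match per item are assumptions about how the pattern was meant to be used.

```python
import re

# The pattern from the proposal, applied to each comma-separated item.
ITEM = re.compile(r"[a-z0-9]+=[0-9]?\.?[0-9]+")


def is_valid(perfdata):
    """True if every comma-separated item fully matches the proposed pattern."""
    return all(ITEM.fullmatch(item) is not None for item in perfdata.split(","))


print(is_valid("pct=0,time=1.96"))   # True
print(is_valid("pct=0 time=1.96"))   # False: space instead of comma
print(is_valid("time=12.5"))         # False: two digits before the point
```

Read literally and matched in full, the pattern allows at most one digit before a decimal point, so a value such as 12.5 is rejected. That looks unintended, and a looser integer part (for example [0-9]* in place of [0-9]?) may be closer to what was meant, though that is a guess and not something stated in the thread.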
> > > > 
> > > > My suggestions for labels are:
> > > > 
> > > > Name ; Units ; printf format ; Details
> > > > time ; seconds ; %.3f ; time taken to do a specific check (eg DNS query, HTTP request, ping RTA)
> > > > pct ; percent ; %.3f ; percentage (free rather than used if applicable) (eg total disk, total swap, ping percent loss)
> > > > number ; must be bytes if applicable ; %d ; a given number of things (free rather than used if applicable) (eg processes, users, bytes used such as total disk or total swap)
> > > > numberf ; float ; %.3f ; a given number of things that may be fractional (eg, load average, average bytes transmitted)
> > > > counter ; a continuous counter (must be bytes if applicable) ; %d ; a continuous counter (eg bytes transmitted on an interface)
> > > > load1 ; load ; %.2f ; load average over 1 min
> > > > load5 ; load ; %.2f ; load average over 5 min
> > > > load15 ; load ; %.2f ; load average over 15 min
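
The label table above maps naturally onto a lookup of printf formats. The Python sketch below shows one way a plugin could use it. The names FORMATS and perfdata are hypothetical, and the comma separator follows the proposal quoted here, not any final guideline.

```python
# printf formats per label, copied from the table above.
FORMATS = {
    "time": "%.3f", "pct": "%.3f", "number": "%d", "numberf": "%.3f",
    "counter": "%d", "load1": "%.2f", "load5": "%.2f", "load15": "%.2f",
}


def perfdata(values):
    """Render (label, value) pairs as comma-separated label=value items."""
    return ",".join("%s=%s" % (label, FORMATS[label] % value) for label, value in values)


print(perfdata([("pct", 0), ("time", 1.96)]))
```

This prints pct=0.000,time=1.960. An unknown label raises KeyError, which is one way of enforcing the "generic and common labels" idea.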
> > > > 
> > > > Contentious points:
> > > > - loadx. Not really keen on these, but they don't seem to fit into
> > > > any other labels, unless we only return load5 and use numberf
> > > > - taking free values rather than used. This is consistent with the
> > > > output for check_disk and check_swap. Looking at graphs, I guess you
> > > > want to see it nearer zero which is your definite limit, rather than
> > > > continuously increasing
> > > > - maybe numberf is not required, but we say that number could be
> > > > fractional. I think this may be better as RRD doesn't care
> > > > whether values are integers or not
> > > > - too reductionist? Would you prefer labels that describe
> > > > the measure?
> > > > I think the labels should be generic and the plugin describes the
> > > > context
> > > > 
> > > > As an example, the patches submitted on SF for check_ping had perf
> > > > labels of rta and loss, but I think these should be time and pct
> > > > respectively. I think this makes it easier for something like RRD
> > > > to work out what type of value it is to draw the graphs. Why the
> > > > returned values are bad is then up to interpretation (and that is
> > > > the key to any performance analysis!).
> 
> --
> Karl
> 

