[Nagiosplug-devel] RFC: Performance data guidelines

Voon, Ton Ton.Voon at egg.com
Tue Jul 15 09:56:19 CEST 2003


Peter,

Firstly, I just want to say thank you for your contribution. This is a
fascinating thread. I would much rather have this discussion now than have
it raised as a design problem afterwards!

Good point about the two different types of plugins. I think we are starting
to nail down "homegrown" plugins, so I think that will be finalised soon.

Regarding plugins through indirect checks, I think there has to be a level
of translation - it's just trying to work out where.

To get RRD graphs (for homegrown or indirect plugins), I think there are 4
generic steps:
1) Perf data returned by the plugin
2) Data stored (in db or file)
3) Perf data extracted into an RRD
4) Graph drawn
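
As a rough illustration of the hand-off between steps 1 and 2, the sketch
below splits a plugin's output line at the first '|' and breaks the perf data
into label/value pairs. The pair syntax is an assumption (space-separated
label=value, with optional single-quoted labels in which an embedded single
quote is doubled), since the final format is still under discussion. The
names split_output and parse_perfdata are made up for the example.

```python
import re

# ASSUMED pair syntax: label=value, separated by whitespace. A label may be
# single-quoted; a literal single quote inside a quoted label is doubled.
PAIR_RE = re.compile(r"\s*(?:'((?:[^']|'')*)'|([^\s='|]+))=(\S*)")


def split_output(line):
    """Split plugin output into (human-readable text, perf data) at the first '|'."""
    text, _, perf = line.partition("|")
    return text.strip(), perf.strip()


def parse_perfdata(perf):
    """Return a list of (label, raw_value) tuples from a perf data string."""
    perf = perf.strip()
    pairs = []
    pos = 0
    while pos < len(perf):
        match = PAIR_RE.match(perf, pos)
        if match is None:
            raise ValueError(f"cannot parse perf data at offset {pos}: {perf[pos:]!r}")
        quoted, bare, value = match.groups()
        # Undo the quote doubling for quoted labels; bare labels are used as-is.
        label = quoted.replace("''", "'") if quoted is not None else bare
        pairs.append((label, value))
        pos = match.end()
    return pairs
```

Values are kept as raw strings here, so any interpretation of units is left
to whichever step ends up owning the translation.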

Given that indirect plugins return their performance metrics in different
formats, there needs to be a translation at some point. If the plugins just
return whatever the lookup gives them, then the translation needs to
happen at step 2 or 3. The advantage is that the translation code is held
only once (instead of in both check_snmp and check_nt). The disadvantage is
that you will not get useful data like what the thresholds were.

I propose that the translation happen at the plugin - step 1. So, from your
example, 

'\System\System Up Time'='15693 sec'

is returned as 

'\System\System Up Time'=15693s

(Or whatever we decide the format of the performance data will eventually
be.)
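
A minimal sketch of what that translation at step 1 could look like, assuming
the lookup returns a number followed by an optional unit word. The unit table,
the quoting rule (single quotes, with embedded single quotes doubled) and the
function names are all hypothetical; they only illustrate the idea and are not
an agreed format.

```python
import re

# HYPOTHETICAL table mapping unit spellings a lookup might return to a short
# suffix. A real plugin would need its own, agreed list.
UNIT_SUFFIXES = {
    "sec": "s", "secs": "s", "seconds": "s",
    "ms": "ms", "us": "us", "%": "%",
}

# A number (optionally negative, optionally fractional) followed by an
# optional unit token, with optional whitespace in between.
VALUE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(\S*)\s*$")


def normalise_value(raw):
    """Turn e.g. '15693 sec' into '15693s'; reject values we cannot interpret."""
    match = VALUE_RE.match(raw)
    if match is None:
        raise ValueError(f"not a numeric value: {raw!r}")
    number, unit = match.groups()
    if unit == "":
        return number
    try:
        return number + UNIT_SUFFIXES[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit {unit!r} in {raw!r}") from None


def format_pair(label, raw):
    """Build one label=value pair, quoting the label only when it needs it."""
    if any(ch in label for ch in " ='"):
        label = "'" + label.replace("'", "''") + "'"
    return f"{label}={normalise_value(raw)}"
```

Called as format_pair(r"\System\System Up Time", "15693 sec"), this produces
the second form shown above. Raising on an unknown unit is a design choice:
it seems safer for the plugin to fail loudly than to pass an ambiguous value
downstream.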

Re the labels, looking at RRD's manual, it says labels must be 1 to 19
chars long, in the class [a-zA-Z0-9_], which is going to make your example
very difficult. I say RRD has the limitation, so keep the label as in the
example and let step 3 handle the conversion for RRD (if a different
graphing program was used, there may not be the same limitation).
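
For what it's worth, the conversion at step 3 could be as small as the sketch
below. It assumes only the RRD constraint quoted above (1 to 19 characters
from [a-zA-Z0-9_]). The function name, the "ds" fallback and the numeric
suffix used to avoid clashes after truncation are assumptions made for the
example.

```python
import re

RRD_MAX_LEN = 19  # RRD data source names: 1 to 19 chars from [a-zA-Z0-9_]


def rrd_ds_name(label, taken=()):
    """Map an arbitrary perf data label to a name RRD will accept.

    `taken` is a collection of names already in use, so that two long labels
    that truncate to the same prefix do not collide.
    """
    # Collapse every run of disallowed characters into a single underscore.
    base = re.sub(r"[^a-zA-Z0-9_]+", "_", label).strip("_") or "ds"
    name = base[:RRD_MAX_LEN]
    counter = 1
    while name in taken:
        # Make room for a numeric suffix while staying within the length limit.
        suffix = f"_{counter}"
        name = base[:RRD_MAX_LEN - len(suffix)] + suffix
        counter += 1
    return name
```

The mapping is lossy, so whatever does step 3 would need to remember which
original label each RRD name came from.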

Does this make sense? Any further comments?

Ton

> -----Original Message-----
> From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com] 
> Sent: Monday, July 14, 2003 2:44 PM
> To: Karl DeBisschop; Voon, Ton
> Cc: NagiosPlug Devel
> Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines
> 
> 
> Karl, Ton,
> 
> I have been thinking about this during the weekend. In my opinion there
> are two types of plugins:
> 
>   1) Plugins that perform a specific (direct) check and return a
> specific answer. In this case you (the author of the plugin) can make an
> exact choice about both the plugin output and the performance data
> format.
> 
>   2) Plugins that perform a lookup (indirect) check and return (an
> interpretation of) the result. This is the case with plugins checking
> SNMP or the Microsoft Windows Perfmon data.
> 
> This second type of plugin is causing the problems. Karl remarks that
> 'spaces in attributes seem avoidable', but looking at the results
> returned by Microsoft Windows Perfmon, we see a lot of objects, counters
> and results with spaces:
> 
>   '\System\System Up Time'='15693 sec'
> 
> We could decide to remove the spaces, or replace them by underscores,
> but this makes the whole process less transparent. As a result, I prefer
> a set of guidelines that allows for strings containing any characters.
> To summarize the questions I came up with while defining the
> output/perfdata format for a lookup (indirect) plugin:
> 
> - Do I use single quotes or double quotes?
> - How do I escape this character if it exists in a string?
> - Do I use spaces or commas to separate the data?
> 
> I myself prefer to use single quotes as used in MySQL queries: put
> single quotes around the string and double any single quotes in the
> string itself. For the separating character I have no preference: I just
> used the character as proposed in the 'Performance Data' chapter of the
> Nagios documentation.
> 
> Peter.
> 
> P.S. If the strings themselves contain spaces, but don't contain '='
> characters or separator characters, the quotes aren't even needed!
> 
> 
> -----Original Message-----
> From: Karl DeBisschop [mailto:karl at debisschop.net] 
> Sent: vrijdag 11 juli 2003 06:38
> To: Voon, Ton
> Cc: Hoogendijk, Peter; NagiosPlug Devel
> Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines
> 
> 
> On Thu, 2003-07-10 at 10:30, Voon, Ton wrote:
> 
> > I like the idea of quoting the attributes/values, but I don't think 
> > they will be necessary if we get the standard attributes and their 
> > values right.
> 
> I agree somewhat - spaces in attributes especially seem avoidable.
> 
> > I think perfdata should be space separated data (just to save 
> > processing), but I'm happy to take a consensus. Comma separated may 
> > make it a bit easier to parse visually. Any other opinions?
> 
> While spaces in attributes seem avoidable, I am less sure about spaces
> in values. I could imagine a plugin where the perf data was a string
> from an SNMP OID, where we would not really have control over what was
> in that string.
> 
> > Based on my guidelines, an example output of check_ping would be:
> > 
> > PING OK - Packet loss = 0%, RTA = 1.96 ms|pct=0 time=1.96
> 
> Why do we not allow the plugin perf data to return units like:
> 
>   PING OK - Packet loss = 0%, RTA = 1.96 ms|loss=0%,time=1.96 ms
> 
> I only ask because there are implementations of ping that can return
> 'us' instead of 'ms' - I've always felt things are less likely to get
> confused if you keep units explicit (just ask NASA and the Mars lander
> team).
> 
> > Three things that spring to mind:
> > - it's a bit shorter!
> 
> Short is good. But not so good that reliability, accuracy, or reasonable
> clarity should be sacrificed.
> 
> > - time means something different from check_http, check_tcp, etc.
> > Those mean "time taken to do a check". For check_ping, it would mean
> > average time for a packet
> 
> Hence the idea of allowing units
> 
> > - pct is at 0, which is a "good" result (0% packet loss). However -
> > according to my proposal - check_disk would return pct=5 for 5% free
> > on total disk, which, as it gets closer to 0%, would be "bad". Maybe
> > it should be reversed, so pct=100% to mean no packet loss - should 0%
> > always be considered the worst case? This may not be easy for "number"
> > attributes.
> 
> If you allow units, check_disk could return either 
> 
>   DISK OK [6390 MB (42%) free on /]|free=42%
> 
> or
> 
>   DISK OK [6390 MB (42%) free on /]|used=58%
> 
> And I would suggest the latter.
> 
> > As you can see, it is hard to standardise on what the values actually
> > tell you. This is what I meant by "Why the returned values are bad is
> > then up to interpretation (and that is the key to any performance
> > analysis!)". However, what the guidelines will do is allow the RRD
> > generation to happen more easily.
> >
> > > From: Hoogendijk, Peter [mailto:Peter.Hoogendijk at atosorigin.com]
> > >
> > > We are in the process of developing a plugin to check information
> > > collected by another data collection system. Based on the
> > > 'Performance Data' chapter in the Nagios documentation, we decided
> > > on comma-separated 'name=value' pairs. As we want to be able to
> > > transparently support the names and values used by the other system,
> > > both the name and the value part can optionally be quoted (with
> > > either single or double quotes). The result is:
> > > 
> > > 	Plugin Output|name1=value1, 'name 2'=value2, name3='11"', 
> > > name4="Peter's PC"
> > > 
> > > To check our procedures for processing the performance data, I also
> > > modified the check_ping plugin. It now reports:
> > > 
> > > 	PING OK - Packet loss = 0%, RTA = 1.96 ms|"Packet loss"=0% 
> > > RTA="1.96 ms"
> > > 
> > > The problem we are facing with this format is indeed the
> > > interpretation by RRD (or in our case the script that's
> > > feeding RRD), so we are open to suggestions. Your proposed
> > > guideline at least seems to help us find the right direction.
> > >
> > > > From: Voon, Ton [mailto:Ton.Voon at egg.com]
> > > > 
> > > > One of the features required for 1.4 is performance data. I would
> > > > like to write up the guidelines for this, but wanted confirmation
> > > > if this is the right way to go, so any comments would be
> > > > appreciated.
> 
> Ton - thanks for kicking this off - sorry I was unable to respond
> immediately.
> 
> > > > I think perf data should have/be:
> > > > 
> > > > - short labels
> > > > - generic and common labels across plugins if possible
> > > > - comma separated, no spaces. Regex format: 
> > > > [a-z0-9]+=[0-9]?\.?[0-9]+
> > > > - redundant data removed (eg, if check_disk returns pct and number
> > > > (free), can calculate used bytes)
> > > > 
> > > > My suggestion for labels are:
> > > > 
> > > > Name ; Units ; printf format ; Details
> > > > time ; seconds ; %.3f ; time taken to do a specific check (eg DNS query, HTTP request, ping RTA)
> > > > pct ; percent ; %.3f ; percentage (free rather than used if applicable) (eg total disk, total swap, ping percent loss)
> > > > number ; must be bytes if applicable ; %d ; a given number of things (free rather than used if applicable) (eg processes, users, bytes used such as total disk or total swap)
> > > > numberf ; float ; %.3f ; a given number of things that may be fractional (eg, load average, average bytes transmitted)
> > > > counter ; a continuous counter (must be bytes if applicable) ; %d ; a continuous counter (eg bytes transmitted on an interface)
> > > > load1 ; load ; %.2f ; load average over 1 min
> > > > load5 ; load ; %.2f ; load average over 5 min
> > > > load15 ; load ; %.2f ; load average over 15 min
> > > > 
> > > > Contentious points:
> > > > - loadx. Not really keen on these, but don't seem to fit into any
> > > > other labels, unless we only return load5 and use numberf
> > > > - taking free values rather than used. This is consistent with the
> > > > output for check_disk and check_swap. Looking at graphs, I guess you
> > > > want to see it nearer zero which is your definite limit, rather than
> > > > continuously increasing
> > > > - maybe numberf is not required, but we say that number could be
> > > > fractional. I think this may be better as RRD doesn't care whether
> > > > values are integers or not
> > > > - too reductionist? Would you prefer labels that describe the
> > > > measure? I think the labels should be generic and the plugin
> > > > describes the context
> > > > 
> > > > As an example, the patches submitted on SF for check_ping had perf
> > > > labels of rta and loss, but I think these should be time and pct
> > > > respectively. I think this makes it easier for something like RRD
> > > > to work out what type of value it is to draw the graphs. Why the
> > > > returned values are bad is then up to interpretation (and that is
> > > > the key to any performance analysis!).
> 
> --
> Karl
> 

