[Nagiosplug-help] Usage of check_procs

Ralph.Grothe at itdz-berlin.de Ralph.Grothe at itdz-berlin.de
Tue Sep 11 17:03:00 CEST 2007


Dear Nagiosplug Users/Hackers,

I am currently puzzled about the intended correct usage of the
check_procs plugin.
The help screen of the plugin isn't all that helpful.
(I haven't yet looked at the implementation in the code)

Actually, I need to monitor a process whose command name is known
beforehand, as it is (due to an unfixed bug in the employed release?)
prone to hogging an entire CPU at 100% (the platform is an HP9000
multi-CPU server running HP-UX 11.11).

From the plugin's help screen I started like this:

$ /usr/local/nagios/libexec/check_procs -m CPU -w 5:10 -c 11:  
CPU CRITICAL: 339 crit, 0 warn out of 339 processes

But fetching from the proc table I get quite different results.
(OK, I acknowledge that check_procs might use another syscall,
maybe pstat(), but the differences shouldn't be that blatant.)

$ UNIX95= ps -e -o pid,ppid,uid,time,state,cpu,pcpu,comm |
    awk 'NR==1||$7>1' | sort -n -k 7,7
  PID  PPID        UID     TIME S  C  %CPU COMMAND
28985     1          0    25:22 S  0  1.37 saposcol
27337 27113        203 04:54:28 S  8  1.64 oracleZ01
27336 27113        203 02:57:16 S  8  2.55 oracleZ01
 6953     1        203    01:27 S  1  2.78 oracleZ01
17430     1        203    07:40 S 29  3.40 oracleZ01
14016     1        203    26:51 S  0  6.99 oracleZ01
29566     1        203    11:34 S 66 12.46 oracleZ01
27335 27113        203 14:12:13 R 67 22.81 oracleZ01
27334 27113        203 14:21:29 S 64 22.93 oracleZ01
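
To make the comparison with the plugin easier, a listing like the one
above can be reduced to one verdict per process. What follows is only a
minimal sketch and not how check_procs works internally: the function
name classify_pcpu is hypothetical, the thresholds are passed in by the
caller, and it assumes ps-style rows with %CPU in column 7 and the
command name in column 8, as in the listing above.

```shell
# Hypothetical helper, not part of check_procs.
# Reads ps-style rows on stdin (header line first) and prints one line
# for every process whose %CPU (column 7) reaches a threshold.
# Usage: classify_pcpu WARN CRIT
classify_pcpu() {
    awk -v warn="$1" -v crit="$2" '
        NR == 1        { next }   # skip the header line
        $7 + 0 >= crit { print "CRITICAL", $1, $8, $7; next }
        $7 + 0 >= warn { print "WARNING", $1, $8, $7; next }
    '
}
```

It could be fed directly from ps, for instance with
`UNIX95= ps -e -o pid,ppid,uid,time,state,cpu,pcpu,comm | classify_pcpu 5 11`,
where 5 and 11 merely mirror the warn and crit values tried with the plugin.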


Maybe I forgot the % units specifier?
But adding it makes no difference:

$ /usr/local/nagios/libexec/check_procs -m CPU -w 5:10% -c 11:%

CPU CRITICAL: 337 crit, 0 warn out of 337 processes

Well, at least the proc count seems right ;-)

$ UNIX95= ps -e -o pid=|wc -w
328

Then I tried the ominous -P switch.
But I cannot fathom why the (mandatory) warn and crit ranges
are then still necessary.
Anyway, no difference:

$ /usr/local/nagios/libexec/check_procs -m CPU -P 1 -w 5:10% -c 11:%
CPU OK: 0 processes with PCPU >= 1.00

But what I really want to achieve is to monitor this beast:

$ UNIX95= ps -C dmisp -o pid,ppid,uid,time,state,cpu,pcpu,comm

  PID  PPID        UID     TIME S  C  %CPU COMMAND
 1347     1          0    02:17 R  0  0.21 dmisp

As can be seen, it is behaving right now, but it will eventually
grab 100%.
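
The check being aimed at here (one named command, a warning and a
critical CPU threshold, a Nagios-style exit status) can be sketched
directly on top of the ps output. This is a hedged stand-in and not the
check_procs implementation: the function name check_cmd_cpu and the
message wording are assumptions, and it expects the same column layout
as the ps listing above. The exit codes 0, 1, 2 and 3 follow the usual
Nagios plugin convention for OK, WARNING, CRITICAL and UNKNOWN.

```shell
# Hypothetical stand-in, not the check_procs implementation.
# Reads ps-style rows on stdin (header first, %CPU in column 7,
# command name in column 8) and reports on the busiest process
# whose command name equals NAME.
# Usage: check_cmd_cpu NAME WARN CRIT
check_cmd_cpu() {
    awk -v name="$1" -v warn="$2" -v crit="$3" '
        NR > 1 && $8 == name {
            seen = 1
            if ($7 + 0 > max) max = $7 + 0   # keep the highest %CPU seen
        }
        END {
            if (!seen)       { print "CPU UNKNOWN: no process named " name; exit 3 }
            if (max >= crit) { printf "CPU CRITICAL: %s at %.2f%%\n", name, max; exit 2 }
            if (max >= warn) { printf "CPU WARNING: %s at %.2f%%\n", name, max; exit 1 }
            printf "CPU OK: %s at %.2f%%\n", name, max
        }
    '
}
```

A possible invocation would be
`UNIX95= ps -e -o pid,ppid,uid,time,state,cpu,pcpu,comm | check_cmd_cpu dmisp 5 11`,
with 5 and 11 again taken from the thresholds used in the attempts here.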

So I tried this 

$ /usr/local/nagios/libexec/check_procs -m CPU -C dmisp -w 5:10% -c 11:%
CPU CRITICAL: 1 crit, 0 warn out of 1 process with command name 'dmisp'

Why critical when it is still down at 0.21%?

This also makes no sense

$ /usr/local/nagios/libexec/check_procs -m CPU -C dmisp -P 0.2
CPU OK: 0 processes with command name 'dmisp', PCPU >= 0.20


Could anyone demystify check_procs for me and show its correct
usage to catch the CPU hog?

Regards

Ralph











