[Nagiosplug-devel] [ nagiosplug-Patches-2794120 ] Detect a process that enters an infinite loop.

SourceForge.net noreply at sourceforge.net
Wed May 20 00:25:05 CEST 2009


Patches item #2794120, was opened at 2009-05-19 23:25
Message generated for change (Tracker Item Submitted) made by addw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=397599&aid=2794120&group_id=29880

Category: Enhancement
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: alain williams (addw)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect a process that enters an infinite loop.

Initial Comment:
Motivation
**********

Detect a process that enters an infinite loop.
With today's multi-core CPUs this might go unnoticed for some time, since the other cores keep the machine responsive.

Discussion
**********

Taking a snapshot and detecting that a process is using 100% of a CPU can generate
false positives, e.g. catching an overnight batch job that is CPU intensive for a few minutes.
Using the Nagios config option 'max_check_attempts' may also generate false positives
if different processes eat CPU in succession.

The patch saves notes about 'interesting processes' in a status file that is checked and updated by
successive runs of check_procs. This allows detection of processes that exceed the
limits over a long time. It records the time at which each individual process started to exceed
a limit. Only processes that exceed a limit are deemed interesting and so have details
in the status file.

The start times for Warning and Critical are recorded separately. Start of Critical implies Warning.
Processes are identified by PID. If the PPID or program changes, it is assumed to be a
different process (i.e. any history is forgotten).
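
The bookkeeping described above can be sketched roughly as follows. This is an
illustrative Python model only, not the patch's code (check_procs itself is written in C).
The names Entry, update_state and evaluate are hypothetical, the state is held in a
dict rather than in the real status file (whose format is not described here), and
it is an assumption that a process dropping back below the warning threshold loses
its history.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Entry:
    """What is remembered about one 'interesting' process (hypothetical layout)."""
    ppid: int
    prog: str
    warn_since: float            # when the process started exceeding the warning limit
    crit_since: Optional[float]  # when it started exceeding the critical limit, if it has


def update_state(state: Dict[int, Entry],
                 procs: Iterable[Tuple[int, int, str, float]],
                 now: float, warn: float, crit: float) -> Dict[int, Entry]:
    """Return the new state after one run; procs yields (pid, ppid, prog, cpu_percent)."""
    new_state: Dict[int, Entry] = {}
    for pid, ppid, prog, cpu in procs:
        over_crit = cpu > crit
        over_warn = over_crit or cpu > warn   # start of Critical implies Warning
        if not over_warn:
            continue  # assumption: uninteresting processes are dropped from the state
        old = state.get(pid)
        if old is not None and (old.ppid != ppid or old.prog != prog):
            old = None  # same PID but different PPID or program: forget the history
        warn_since = old.warn_since if old is not None else now
        if over_crit:
            crit_since = old.crit_since if old is not None and old.crit_since is not None else now
        else:
            crit_since = None
        new_state[pid] = Entry(ppid, prog, warn_since, crit_since)
    return new_state


def evaluate(state: Dict[int, Entry], now: float, state_time_minutes: float) -> str:
    """Report only processes that have exceeded a limit for the whole time window."""
    window = state_time_minutes * 60
    status = "OK"
    for entry in state.values():
        if entry.crit_since is not None and now - entry.crit_since >= window:
            return "CRITICAL"
        if now - entry.warn_since >= window:
            status = "WARNING"
    return status
```

With a 15 minute window, a process seen at 99% CPU is reported only once a later run
finds it still over the limit 900 seconds after it was first recorded. A different
program appearing under the same PID starts again from zero.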


Configuration
*************

resource.cfg may contain:

        # Somewhere that plugins can write to:
        $USER8$=/var/log/nagios


commands.cfg may contain:

        # 'check_cpu_hog_procs' command definition
        # Args:
        # 1     A unique name, used for the state file name in $USER8$ (with .status appended)
        # 2     How many minutes an individual process needs to exceed a threshold before it is reported
        # 3     Warning threshold
        # 4     Critical threshold
        define command{
                command_name    check_cpu_hog_procs
                command_line    $USER1$/check_procs --state-file=$USER8$/$ARG1$.status --state-time=$ARG2$ -w $ARG3$ -c $ARG4$ -m CPU
                }

The config file for the machine may contain (note the setting of max_check_attempts):

        # Detect CPU hogs
        # A process needs to exceed the limit for 15 minutes to be reported.
        # 80% CPU for warning. 90% CPU for critical.
        define service{
                use                             local-service         ; Name of service template to use
                max_check_attempts              1                     ; Report any change of state immediately; check_cpu_hog_procs already handles the time window
                host_name                       mint.phcomp.co.uk.
                service_description             CPU Hog
                check_command                   check_cpu_hog_procs!cpu_hog!15!80!90
                }
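
To make the relationship between the check_command line and the command definition
concrete, here is a small Python sketch of how the arguments end up on the check_procs
command line. It is a simplified model of the macro substitution, not how Nagios itself
is implemented: expand_command is a hypothetical helper, escaping of '!' is ignored,
and $USER1$ is left unexpanded because its value is not given above.

```python
import re
from typing import Dict, Tuple


def expand_command(command_line: str, check_command: str,
                   user_macros: Dict[str, str]) -> Tuple[str, str]:
    """Split 'name!arg1!arg2...' and substitute $ARGn$ and known $USERn$ macros."""
    name, *args = check_command.split("!")
    macros = dict(user_macros)
    for index, value in enumerate(args, start=1):
        macros[f"ARG{index}"] = value
    # Macros without a known value are left as they are (simplification).
    expanded = re.sub(r"\$(\w+)\$",
                      lambda m: macros.get(m.group(1), m.group(0)),
                      command_line)
    return name, expanded


COMMAND_LINE = ("$USER1$/check_procs --state-file=$USER8$/$ARG1$.status "
                "--state-time=$ARG2$ -w $ARG3$ -c $ARG4$ -m CPU")

name, line = expand_command(COMMAND_LINE,
                            "check_cpu_hog_procs!cpu_hog!15!80!90",
                            {"USER8": "/var/log/nagios"})
print(line)
# $USER1$/check_procs --state-file=/var/log/nagios/cpu_hog.status --state-time=15 -w 80 -c 90 -m CPU
```

So the service above keeps its state in /var/log/nagios/cpu_hog.status and reports a
process only after it has exceeded 80% (warning) or 90% (critical) CPU for 15 minutes.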


Testing/Porting
***************

This has been written on a 64 bit x86 CentOS 5 machine and tested on similar machines.
No attempt to port it elsewhere has been made.

The patch is against: nagios-plugins-trunk-200905081200

Copyright
*********

I am content to assign copyright of this code to the Nagios Plugins Development Team
on the understanding that this will not prevent me from using the code that I have written
in any way that I wish (in this or another project).





