[Nagiosplug-help] Nagios Scheduler Hangs with Forking Issue since Upgrade from Nagios 2.9 to 3

Ralph.Grothe at itdz-berlin.de
Mon Apr 20 11:44:57 CEST 2009


Dear List Subscribers,

Though this doesn't look like a specific Nagios plugin issue to me,
I hope it sounds familiar to one of the many experts on this list.

Since I upgraded my Nagios from release 2.9 to

[root at nagsaz:~]
# /opt/nagios/bin/nagios -h|grep ^Nagios
Nagios 3.0.6

on 

[root at nagsaz:~]
# uname -sirv;cat /etc/redhat-release 
Linux 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21 EST 2007 i386
Red Hat Enterprise Linux Server release 5 (Tikanga)

it happens that the Nagios scheduler hangs and fills the log file with nothing but messages like these
(no checks can be scheduled any more until Nagios is manually restarted, which is unbearable):

[root at nagsaz:~]
# tail -3 /var/log/nagios/nagios.log 
[1240216261] Warning: fork() in my_system() failed for command "/usr/bin/perl /opt/nagios/nagiosgraph/insert.pl"
[1240216291] Warning: fork() in my_system() failed for command "/usr/bin/perl /opt/nagios/nagiosgraph/insert.pl"
[1240216321] Warning: fork() in my_system() failed for command "/usr/bin/perl /opt/nagios/nagiosgraph/insert.pl"

[root at nagsaz:~]
# grep -c fork\( /var/log/nagios/nagios.log 
1312
[root at nagsaz:~]
# wc -l /var/log/nagios/nagios.log 
1313 /var/log/nagios/nagios.log
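
For context, insert.pl is the nagiosgraph perfdata handler. In a typical nagiosgraph setup (and, roughly, in mine) it is hooked in via a perfdata command along these lines; the command name here is just the conventional, illustrative one:

# in nagios.cfg:
#   process_performance_data=1
#   service_perfdata_command=process-service-perfdata

define command {
    command_name    process-service-perfdata
    command_line    /usr/bin/perl /opt/nagios/nagiosgraph/insert.pl
}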

Before the last restart of Nagios, however, I had deactivated perf data processing for all of my services
(the following grep returns nothing):

[root at nagsaz:~]
# grep process_perf_data.*1 /opt/nagios/etc/objects/*_services.cfg 
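
Every service definition now has process_perf_data set to 0. A minimal, purely illustrative service object (not one of my real definitions) would look like this:

define service {
    use                     generic-service           ; hypothetical template name
    host_name               cebu
    service_description     check-ha-log-warn
    check_command           check_nrpe!check_ha_log   ; hypothetical command
    process_perf_data       0                         ; perf data processing switched off
}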


So at first I thought the forking issue might be caused by too low a ulimit for the nagios user
that the Nagios process runs as, viz. max user processes and max open file descriptors.
But user nagios had only one process running on this box while it was hanging:

[root at nagsaz:~]
# ps -fu nagios
UID        PID  PPID  C STIME TTY          TIME CMD
nagios   19276     1  2 Apr09 ?        06:30:30 /opt/nagios/bin/nagios -d /opt/nagios/etc/nagios.cfg

Otherwise I would have expected to have to explicitly set values for e.g. nproc, nofile and locks in /etc/security/limits.conf, along the lines sketched below.
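
Purely illustrative values, which I have not actually set, would be something like

# /etc/security/limits.conf -- illustrative only, nothing like this is currently in place
nagios    soft    nofile    4096
nagios    hard    nofile    8192
nagios    soft    nproc     4096
nagios    hard    nproc     8192

though as far as I understand, limits.conf is only applied to PAM sessions, so a daemon started from an init script might not pick these values up anyway.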

Are these user limits for nagios OK?
Currently I only have 270 host checks and about 825 service checks that need to be performed.

[root at nagsaz:~]
# su - nagios -c ulimit\ -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 8192
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



As another cause I suspected for a while some embedded Perl issue,
since I compiled Nagios with that option and the p1.pl Perl program got installed as well.

But then there are only very few custom Perl plugins, most of which are executed via NRPE on the monitored hosts anyway.
So the vast majority of my checks rely only on the compiled official plugins (latest stable release).
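
Should embedded Perl turn out to be the culprit after all, I understand Nagios 3 lets you exempt an individual Perl plugin from ePN with a directive in its first few lines. A minimal dummy plugin showing the directive (not one of my real plugins):

#!/usr/bin/perl
# nagios: -epn    <- asks Nagios 3 not to run this plugin under embedded Perl
use strict;
use warnings;
print "OK - dummy check\n";
exit 0;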


The hang always seems to build up from fork errors due to memory allocation problems like this one:

[1239746405] Warning: The check of service 'check-ha-log-warn' on host 'cebu' could not be performed due to a fork() error: 'Cannot allocate memory'.  The check will be rescheduled.


Memory usage while Nagios is hanging looks like this:

[root at nagsaz:~]
# free
             total       used       free     shared    buffers     cached
Mem:        515600     509420       6180          0      12628      57656
-/+ buffers/cache:     439136      76464
Swap:       514072     413876     100196

But according to e.g. vmstat there's no swapping going on, so the system does not seem to be thrashing.
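
That is from watching the swap activity columns of vmstat while it hangs, e.g.

# vmstat 5 5    <- the si (swap-in) and so (swap-out) columns stay at 0 throughout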

Interestingly, when I attach strace to the hanging Nagios I can see open() calls failing with "Too many open files" (EMFILE),
as if the nofile limit for user nagios were too low.


[root at nagsaz:~]
# strace -e trace=open -p $(pgrep -P1 nagios)
Process 19276 attached - interrupt to quit
open("/opt/nagios/var/nagios.tmpI4W7h1", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = -1 EMFILE (Too many open files)
open("/opt/nagios/var/nagios.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EMFILE (Too many open files)
open("/opt/nagios/var/spool", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 EMFILE (Too many open files)
open("/opt/nagios/var/nagios.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EMFILE (Too many open files)
Process 19276 detached

[root at nagsaz:~]
# su - nagios -c ulimit\ -n
1024

It looks as if almost all 1024 file descriptors were used up by open spool files (probably pending spooled check results):

[root at nagsaz:~]
# lsof -nP -p $(pgrep -P1 nagios)|wc -l
1042
[root at nagsaz:~]
# lsof -nP -p $(pgrep -P1 nagios)|grep -c /opt/nagios/var/spool/
1019
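
For what it's worth, the same picture can be cross-checked via /proc (just another way of counting; the numbers above are from lsof):

# ls /proc/$(pgrep -P1 nagios)/fd | wc -l
# ls -l /proc/$(pgrep -P1 nagios)/fd | awk '{print $NF}' | grep -c /opt/nagios/var/spool/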



Sorry for the lengthy and somewhat rambling description.
Could someone possibly give me a clue what might be causing this?

Regards
Ralph






