[Nagiosplug-help] nrpe on solaris stops reporting exit code without nscd

Todd Fleisher todd at fleish.org
Fri Apr 11 21:44:08 CEST 2008


I use nrpe in a mixed environment. My nagios servers that run
check_nrpe are Debian Linux, and they poll a variety of systems running
mostly Debian Linux or Solaris 10 Update 4 i86. The versions vary
between 2.6 & 2.8.1, but I found my problem to be common to both,
and only on the Solaris platform. At a certain point, I found that
although the text result of an nrpe check would report WARNING or
CRITICAL, the exit code was always 0. As a result, nagios left the
status field at OK while displaying the WARNING or CRITICAL text, so
many alert notifications from Solaris machines were missed.
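
For anyone who wants to check their own hosts for this symptom, here is a
minimal Python sketch (not part of my setup, just an illustration). It
compares the status word in a check's text output with the exit code,
using the standard plugin return codes (0 OK, 1 WARNING, 2 CRITICAL,
3 UNKNOWN). The function names are hypothetical, and picking the first
status word out of the text is only a heuristic.

```python
import re
import subprocess

# Standard Nagios plugin return codes.
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def status_from_text(output):
    """Return the first status word found in the plugin output, or None."""
    match = re.search(r"\b(OK|WARNING|CRITICAL|UNKNOWN)\b", output)
    return match.group(1) if match else None


def is_mismatch(output, exit_code):
    """True when the text names a status whose code differs from exit_code."""
    status = status_from_text(output)
    return status is not None and STATUS_CODES[status] != exit_code


def run_check(argv):
    """Run a check command (hypothetical helper) and return (text, exit code)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout.strip(), result.returncode
```

Usage would look something like
is_mismatch(*run_check([path_to_check_nrpe, "-H", host, "-c", command])),
where the path, host and command name depend on your installation. A
result of True for a host means it shows the behaviour described above.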

Making matters worse, the problem wasn't consistent across the
environment. Though all Solaris nodes are running identical versions of
code, some would have the issue and others would not. In the end, I
found that turning on the name-service-cache service (nscd) in Solaris
fixed the issue. I then reconstructed the timeline of what must have
happened:

	- We originally deployed Solaris & left nscd turned on
	- We installed & started nrpe
	- Sometime later we disabled nscd to keep Solaris from caching DNS
	  information
	- nrpe continued to function until it was restarted
	- Hosts that still had nrpe running from back when nscd was present
	  were fine, while hosts where nrpe had been restarted, or newly
	  installed on a system where nscd wasn't running, experienced the
	  issue

Now for the kicker: to fix the issue but still keep Solaris from
caching DNS information, I configured /etc/nscd.conf to disable caching
for every cache it claims to support. I then started the
name-service-cache service and confirmed that DNS was not being cached.
Here is an excerpt from /etc/nscd.conf:

#       Currently supported cache names:
#               audit_user, auth_attr, bootparams, ethers
#               exec_attr, group, hosts, ipnodes, netmasks
#               networks, passwd, printers, prof_attr, project
#               protocols, rpc, services, tnrhdb, tnrhtp, user_attr
#
        logfile                 /var/adm/nscd.log

        enable-cache            hosts           no
        enable-cache            audit_user      no
        enable-cache            auth_attr       no
        enable-cache            bootparams      no
        enable-cache            ethers          no
        enable-cache            exec_attr       no
        enable-cache            group           no
        enable-cache            ipnodes         no
        enable-cache            netmasks        no
        enable-cache            networks        no
        enable-cache            passwd          no
        enable-cache            printers        no
        enable-cache            prof_attr       no
        enable-cache            project         no
        enable-cache            protocols       no
        enable-cache            rpc             no
        enable-cache            services        no
        enable-cache            tnrhdb          no
        enable-cache            tnrhtp          no
        enable-cache            user_attr       no
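
To double-check that no cache was left enabled, something like the
following Python sketch could be run against the file contents. It is
an illustration only: the function name is hypothetical, the list of
cache names is copied from the comment in the excerpt above, and it
assumes that when a cache is listed more than once the last line wins
(I have not verified how nscd itself treats duplicates).

```python
# Cache names taken from the "Currently supported cache names" comment.
SUPPORTED_CACHES = (
    "audit_user auth_attr bootparams ethers exec_attr group hosts ipnodes "
    "netmasks networks passwd printers prof_attr project protocols rpc "
    "services tnrhdb tnrhtp user_attr"
).split()


def caches_not_disabled(conf_text, names=SUPPORTED_CACHES):
    """Return the cache names that lack an 'enable-cache <name> no' line."""
    settings = {}
    for line in conf_text.splitlines():
        # Drop comments, then split the remainder on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 3 and fields[0] == "enable-cache":
            settings[fields[1]] = fields[2]  # assumption: last line wins
    return [name for name in names if settings.get(name) != "no"]
```

Calling caches_not_disabled(open("/etc/nscd.conf").read()) should return
an empty list for a file that contains all twenty lines shown above.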

I then started nrpe, and the issue was gone. My next step is to truss
the process to see if I can determine what's different between the two
scenarios. But I wanted to post this to see if others have already
experienced the same issue. I couldn't find anything in the mailing
list archives that matched.

Thanks,
Todd





