[Nagiosplug-help] Tracking down pthread/check_dns problem on CentOS4 w/ 1.4.2 plugins.

Ton Voon ton.voon at altinity.com
Mon Nov 28 14:18:03 CET 2005


On 28 Nov 2005, at 17:26, John P. Rouillard wrote:

> Hello all:
>
> I am running CentOS4 (RH Enterprise 4 public version) and I am seeing
> the dreaded:
>
>   nslookup returned error status
>
> problem. However the plugins I am using were compiled on this box. As
> Ton Voon said:
>
>> Are you using RedHat? There is a known problem with bind on RedHat
>> where the nslookup and dig commands do not exit correctly due to a
>> kernel pthread issue.
>
> CentOS is "close enough" I guess 8-(.
>
>> If you are using Redhat, this problem is fixed in nagios-plugins
>> 1.4.2, but you need to compile it yourself for the ./configure script
>> to pick up that your system has a problem and work around it.
>
> Seems like it doesn't work for CentOS and the kernel I am
> running. Grepping through the sources for 1.4.2 doesn't show me a
> reference to the pthread bug or a work around for it in
> check_dns.c. However I came across the following Changelog entry:
>
> 2005-09-12 11:31  tonvoon
>
>         * plugins/popen.c, Makefile.am, configure.in,
>           config_test/Makefile, config_test/child_test.c,
>           config_test/run_tests: ECHILD error at waitpid on Red Hat
>           systems (Peter Pramberger and Sascha Runschke - 1250191)
>
> A little more searching in plugins/popen.c turned up this segment of
> code:
>
> #ifdef REDHAT_SPOPEN_ERROR
>         while (!childtermd);
>         /* wait until SIGCHLD */
> #endif
>
> Now looking at configure to see where REDHAT_SPOPEN_ERROR is defined I
> see it calling a grep "\.EL$" on "uname -r"'s output. The uname -r
> output is "2.6.9-22.0.1.ELsmp" so this test is not done.

Someone else had already pointed out the CentOS case, so the CVS version
of configure.in has this:

dnl Check for Redhat spopen problem
dnl Weird problem where ECHILD is returned from a wait call in error
dnl Only appears to affect nslookup and dig calls. Only affects redhat
dnl around 2.6.9-11 (okay in 2.6.9-5). Redhat investigating root cause
dnl We patch plugins/popen.c
dnl Need to add smp because uname different on those. May need to check
dnl Fedora Core too in future
if echo $ac_cv_uname_r | egrep "\.EL(smp)?$" >/dev/null 2>&1 ; then

So this should catch your system.
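
As a rough check, the two patterns can be tried against a release string
directly. This is only a sketch: the helper name matches_el_kernel is
hypothetical and is not part of configure.in, and it feeds in a literal
string instead of $ac_cv_uname_r.

```shell
#!/bin/sh
# Hypothetical helper (not part of configure.in): succeeds when the
# kernel release string matches the given extended regular expression.
matches_el_kernel() {
    # $1 = pattern, $2 = kernel release as printed by `uname -r`
    echo "$2" | grep -E "$1" >/dev/null 2>&1
}

release="2.6.9-22.0.1.ELsmp"   # the release reported in this thread

# Pattern from the 1.4.2 configure script: ".EL" must end the string,
# so the trailing "smp" prevents a match.
if matches_el_kernel '\.EL$' "$release"; then
    echo "1.4.2 pattern: match"
else
    echo "1.4.2 pattern: no match"
fi

# Pattern from the CVS configure.in: allows an optional "smp" suffix.
if matches_el_kernel '\.EL(smp)?$' "$release"; then
    echo "CVS pattern: match"
else
    echo "CVS pattern: no match"
fi
```

Replacing the literal with "$(uname -r)" runs the same check on the
local box.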

>
> Correcting the configure script (deleted the $ closing anchor) to allow
> the test to be run, I see it calling make to run "config_test/run_tests
> 10". If I run run_tests with an argument of 1000, I get Success=993
> Fail=7. With "run_tests 10", I get a successful completion better than
> 80% of the time, leading to REDHAT_SPOPEN_ERROR being undefined.

Are you saying that if you run it 10 times, it is 100% successful?

I'm happy with increasing the number of iterations if it catches the  
problem more of the time.
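
To get a feel for how many iterations would be enough, here is a small
sketch. It assumes the 7-in-1000 failure rate reported above is
representative and that iterations fail independently of each other;
neither has been verified. The function name all_pass_probability is
made up for this example. It prints the chance that a run of N
iterations sees no failure at all, which is the case where configure
would leave REDHAT_SPOPEN_ERROR undefined.

```shell
#!/bin/sh
# Hypothetical helper: probability that all N iterations succeed,
# given a per-iteration failure rate p (assumes independent iterations).
all_pass_probability() {
    # $1 = per-iteration failure rate, $2 = number of iterations
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# Failure rate taken from the "Success=993 Fail=7" run quoted above.
all_pass_probability 0.007 10     # prints 0.932
all_pass_probability 0.007 1000
```

Under those assumptions a 10-iteration test misses the problem most of
the time, which fits the "better than 80%" figure, while a much larger
count makes a miss unlikely.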

> Increasing the iterations and fixing the regexp so that
> REDHAT_SPOPEN_ERROR is defined in config.h does seem to have solved
> the problem.  However:
>
>> Alternatively, Sascha Runschke has been working with Red Hat and it
>> has been fixed in hotfix-kernel-2.6.9-22.12.EL, which you can
>> probably request from them through your support contract.
>
> I think I am seeing this problem in a Java-based application as
> well. Searching through Red Hat's bugzilla hasn't led me to the ticket
> for this fix. Does anybody have the kernel patch or a ticket ID so I
> can see the actual problem and try to fix/verify it, or send it to the
> CentOS folks for inclusion in a release/patch?

What is the best way to specify what the fix from Red Hat is? I will
update the configure.in comments to reflect it.

Ton




http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon





