[Nagiosplug-help] Tracking down pthread/check_dns problem on CentOS4 w/ 1.4.2 plugins.
ton.voon at altinity.com
Mon Nov 28 14:18:03 CET 2005
On 28 Nov 2005, at 17:26, John P. Rouillard wrote:
> Hello all:
> I am running CentOS4, (RH Enterprise 4 public version) and I am
> seeing the
> nslookup returned error status
> problem. However the plugins I am using were compiled on this box. As
> Ton Voon said:
>> Are you using RedHat? There is a known problem with bind on RedHat
>> where the nslookup and dig commands do not exit correctly due to a
>> kernel pthread issue.
> CentOS is "close enough" I guess 8-(.
>> If you are using Redhat, this problem is fixed in nagios-plugins
>> 1.4.2, but you need to compile it yourself for the ./configure script
>> to pick up that your system has a problem and work around it.
> Seems like it doesn't work for CentOS and the kernel I am
> running. Grepping through the sources for 1.4.2 doesn't show me a
> reference to the pthread bug or a workaround for it in
> check_dns.c. However I came across the following Changelog entry:
> 2005-09-12 11:31 tonvoon
> * plugins/popen.c, Makefile.am, configure.in, config_test/
> config_test/child_test.c, config_test/run_tests: ECHILD
> error at
> waitpid on Red Hat systems (Peter Pramberger and Sascha
> - 1250191)
> A little more searching in plugins/popen.c turned up this segment of code:
> #ifdef REDHAT_SPOPEN_ERROR
> while (!childtermd);
> /* wait until SIGCHLD */
> Now looking at configure to see where REDHAT_SPOPEN_ERROR is defined I
> see it calling a grep "\.EL$" on "uname -r"'s output. The uname -r
> output is "2.6.9-22.0.1.ELsmp" so this test is not done.
Someone else had already pointed out CentOS, so the CVS version of
configure.in has this:
dnl Check for Redhat spopen problem
dnl Wierd problem where ECHILD is returned from a wait call in error
dnl Only appears to affect nslookup and dig calls. Only affects
dnl 2.6.9-11 (okay in 2.6.9-5). Redhat investigating root cause
dnl We patch plugins/popen.c
dnl Need to add smp because uname different on those. May need to check
dnl Fedora Core too in future
if echo $ac_cv_uname_r | egrep "\.EL(smp)?$" >/dev/null 2>&1 ; then
So this should catch your system.
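As a quick sanity check, the two patterns can be compared against the uname -r string reported above. This is only a sketch: matches_release is a made-up helper name, not something in the plugins tree, and it uses grep -E where configure.in uses egrep.

```shell
# Hypothetical helper (not part of nagios-plugins): report whether a
# kernel release string matches an extended regular expression.
matches_release() {
    if echo "$1" | grep -E "$2" >/dev/null 2>&1; then
        echo "match"
    else
        echo "no match"
    fi
}

release="2.6.9-22.0.1.ELsmp"                # uname -r output quoted in this thread

matches_release "$release" '\.EL$'          # pattern described for 1.4.2
matches_release "$release" '\.EL(smp)?$'    # pattern in CVS configure.in
```

The first call should print "no match" and the second "match", since only the CVS pattern allows the trailing "smp".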
> Correcting the configure script (deleted the $ closing anchor) to allow
> the test to be run, I see it calling make to run "config_test/run_tests
> 10". If I run run_tests with an argument of 1000, I get Success=993
> Fail=7. With "run_tests 10", I get a successful completion better than
> 80% of the time, leading to REDHAT_SPOPEN_ERROR being undefined.
Are you saying that if you run it 10 times, it is 100% successful?
I'm happy with increasing the number of iterations if it catches the
problem more of the time.
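For a rough idea of how the iteration count affects detection, here is a back-of-the-envelope sketch. It assumes each iteration fails independently at the 7-in-1000 rate reported above, which may not hold for a kernel race; all_pass_probability is a made-up name used only for illustration.

```shell
# Hypothetical helper: probability that every one of N runs succeeds,
# ASSUMING runs are independent with a fixed per-run failure rate.
all_pass_probability() {
    fail_rate=$1    # per-run failure rate, e.g. 0.007 for 7 in 1000
    runs=$2         # number of iterations given to run_tests
    awk -v p="$fail_rate" -v n="$runs" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

all_pass_probability 0.007 10      # chance that "run_tests 10" sees no failure
all_pass_probability 0.007 100
all_pass_probability 0.007 1000
```

Under that assumption, 10 iterations would pass cleanly roughly 93% of the time (so the problem goes undetected), 100 iterations roughly half the time, and 1000 iterations almost never.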
> Increasing the iterations and fixing the regexp so that
> REDHAT_SPOPEN_ERROR is defined in config.h does seem to have solved
> the problem. However:
>> Alternatively, Sascha Runschke has been working with Red Hat and it
>> has been fixed in hotfix-kernel-2.6.9-22.12.EL, which you can
>> probably request from them through your support contract.
> I think I am seeing this problem in a java based application as
> well. Searching through Red Hat's Bugzilla hasn't led me to the ticket
> for this fix. Does anybody have the kernel patch or a ticket ID so I
> can see the actual problem and try to fix/verify it, or send it to the
> CentOS folks for inclusion in a release/patch?
What is the best way to specify what the fix from Red Hat is? I will
update the configure.in comments to reflect it.