[Nagiosplug-help] Tracking down pthread/check_dns problem on CentOS4 w/ 1.4.2 plugins.

John P. Rouillard rouilj at cs.umb.edu
Tue Nov 29 08:41:06 CET 2005

In message <D91212FE-417A-4B6E-BD1C-6652FC50788B at altinity.com>,
Ton Voon writes:
>On 28 Nov 2005, at 17:26, John P. Rouillard wrote:
>> Correcting the configure script (deleted the $ closing anchor) to allow
>> the test to be run, I see it calling make to run "config_test/run_tests
>> 10". If I run run_tests with an argument of 1000, I get Success=993
>> Fail=7. With "run_tests 10", I get a successful completion better than
>> 80% of the time, leading to REDHAT_SPOPEN_ERROR being undefined.
>Are you saying that if you run it 10 times, it is 100% successful?

If I run "run_tests 10" 10 times, about 2 of the ten 10-iteration runs
fail on average, but I have had a streak of 15 error-free runs. I am
just guessing, but it may be load related. If I pause between the runs,
it seems less likely to happen. However, I never had a run of 1000 pass.
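
The repeated-run experiment described above could be scripted roughly
as follows. This is only a sketch: the "./run_tests" path, the helper
names, and the assumption that the program prints a line of the form
"Success=N Fail=M" are illustrative guesses based on the output quoted
in this thread, not something taken from the plugins source.

```python
import re
import subprocess
import time
from collections import Counter

# Assumed output format, based on the "Success=N Fail=M" lines in this thread.
RESULT_RE = re.compile(r"Success=(\d+) Fail=(\d+)")


def run_once(iterations, cmd="./run_tests"):
    """Run the test program once and return its combined output.

    The default command path is a placeholder; point it at wherever
    run_tests was actually built.
    """
    proc = subprocess.run([cmd, str(iterations)],
                          capture_output=True, text=True)
    return proc.stdout + proc.stderr


def tally(outputs):
    """Count identical result lines, much like `sort | uniq -c`."""
    counts = Counter()
    for out in outputs:
        match = RESULT_RE.search(out)
        counts[match.group(0) if match else "unparsed"] += 1
    return counts


def batch(iterations, runs=20, pause=0.0, runner=run_once):
    """Run the test `runs` times, optionally sleeping between runs."""
    outputs = []
    for _ in range(runs):
        outputs.append(runner(iterations))
        if pause:
            time.sleep(pause)
    return tally(outputs)

# Example (not executed here): compare back-to-back runs with paused runs.
#   print(batch(10, runs=20))
#   print(batch(10, runs=20, pause=2.0))
```

Comparing the tallies with and without a pause would be one way to
check the guess that the failures are load related.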

>I'm happy with increasing the number of iterations if it catches the  
>problem more of the time.

While 1000 may be overkill, I am seeing about 50% detection of the
failure when running it in a while loop. The 10-iteration version
fails less often. I didn't try 100 or 500.
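
As a rough sanity check on the iteration count, one can ask how often a
run of N iterations would come back clean if every iteration failed
independently at the 7-in-1000 rate reported earlier in the thread. The
independence assumption is purely illustrative, and the function name
is made up for this sketch.

```python
def pass_probability(fail_rate, iterations):
    """Chance that `iterations` independent trials all succeed."""
    return (1.0 - fail_rate) ** iterations


# Per-iteration failure rate taken from the Success=993 Fail=7 run.
fail_rate = 7 / 1000

for n in (10, 100, 500, 1000):
    p_clean = pass_probability(fail_rate, n)
    print(f"run_tests {n}: clean {p_clean:.1%}, detected {1 - p_clean:.1%}")
```

Under this model "run_tests 10" comes back clean roughly 93% of the
time, which fits the "better than 80%" observation, while "run_tests
1000" should almost never pass. The roughly 50% detection actually
seen at 1000 is far below that prediction, which would be consistent
with failures arriving in bursts (for example under load) rather than
independently.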

However, I did a bit more testing. The results aren't reliable. I have
had 20 runs of "run_tests 10" fail in a row and 20 pass in a row. As
the number passed to run_tests goes up, I have fewer passes, but no
definite way of determining if the problem exists. E.g., with
one batch of 20 runs of "run_tests 500" I got the following distribution:

      1 Success=372 Fail=128
      1 Success=400 Fail=100
      2 Success=496 Fail=4
      1 Success=498 Fail=2
      1 Success=499 Fail=1
     14 Success=500 Fail=0
70% success (14 of 20 runs). For "run_tests 10", I get:

     19 Success=10 Fail=0
      1 Success=7 Fail=3
95% success or

      2 Success=10 Fail=0
      5 Success=5 Fail=5
      3 Success=6 Fail=4
      4 Success=7 Fail=3
      6 Success=8 Fail=2
10% success or

      5 Success=5 Fail=5
      4 Success=6 Fail=4
      4 Success=7 Fail=3
      5 Success=8 Fail=2
      2 Success=9 Fail=1
0% success.

For a count of 1000 I got:
      5 Success=1000 Fail=0
      1 Success=780 Fail=220
      1 Success=986 Fail=14
      1 Success=990 Fail=10
      1 Success=995 Fail=5
      2 Success=996 Fail=4
      6 Success=997 Fail=3
      3 Success=999 Fail=1
25% success or

      9 Success=1000 Fail=0
      1 Success=833 Fail=167
      1 Success=944 Fail=56
      1 Success=990 Fail=10
      1 Success=996 Fail=4
      1 Success=997 Fail=3
      2 Success=998 Fail=2
      4 Success=999 Fail=1
45% success.

Not sure if the data is of any use, but more iterations seem to be better.
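
The per-batch percentages above can be recomputed from the tallies
with a small script. This is a sketch that assumes the tallies are in
the "count Success=N Fail=M" layout shown in this message; the function
and key names are made up for illustration.

```python
import re

# One tally line: "<count> Success=<ok> Fail=<bad>"
TALLY_RE = re.compile(r"^\s*(\d+)\s+Success=(\d+)\s+Fail=(\d+)\s*$")


def summarize(tally_text):
    """Summarize a block of tally lines; non-matching lines are skipped."""
    runs = clean = iterations = failures = 0
    for line in tally_text.splitlines():
        match = TALLY_RE.match(line)
        if not match:
            continue
        count, ok, bad = (int(g) for g in match.groups())
        runs += count
        if bad == 0:
            clean += count          # runs that would NOT flag the problem
        iterations += count * (ok + bad)
        failures += count * bad
    return {
        "runs": runs,
        "clean_runs": clean,
        "clean_fraction": clean / runs if runs else 0.0,
        "failures": failures,
        "iterations": iterations,
    }


# The second "count of 1000" batch from this message.
batch_1000 = """
      9 Success=1000 Fail=0
      1 Success=833 Fail=167
      1 Success=944 Fail=56
      1 Success=990 Fail=10
      1 Success=996 Fail=4
      1 Success=997 Fail=3
      2 Success=998 Fail=2
      4 Success=999 Fail=1
"""
summary = summarize(batch_1000)
print(f"{summary['clean_runs']}/{summary['runs']} clean runs "
      f"({summary['clean_fraction']:.0%})")
```

For this batch the script reports 9 of 20 clean runs, matching the 45%
figure, and it also gives the overall per-iteration failure count,
which makes batches with different iteration counts easier to compare.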

>> Ton Voon said:
>>> Alternatively, Sascha Runschke has been working with Red Hat and it
>>> has been fixed in hotfix-kernel-2.6.9-22.12.EL, which you can
>>> probably request from them through your support contract.
>> I think I am seeing this problem in a Java-based application as
>> well. Searching through Red Hat's bugzilla hasn't led me to the ticket
>> for this fix. Does anybody have the kernel patch or a ticket ID so I
>> can see the actual problem and try to fix/verify it, or send it to the
>> CentOS folks for inclusion in a release/patch?
>What is the best way to specify what the fix from Red Hat is? I will  
>update the configure.in comments to reflect.

I would guess the bugzilla ID. I assume the bug ticket is publicly
accessible. A link to the kernel patch wouldn't hurt either.

				-- rouilj
John Rouillard
My employers don't acknowledge my existence much less my opinions.
