[Nagiosplug-help] Tracking down pthread/check_dns problem on CentOS4 w/ 1.4.2 plugins.
Ton Voon
ton.voon at altinity.com
Tue Nov 29 13:24:01 CET 2005
On 29 Nov 2005, at 16:40, John P. Rouillard wrote:
>> Are you saying that if you run it 10 times, it is 100% successful?
>
> If I run "run_tests 10" 10 times, about 2 of the 10 runs fail on
> average, but I have had a run of 15 error free. I am just
> guessing, but it may be load related. If I pause between the runs, it
> seems less likely to happen. However I never had a run of 1000 pass.
>
>> I'm happy with increasing the number of iterations if it catches the
>> problem more of the time.
>
> While 1000 may be overkill, I am seeing a 50% detection of failure
> when running it in a while loop. The 10 iteration version is failing
> less often. I didn't try 100 or 500.
>
> However I did a bit more testing. The results aren't reliable. I have
> had 20 runs of "run_tests 10" fail in a row and 20 pass in a row. As
> the number passed to run_tests goes up, I have fewer passes, but no
> definite way of determining if the problem exists. E.g. with
> a single run of "run_tests 500" I got the following distribution:
>
> 1 Success=372 Fail=128
> 1 Success=400 Fail=100
> 2 Success=496 Fail=4
> 1 Success=498 Fail=2
> 1 Success=499 Fail=1
> 14 Success=500 Fail=0
> 70% success. For a "run_tests 10", I get:
>
> 19 Success=10 Fail=0
> 1 Success=7 Fail=3
> 95% success or
>
> 2 Success=10 Fail=0
> 5 Success=5 Fail=5
> 3 Success=6 Fail=4
> 4 Success=7 Fail=3
> 6 Success=8 Fail=2
> 10% success or
>
> 5 Success=5 Fail=5
> 4 Success=6 Fail=4
> 4 Success=7 Fail=3
> 5 Success=8 Fail=2
> 2 Success=9 Fail=1
> 0% success.
>
> For a count of 1000 I got:
> 5 Success=1000 Fail=0
> 1 Success=780 Fail=220
> 1 Success=986 Fail=14
> 1 Success=990 Fail=10
> 1 Success=995 Fail=5
> 2 Success=996 Fail=4
> 6 Success=997 Fail=3
> 3 Success=999 Fail=1
> 25% success or
>
> 9 Success=1000 Fail=0
> 1 Success=833 Fail=167
> 1 Success=944 Fail=56
> 1 Success=990 Fail=10
> 1 Success=996 Fail=4
> 1 Success=997 Fail=3
> 2 Success=998 Fail=2
> 4 Success=999 Fail=1
> 45% success.
>
> Not sure if the data is of any use, but more runs seem to be better.
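The "% success" figures quoted above appear to be the share of runs that finished with Fail=0. As a rough sketch of that arithmetic, assuming each tally line has the form "<runs> Success=<n> Fail=<m>", the second "count of 1000" set can be summarized like this. The `summarize` helper and the `TALLY` name are made up for illustration and are not part of run_tests.

```python
import re

# Tally copied from the second "count of 1000" set quoted above.
# Assumed line format: <number of runs> Success=<n> Fail=<m>
TALLY = """\
9 Success=1000 Fail=0
1 Success=833 Fail=167
1 Success=944 Fail=56
1 Success=990 Fail=10
1 Success=996 Fail=4
1 Success=997 Fail=3
2 Success=998 Fail=2
4 Success=999 Fail=1
"""

LINE = re.compile(r"(\d+)\s+Success=(\d+)\s+Fail=(\d+)")


def summarize(tally):
    """Return (total_runs, clean_runs, failed_iterations, total_iterations)."""
    total_runs = clean_runs = failed_iters = total_iters = 0
    for runs, ok, bad in LINE.findall(tally):
        runs, ok, bad = int(runs), int(ok), int(bad)
        total_runs += runs
        if bad == 0:
            clean_runs += runs  # a run counts as a success only with zero failures
        failed_iters += runs * bad
        total_iters += runs * (ok + bad)
    return total_runs, clean_runs, failed_iters, total_iters


total_runs, clean_runs, failed_iters, total_iters = summarize(TALLY)
print(f"clean runs: {clean_runs}/{total_runs} "
      f"({100 * clean_runs / total_runs:.0f}%)")
print(f"failed iterations: {failed_iters}/{total_iters}")
```

Counting failed iterations alongside clean runs shows how rarely a single iteration fails even when more than half of the runs hit the problem.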
I agree this is a pain to detect. If there are any ideas on a better
test, I'm all ears.
What about running "run_tests 10" 100 times? If any run fails, break
out and apply the fix. If all 100 pass, then assume the system is
okay.
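The break-on-first-failure idea above could be sketched roughly as follows. This is only an illustration: `first_failing_batch`, `run_tests_10` and `miss_probability` are hypothetical names, and the sketch assumes that "./run_tests 10" exits non-zero when any iteration fails, which the thread does not confirm. The estimate of how often a broken system would slip through also assumes batches are independent, which the streaks of passes and failures reported above suggest is not quite true.

```python
import subprocess


def first_failing_batch(run_batch, max_batches=100):
    """Call run_batch() up to max_batches times.

    Returns the 1-based index of the first failing batch,
    or None if every batch passed.
    """
    for i in range(1, max_batches + 1):
        if not run_batch():
            return i  # break out at the first failure
    return None


def run_tests_10():
    # Assumption: "run_tests 10" signals failure through its exit status.
    return subprocess.run(["./run_tests", "10"]).returncode == 0


def miss_probability(batch_pass_rate, batches=100):
    """Chance that an affected system passes every batch,
    assuming each batch passes independently at batch_pass_rate."""
    return batch_pass_rate ** batches
```

Calling first_failing_batch(run_tests_10) would give the batch number to report, or None when the system looks okay. With the best case seen above (95% of "run_tests 10" runs passing), miss_probability(0.95) comes out below 1%, so 100 batches should catch the problem far more reliably than a single run, provided the independence assumption is not too far off.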
Ton
