[Nagiosplug-devel] [ nagiosplug-Patches-2784928 ] check_disk no longer hangs on hanging filesystems

SourceForge.net noreply at sourceforge.net
Fri May 1 12:06:18 CEST 2009


Patches item #2784928, was opened at 2009-05-01 12:06
Message generated for change (Tracker Item Submitted) made by lausser
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=397599&aid=2784928&group_id=29880

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Enhancement
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: gerhard lausser (lausser)
Assigned to: Nobody/Anonymous (nobody)
Summary: check_disk no longer hangs on hanging filesystems

Initial Comment:
Hi,
I created a patch for check_disk (v2025/1.4.13) which can handle hanging NFS filesystems. Imagine you have mounted a share from a NAS at the mountpoint /mnt. If the storage device, or whatever acts as the NFS server, dies or runs into a network problem, you will see messages like "NFS server nas.naprax.de not responding still trying", and every process accessing files inside the /mnt directory will be blocked, maybe forever. Depending on the mount options, the hanging processes may even be immune to a kill -9. This also applies to check_disk: if you have a service monitoring the usage of /mnt with "check_disk ... -p /mnt", it will be blocked as well, and Nagios will report a timeout. The bad thing is that every <check_interval> minutes another check_disk is started, which will hang too. Sooner or later your process list fills up with unkillable check_disk processes.
The critical piece of code inside check_disk is the stat system call, which is currently needed to find out whether a path exists at all. If that stat call hits a directory which is mounted from a dead NFS server, it does not return with an error code; it does not return at all.
I found out that although processes cannot be killed in such a situation, threads can. So I rewrote the stat_path subroutine so that the critical stat is executed in its own thread. If this thread does not terminate within the --timeout interval, it is considered to be blocked by a dead NFS filesystem and the thread is detached.
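To illustrate the idea, here is a minimal sketch of running stat in a worker thread and giving up after a timeout. It is not the code from the patch: the names stat_job, stat_worker and stat_with_timeout are hypothetical, and waiting on a condition variable with pthread_cond_timedwait is just one assumed way to implement the timeout. The worker is detached right after creation, so an abandoned thread needs no join.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* All names here are hypothetical; they are not taken from the patch. */
struct stat_job {
    pthread_mutex_t lock;
    pthread_cond_t  finished;
    int             done;       /* set by the worker once stat() returned */
    int             abandoned;  /* set by the caller after a timeout */
    int             rc;         /* return value of stat() */
    int             err;        /* errno left by stat() */
    struct stat     st;
    char           *path;       /* private copy, owned by the job */
};

static void job_free(struct stat_job *job)
{
    pthread_cond_destroy(&job->finished);
    pthread_mutex_destroy(&job->lock);
    free(job->path);
    free(job);
}

static void *stat_worker(void *arg)
{
    struct stat_job *job = arg;
    int rc = stat(job->path, &job->st);  /* may block on a dead NFS mount */
    int err = errno;
    int abandoned;

    pthread_mutex_lock(&job->lock);
    job->rc = rc;
    job->err = err;
    job->done = 1;
    abandoned = job->abandoned;
    pthread_cond_signal(&job->finished);
    pthread_mutex_unlock(&job->lock);

    if (abandoned)
        job_free(job);  /* the caller gave up, so the worker cleans up */
    return NULL;
}

/* Returns the result of stat(), or -1 with errno = ETIMEDOUT on timeout. */
int stat_with_timeout(const char *path, struct stat *out, int timeout_sec)
{
    struct stat_job *job;
    struct timespec deadline;
    pthread_t tid;
    size_t len;
    int rc, err;

    assert(path != NULL && out != NULL);
    job = calloc(1, sizeof *job);
    if (job == NULL) { errno = ENOMEM; return -1; }
    len = strlen(path) + 1;
    job->path = malloc(len);
    if (job->path == NULL) { free(job); errno = ENOMEM; return -1; }
    memcpy(job->path, path, len);
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->finished, NULL);

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    rc = pthread_create(&tid, NULL, stat_worker, job);
    if (rc != 0) { job_free(job); errno = rc; return -1; }
    pthread_detach(tid);

    pthread_mutex_lock(&job->lock);
    rc = 0;
    while (!job->done && rc == 0)
        rc = pthread_cond_timedwait(&job->finished, &job->lock, &deadline);
    if (!job->done) {
        job->abandoned = 1;  /* the job stays alive for the blocked worker */
        pthread_mutex_unlock(&job->lock);
        errno = ETIMEDOUT;
        return -1;
    }
    rc = job->rc;
    err = job->err;
    *out = job->st;
    pthread_mutex_unlock(&job->lock);
    job_free(job);
    errno = err;
    return rc;
}
```

A caller would use something like stat_with_timeout("/mnt", &st, 5) and treat errno == ETIMEDOUT as a hanging filesystem. The job lives on the heap rather than on the caller's stack, so a worker that is still blocked after the timeout never touches memory that has gone out of scope. What happens to the abandoned thread afterwards depends on the kernel and the mount options, and the sketch makes no claim about that.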
I tested it on Linux 2.6.18 (gcc 4.1.2) and Solaris 10/x86 (gcc 3.4.3):
# mount -ohard,nointr nas.naprax.de:/mnt/md1/db2 /mnt 
$ check_disk -w 2G -c 1G -p /mnt
DISK OK - free space: /mnt 67815 MB (7% inode=99%);| /mnt=823060MB;938549;938550;0;938551
Then I switched off the NAS device:
$ check_disk -w 3G -c 2G -p /mnt 
DISK CRITICAL - /mnt hangs: Timeout
$ check_disk -t 5 -w 3G -c 1G -p /mnt/rollout-p
DISK CRITICAL - /mnt/rollout-p hangs: Timeout

real    0m5.013s
user    0m0.001s
sys     0m0.003s

The patch includes modifications to:
plugins/check_disk.c : includes pthread.h and adds a prototype for do_stat_path; the old stat_path was renamed to do_stat_path, and the new stat_path uses the thread trick (it calls the old stat_path code inside a thread)
plugins/Makefile.am : adds -lpthread to check_disk_LDADD
configure.in : checks for the presence of libpthread and pthread.h

I hope this is useful to you.

Gerhard
