check_ping: Add support for showing name (DNS) resolution in the plugin output (STDOUT). (#1284)
Mark A. Ziesemer
notifications at github.com
Sun Nov 30 00:26:07 CET 2014
I never expected so much reluctance at such a simple request. I did not
submit this as a bug report, looking for someone else to solve it - but as a
pull request for which I've already completed the development, am already
running the changes myself, and am simply hoping to contribute the
improvements back to the community.
I guess I still don't understand the concerns over an additional flag here -
especially when mentioned in the context of "inconsistently used options
across the plugins". There is no need to offer this as a single-character
flag - we could just stick with "--show-resolution" (or some other proposed
name), without needing to worry about any realistic conflict in what such an
option might mean in other plugins. I'm not particularly attached to the
implementation either (even though it was meant to be completely templated
off of existing code) - and if someone would like to implement this request
using an alternate implementation, that would be great as well.
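
To make the long-only option concrete, here is a minimal Python sketch of how a wrapper might expose "--show-resolution" and append the resolved address to the plugin output. It is an illustration only, not the code from this pull request (check_ping itself is written in C). The function names, the output format, and the injectable resolver are all assumptions made for the example.

```python
import argparse
import socket


def resolve_first_address(host):
    """Return the first address the system resolver reports for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return infos[0][4][0]


def build_parser():
    """Hypothetical option set: a long-only flag, no single-character form."""
    parser = argparse.ArgumentParser(description="ping check wrapper (sketch)")
    parser.add_argument("-H", "--hostname", required=True)
    parser.add_argument(
        "--show-resolution",
        action="store_true",
        help="include the resolved IP address in the output",
    )
    return parser


def format_target(host, show_resolution, resolver=resolve_first_address):
    """Return the host label for the output line (format is an assumption)."""
    if not show_resolution:
        return host
    return f"{host} ({resolver(host)})"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"PING target: {format_target(args.hostname, args.show_resolution)}")
```

Because the flag is long-only and off by default, existing invocations would keep their current output.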
This is not some new math or edge case. The resolved address is deemed useful,
which is why it is shown in the default output of all ping executables we've
referenced here. I also don't immediately see any other option for accomplishing
the same result without patching the current plugin (or writing my own), hence
my persistence in following up on this pull request. It is not as if this
is an ever-changing program for which any new changes will likely conflict
at some point. Since 2012, there have only been 3 changes here that weren't
related directly to the rename for the fork - 2 in January and 1 in April.
(However, I guess this will make it very easy for me to maintain my own copy
with these additions, if required.)
> I'd say, at least 90% use check_ping with the IP address, not with the
> hostname, so the use case is rather limited.
Just curious, are there gathered and/or published metrics for these types of
use cases anywhere? Regardless, I'd like to see any additional studies or
discussion around this monitoring design decision - monitoring by IP vs.
hostname. In my years of experience on a few different related monitoring
implementations here, we observed a number of significant misses on a system
that was still based on static IPs, while similar incidents were properly
caught when based on host names. They were arguably edge cases - but cases
that we'd expect such a monitoring system to be able to detect and report.
Things like the disappearance of a host name record from DNS, where the DNS
name is used for connections from other systems. Yes - in these incidents,
things that "never should happen" happened - and we want such monitoring to
be "synthetic" and to report on such failures.
Part of the reason that checking by IP rather than hostname is so
prevalent may be these types of limitations in the
current tools and plugins. By providing proper support for both approaches,
I would hope the tide would shift to more checking by hostname - which I
sincerely believe to be a better approach. However, please don't take my
word for it - offer the support and choice to the users, and empower them to
make the appropriate determination for their specific needs.
Consider an integration with a partner over whose IP addressing we have no
control. We depend upon a service being available at
the-service.extranet.some-company.com. It could be load-balanced with
round-robin DNS, or it could otherwise be statically-assigned with the need
to change hosting providers, etc. In the latter case, they could execute an
orderly transition - with proper TTLs on the DNS records, and services
available at both addresses for a matching transition period. However, when
configured to monitor by IP instead of by hostname, our monitoring now
potentially ends up out of sync - or worse, misses a failure, as the check
ends up monitoring an incorrect host. (Again, things that should "never
happen" if all other processes are properly followed, but not assumptions
we'd like to make here.)
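
The drift described above can be sketched as a small check that compares a statically configured IP against what the host name currently resolves to. This is a hypothetical helper for illustration, not part of check_ping or of this pull request. The function names are assumptions, and the addresses in the usage note come from the reserved documentation ranges.

```python
import socket


def resolve_all(host):
    """Return every distinct address the system resolver reports for host."""
    infos = socket.getaddrinfo(host, None)
    # sockaddr is the fifth element of each entry; its first field is the IP.
    return sorted({info[4][0] for info in infos})


def configured_ip_is_current(configured_ip, host, resolver=resolve_all):
    """True if the statically configured IP is still among the resolved ones."""
    return configured_ip in resolver(host)
```

For example, if the partner has moved the service from 192.0.2.10 to 198.51.100.20, a check still pinned to 192.0.2.10 would keep probing the old host, while configured_ip_is_current("192.0.2.10", host) would return False and flag the mismatch.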
Yes - this means that we are a bit more dependent on DNS or other name
resolution, and could be prone to disruptions or failures in those services.
However, I'd rather have a false positive than a missed failure.
Additionally, using a local name resolution cache (e.g. dnsmasq) on the
monitoring server can avoid flooding a configured DNS server with
hundreds of additional requests per minute, and can provide some
level of isolation from intermittent upstream DNS failures (configurable
to degree and preference).
This leaves us with the potential cases where we observed a monitored
failure, or are following up on a failure that was missed in monitoring -
and in either case, would like to know more details as to "why". A logical
first step for me here would involve determining which IP address was being
used - something that would otherwise have to be determined afterwards, and
even then, may not be readily available with any level of confidence
after the fact, especially for an intermittent failure.
Thank you for your continued consideration.