dermoth at aei.ca
Sat Feb 6 16:42:42 CET 2021
On 2019-11-27 08:45, Csaba Dobo wrote:
> I am investigating this plugin and would like to know the calculating
> -w, --warning=WLOAD1,WLOAD5,WLOAD15
>     Exit with WARNING status if load average exceeds WLOADn
> -c, --critical=CLOAD1,CLOAD5,CLOAD15
>     Exit with CRITICAL status if load average exceed CLOADn
> the load average format is the same used by "uptime" and "w"
> So when the system reports 3 values from, e.g., uptime, they would be
> read by the plugin. And what is the evaluation logic?
This plugin simply returns WARNING or CRITICAL when the load is
above the corresponding warning or critical threshold. Each threshold is
given as a floating-point number. The plugin is very lax about
missing thresholds and behaves as follows:
1. Missing LOAD5 and/or LOAD15 value (for either threshold): back-filled
from the last given threshold value (LOAD1 or LOAD5).
2. Missing warning or critical value: assumed to be 0 (probably not desired).
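The back-fill behaviour described above can be illustrated with a small Python sketch. This is not the plugin's actual code (the plugin is not written in Python); the names parse_thresholds and evaluate are hypothetical, and the strict "greater than" comparison is an assumption based on the words "above" and "exceeds".

```python
def parse_thresholds(spec):
    """Parse 'LOAD1[,LOAD5[,LOAD15]]' into three floats.

    Missing trailing values are back-filled from the last given value.
    A missing threshold (None or empty) is treated as 0 for all three,
    mirroring the "assume 0" behaviour described in the post.
    """
    if not spec:
        return [0.0, 0.0, 0.0]
    values = [float(v) for v in spec.split(",") if v.strip()][:3]
    if not values:
        return [0.0, 0.0, 0.0]
    while len(values) < 3:
        values.append(values[-1])  # back-fill from the last given value
    return values


def evaluate(load, warning=None, critical=None):
    """Return 'OK', 'WARNING' or 'CRITICAL' for a (load1, load5, load15) tuple.

    Assumption: a threshold is breached when the load is strictly above it.
    """
    warn = parse_thresholds(warning)
    crit = parse_thresholds(critical)
    if any(l > c for l, c in zip(load, crit)):
        return "CRITICAL"
    if any(l > w for l, w in zip(load, warn)):
        return "WARNING"
    return "OK"
```

With this sketch, evaluate((0.5, 0.5, 0.5), warning="2") reports CRITICAL because the omitted critical threshold defaults to 0, which is why leaving one out is probably not what you want.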
The load average is the average number of processes on the run queue over
the last 1, 5 and 15 minutes. That number includes currently running
processes as well as those waiting to run (usually the case when the load
is greater than the number of CPUs/cores) and, most importantly, processes
blocked in uninterruptible sleep (e.g. blocked on I/O).
For a purely CPU-bound load, a number equal to the number of cores simply
means you're fully utilizing your system resources. Below it you are
under-utilizing them, and above it you have processing contention. For I/O
load it depends on your I/O capacity, and load average isn't the best way
to monitor specific block device usage (especially if you have multiple
devices, as it doesn't tell you which one processes are blocked on).
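As a rough illustration of the CPU-bound case, the comparison against the core count can be sketched in Python. The function name classify_cpu_load and the three labels are hypothetical; os.getloadavg() and os.cpu_count() are standard library calls, and os.getloadavg() is only available on Unix-like systems. The sketch says nothing about I/O-bound load, for the reasons given above.

```python
import os


def classify_cpu_load(load, cores):
    """Classify a load average against the core count (CPU-bound case only)."""
    per_core = load / cores
    if per_core < 1.0:
        return "under-utilized"
    if per_core == 1.0:
        return "fully utilized"
    return "contended"


if __name__ == "__main__":
    # The same three values that "uptime" and "w" report.
    load1, load5, load15 = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    for label, value in (("1 min", load1), ("5 min", load5), ("15 min", load15)):
        print(label, value, classify_cpu_load(value, cores))
```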
Since load often consists of a mix of the two, you have to determine
the right value for your specific workload, and it's usually best when
combined with other monitoring methods (like user/system CPU cycles,
context switch rate and per-device I/O count/average service time). On
most systems those metrics need a running daemon like sadc (sysstat) or
snmpd to collect them because, unlike load average, they cannot just be
read in an instant (plugins that offer this will often just poll for a
very short time, between 500 ms and 2 seconds, which isn't a representative
value and doesn't scale when you need to poll many thousands of machines).