Checks In Depth
All NSClient++ checks — check_cpu, check_drivesize, check_service, check_eventlog, and dozens more — share one
common engine. Once you understand it, you can configure any check.
This page walks through that engine from the ground up. Read it top-to-bottom the first time; later, jump to the section you need.
New to NSClient++?
Try the Quick Start or a Monitoring Scenario first to get a feel for what checks look like in practice.
1. How a Check Works
Every check follows the same five steps:
flowchart LR
A[Collect items] --> B[Filter]
B --> C[Evaluate<br/>warn / crit]
C --> D[Format<br/>top / detail-syntax]
D --> E[Return<br/>status + perfdata]
- Collect — gather a list of items (CPU cores, drives, services, event log entries…).
- Filter — keep only items matching the
filterexpression. - Evaluate — test each kept item against
warnandcrit. - Format — build the message using
top-syntaxanddetail-syntax. - Return — return the worst status, the message, and the performance data.
Each step has its own option. The four most important are:
| Option | Purpose |
|---|---|
filter |
Which items are included |
warn / crit |
Which items trigger an alert |
top-syntax / detail-syntax / ok-syntax |
What the message looks like |
perf-config / perf-syntax |
What performance data looks like |
2. Trying It Out (Test Mode)
The fastest way to learn the engine is to drive a check yourself. NSClient++ ships with a built-in test shell that runs the same modules the service does, but interactively, with all log output visible.
Start the shell
nscp test --settings dummy
...
L client Enter command to inject or exit to terminate...
--settings dummy skips your real configuration so nothing on disk surprises you. Type exit to leave.
Load a module
Out of the box the shell has no checks loaded. Each check lives in a module — load it once per session:
load CheckSystem
Now check_cpu, check_memory, check_uptime, etc. are available:
check_cpu
L client OK: CPU Load ok
L client Performance data: 'total 5m'=0%;80;90 'total 1m'=1%;80;90 'total 5s'=11%;80;90
See what's going on (debug logs)
Trace and debug log lines are hidden by default. Two ways to turn them on:
nscp test --settings dummy --log debug
Use level = trace for maximum verbosity (incoming requests, child-process spawn/exit, response
metadata).
Argument syntax
Options are passed as keyword=value, with quotes when the value contains spaces or shell-meta
characters:
check_cpu filter=none
check_cpu "filter=core = 'total'" "warn=load > 80"
Some keywords are flags with no value (e.g. help, show-all, show-default).
3. Two Indispensable Built-in Helpers
Before changing anything, learn these two commands. They work for every check.
Show defaults
check_cpu show-default
"filter=core = 'total'" "warning=load > 80" "critical=load > 90"
"empty-state=ignored" "top-syntax=${status}: ${problem_list}"
"ok-syntax=%(status): CPU load is ok." "detail-syntax=${time}: ${load}%"
"perf-syntax=${core} ${time}"
Use this to learn what a check does by default and to copy a single option to tweak.
Warning
Don't paste all defaults into your config. Defaults can change in newer versions; pinning them removes that benefit.
Show help
check_cpu help
Gives the full list of options, filter keywords, and example usage for that specific check.
4. A Quick End-to-End Example
The same check_cpu, with progressively more customisation:
# Defaults
check_cpu
OK: CPU load is ok.
'total 5m'=2%;80;90 'total 1m'=5%;80;90 'total 5s'=11%;80;90
# Custom thresholds
check_cpu "warn=load > 50" "crit=load > 70"
OK: CPU load is ok.
'total 5m'=2%;50;70 ...
# Per-core data instead of just the total
check_cpu filter=none "warn=load > 80" "crit=load > 90"
# Custom message
check_cpu "top-syntax=%(status): CPU usage is %(list)" \
"detail-syntax=%(time) avg: %(load)%"
OK: CPU usage is 5m avg: 2%, 1m avg: 5%, 5s avg: 11%
The rest of this page explains exactly what each of those options does.
5. Filters — Choosing What to Check
A filter is an expression evaluated for each item. Items where it is true are included; items where it is false
are dropped.
If you don't supply one, the check uses its default (e.g., check_cpu defaults to core = 'total').
Syntax
keyword operator value
Combine with and, or, not. Disable the default with filter=none.
check_cpu "filter=core = 'total'"
check_cpu "filter=load > 5 or core = 'total'"
check_service "filter=start_type = 'auto' and not name like 'clr_optimization'"
check_cpu filter=none # include everything
Operators
| Symbol | Safe alias | Meaning |
|---|---|---|
= |
eq |
Equals |
!= |
ne |
Not equals |
> < >= <= |
gt lt ge le |
Numeric comparison |
like |
like |
Substring match |
regexp |
regexp |
Regular expression |
in |
in |
Membership in a list |
and or not |
— | Logical |
'...' |
str(...) |
String literal |
Note
Use the safe aliases (gt, lt, …) when passing arguments through NRPE or shells — they avoid </> redirection
problems.
Common keywords
These are available in every check:
| Keyword | Meaning |
|---|---|
count |
Items matching the filter |
total |
Items before filtering |
ok_count / warn_count / crit_count |
Items in each state |
problem_count |
Warning + critical |
status |
Current overall status |
Each check adds its own keywords. For check_cpu: core, core_id, load, idle, kernel, time. Use
<check> help to discover them.
Practical filter recipes
# CPU: only the aggregate total
check_cpu "filter=core = 'total'"
# CPU: cores actually doing work, plus the total
check_cpu filter=none "filter=load > 5 or core = 'total'"
# Services: only auto-start, exclude one by name
check_service "filter=start_type = 'auto' and name != 'Spooler'"
# Event log: errors and criticals only
check_eventlog "filter=level in ('error', 'critical')"
# Disk: fixed and network drives
check_drivesize drive=* "filter=type in ('fixed', 'remote')"
Size and time units
| Suffix | Size | Time |
|---|---|---|
k / m / g / t |
KB / MB / GB / TB | — |
s / m / h / d / w |
— | seconds / minutes / hours / days / weeks |
check_memory "warn=free < 4g"
check_uptime "warn=uptime < 1d"
Past times use negative values — -1h means "less than an hour ago":
check_eventlog scan-range=-24h "crit=written > -1h"
6. Thresholds — Choosing What's a Problem
warn and crit use the same expression language as filters. The difference: instead of including/excluding items,
they decide which items count as alerts.
- Any item matches
crit→ result is CRITICAL - Any item matches
warn(and none matchedcrit) → WARNING - Otherwise → OK
check_cpu "warn=load > 80" "crit=load > 90"
check_memory "warn=free < 20%" "crit=free < 10%"
check_drivesize "warn=free < 15%" "crit=free < 5%"
Disabling a threshold
check_cpu warning=none
check_cpu critical=none
Composite thresholds
# Warn on high kernel time OR high load
check_cpu filter=none "warn=kernel > 10 or load > 80" "crit=load > 90"
# Warn only on big machines that are tight on memory
check_memory "warn=free < 4g and size > 16g"
Aggregate thresholds
The count family of keywords lets you alert on totals rather than individual items:
# Alert if more than 3 services are stopped
check_service "crit=problem_count > 3"
# Alert if any matching event was written in the last hour
check_eventlog scan-range=-1w "crit=written > -1h"
empty-state — when nothing matches
What status to return when the filter selects no items:
| Value | Meaning |
|---|---|
ok |
Return OK (default for most checks) |
warning |
Return WARNING |
critical |
Return CRITICAL |
ignored |
Suppress the result |
check_service "filter=name = 'NonExistentService'" empty-state=ok
7. Output Syntax — Choosing the Message Text
Three options shape the message. They affect only the human-readable text — never the status or perfdata.
| Option | When applied | Default purpose |
|---|---|---|
top-syntax |
Always — the whole message | Status + list of problems |
detail-syntax |
Per item, inside the list | Per-item values |
ok-syntax |
When status is OK | Brief "all ok" message |
check_cpu defaults:
top-syntax=${status}: ${problem_list}
ok-syntax=%(status): CPU load is ok.
detail-syntax=${time}: ${load}%
Template variables
Variables can be written ${name} or %(name) — they're equivalent.
Note
On Unix shells ${...} may be eaten by the shell. Prefer %(...) when the command travels through NRPE or a shell.
Common variables (every check):
| Variable | Meaning |
|---|---|
${status} |
OK / WARNING / CRITICAL / UNKNOWN |
${list} |
All filtered items, joined with detail-syntax |
${problem_list} |
Only warning/critical items |
${ok_list} / ${warn_list} / ${crit_list} |
Items in each state |
${count} / ${problem_count} |
Item counts |
Per-item variables (in detail-syntax) depend on the check — <check> help lists them.
Recipes
# Show all values, not just problems
check_cpu show-all
# equivalent to:
check_cpu "top-syntax=%(status): %(list)"
# Custom CPU message
check_cpu time=5m \
"top-syntax=%(status): Cpu usage is %(list)" "detail-syntax=%(load)%"
OK: Cpu usage is 26%
# Custom memory message
check_memory "top-syntax=${list}" \
"detail-syntax=${type} free: ${free} used: ${used} size: ${size}"
page free: 16G used: 7.98G size: 24G, physical free: 4.18G used: 7.8G size: 12G
# Service: name and state for each
check_service "top-syntax=${list}" "detail-syntax=${name}: ${state}"
8. Performance Data
Performance data is the machine-readable metrics used for graphing, in the standard Nagios format:
'metric_name'=value[unit];[warn];[crit];[min];[max]
Example:
'total 5m'=2%;80;90 'total 1m'=5%;80;90 'total 5s'=11%;80;90
Perfdata needs thresholds
Without warn/crit, perfdata values are emitted but empty:
check_cpu warning=none critical=none
'total 5m'= 'total 1m'= 'total 5s'=
To get values without alerts, use perf-config to mark metrics as "extra":
check_cpu warning=none critical=none "perf-config=extra(load)"
Customising with perf-config
perf-config works like a tiny stylesheet — selectors target metrics, keys transform them.
"perf-config=selector(key:value; key:value) selector2(key:value)"
| Key | Effect |
|---|---|
unit |
Force a unit (G, M, K, %, ms, …) |
ignored |
true → drop this metric |
prefix / suffix |
Rename parts of the metric name |
minimum / min |
Force the perfdata min field (use min as a shorthand) |
maximum / max |
Force the perfdata max field — useful for graphing systems that auto-fit |
Selectors match in order of specificity: prefix.object.suffix → prefix.object → object.suffix → prefix →
suffix → object. The * selector matches everything.
Recipes
# Lock memory metrics to GB (avoids graph jumps when auto-scaling switches units)
check_memory "perf-config=*(unit:G)"
# Lock disk metrics to GB
check_drivesize "perf-config=*(unit:G)"
# Drop the percent metrics from check_drivesize, keep the absolute ones
check_drivesize "perf-config=used %(ignored:true)"
# Rename: drop suffix label, force GB
check_drivesize "perf-config=used.used(unit:G;suffix:'') used %(ignored:true)"
'C:\'=213G;178;201;0;223 'D:\'=400G;372;419;0;465
# Force min/max bounds on a counter that doesn't know its own range
# (e.g. a raw PDH counter exposed by check_pdh). The graphing system can
# then auto-scale to the declared range instead of guessing from history.
check_pdh "counter=\\Processor(_Total)\\% Processor Time" \
"perf-config=*(minimum:0;maximum:100)"
# Same idea on a custom queue-depth counter, with `min`/`max` shorthand.
check_pdh counter=queue_depth "perf-config=queue_depth(min:0;max:12345)"
Inspecting performance data
render_perf remove-perf command=check_drivesize
OK: OK:
C:\ used 213.605 GB 178.777 201.124 223.471 0
C:\ used % 95 % 79 89 100 0
perf-syntax — naming metrics
perf-syntax controls the metric name (not its value), using the same template variables as detail-syntax:
check_cpu "perf-syntax=${core} ${time}" # 'total 5m'=...
check_cpu "perf-syntax=${core}_${time}" # 'total_5m'=...
Useful when your graphing system is picky about names.
9. Putting It Together
Pick a check, run show-default, identify the option you want to change, change just that one. Repeat.
# Default behaviour
check_drivesize
# Step 1 — only fixed disks
check_drivesize "filter=type = 'fixed'"
# Step 2 — tighter thresholds
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5"
# Step 3 — clean message
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5" \
"top-syntax=%(status): %(list)" \
"detail-syntax=%(drive_or_id) %(free) free of %(size)"
# Step 4 — graph-friendly perfdata
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5" \
"perf-config=*(unit:G) used %(ignored:true)"