Checks In Depth
All NSClient++ checks — check_cpu, check_drivesize, check_service, check_eventlog, and dozens more — share one
common engine. Once you understand it, you can configure any check.
This page walks through that engine from the ground up. Read it top-to-bottom the first time; later, jump to the section you need.
New to NSClient++?
Try the Quick Start or a Monitoring Scenario first to get a feel for what checks look like in practice.
1. How a Check Works
Every check follows the same five steps:
flowchart LR
A[Collect items] --> B[Filter]
B --> C[Evaluate<br/>warn / crit]
C --> D[Format<br/>top / detail-syntax]
D --> E[Return<br/>status + perfdata]
- Collect — gather a list of items (CPU cores, drives, services, event log entries…).
- Filter — keep only items matching the
filterexpression. - Evaluate — test each kept item against
warnandcrit. - Format — build the message using
top-syntaxanddetail-syntax. - Return — return the worst status, the message, and the performance data.
Each step has its own option. The four most important are:
| Option | Purpose |
|---|---|
filter |
Which items are included |
warn / crit |
Which items trigger an alert |
top-syntax / detail-syntax / ok-syntax |
What the message looks like |
perf-config / perf-syntax |
What performance data looks like |
2. Trying It Out (Test Mode)
The fastest way to learn the engine is to drive a check yourself. NSClient++ ships with a built-in test shell that runs the same modules the service does, but interactively, with all log output visible.
2.1. Start the shell
nscp test --settings dummy
...
L client Enter command to inject or exit to terminate...
--settings dummy skips your real configuration so nothing on disk surprises you. Type exit to leave.
Trace and debug log lines are hidden by default. Add --log debug (or --log trace for maximum
verbosity, including incoming requests and child-process spawn/exit) when you need to see them:
nscp test --settings dummy --log debug
2.2. Load a module
Out of the box the shell has no checks loaded. Each check lives in a module — load it once per session, then run the check:
load CheckSystem
2.3. Run a check
Now we can run the check:
check_cpu
L client OK: CPU Load ok
L client Performance data: 'total 5m'=0%;80;90 'total 1m'=1%;80;90 'total 5s'=11%;80;90
Options are passed as keyword=value, with quotes when the value contains spaces or shell-meta
characters. Some keywords are flags with no value (e.g. help, show-all, show-default):
check_cpu filter=none
check_cpu "filter=core = 'total'" "warn=load > 80"
2.4. Two essential helpers: show-default and help
Before changing anything, learn these two — they work for every check.
show-default prints the check's built-in options so you can learn what it does by default and copy
a single line to tweak:
check_cpu show-default
"filter=core = 'total'" "warning=load > 80" "critical=load > 90"
"empty-state=ignored" "top-syntax=${status}: ${problem_list}"
"ok-syntax=%(status): CPU load is ok." "detail-syntax=${time}: ${load}%"
"perf-syntax=${core} ${time}"
help gives the full list of options, filter keywords, and per-check examples:
check_cpu help
Note
You'll see ${name} in the defaults output. That's the older placeholder syntax, still supported
for compatibility. When you write your own configuration, prefer %(name) — see
section 5.
Warning
Don't paste all defaults into your config. Defaults can change in newer versions; pinning them removes that benefit.
2.5. A worked example
The same check_cpu, with progressively more customisation:
# Defaults
check_cpu
OK: CPU load is ok.
'total 5m'=2%;80;90 'total 1m'=5%;80;90 'total 5s'=11%;80;90
# Custom thresholds
check_cpu "warn=load > 50" "crit=load > 70"
OK: CPU load is ok.
'total 5m'=2%;50;70 ...
# Per-core data instead of just the total
check_cpu filter=none "warn=load > 80" "crit=load > 90"
# Custom message
check_cpu "top-syntax=%(status): CPU usage is %(list)" \
"detail-syntax=%(time) avg: %(load)%"
OK: CPU usage is 5m avg: 2%, 1m avg: 5%, 5s avg: 11%
The rest of this page explains exactly what each of those options does.
3. Filters — Choosing What to Check
A filter is an expression evaluated for each item. Items where it is true are included; items where it is false
are dropped.
If you don't supply one, the check uses its default (e.g., check_cpu defaults to core = 'total').
Syntax
keyword operator value
Combine with and, or, not. Disable the default with filter=none.
check_cpu "filter=core = 'total'"
check_cpu "filter=load > 5 or core = 'total'"
check_service "filter=start_type = 'auto' and not name like 'clr_optimization'"
check_cpu filter=none # include everything
Operators
| Symbol | Safe alias | Meaning |
|---|---|---|
= |
eq |
Equals |
!= |
ne |
Not equals |
> < >= <= |
gt lt ge le |
Numeric comparison |
like |
— | Substring match |
regexp |
— | Regular expression |
in |
— | Membership in a list |
and or not |
— | Logical |
'...' |
str(...) |
String literal |
Note
Use the safe aliases (gt, lt, …) when passing arguments through NRPE or shells — they avoid </> redirection
problems. The same expression language powers filter, warn, crit, and the %(...) placeholders
in top-syntax / detail-syntax, so anything you learn here applies everywhere.
Common keywords
These are available in every check:
| Keyword | Meaning |
|---|---|
count |
Items matching the filter |
total |
Items before filtering |
ok_count / warn_count / crit_count |
Items in each state |
problem_count |
Warning + critical |
status |
Current overall status |
Each check adds its own keywords. For check_cpu: core, core_id, load, idle, kernel, time. Use
<check> help to discover them.
Practical filter recipes
# CPU: only the aggregate total
check_cpu "filter=core = 'total'"
# CPU: cores actually doing work, plus the total
check_cpu filter=none "filter=load > 5 or core = 'total'"
# Services: only auto-start, exclude one by name
check_service "filter=start_type = 'auto' and name != 'Spooler'"
# Event log: errors and criticals only
check_eventlog "filter=level in ('error', 'critical')"
# Disk: fixed and network drives
check_drivesize drive=* "filter=type in ('fixed', 'remote')"
Size and time units
| Suffix | Size | Time |
|---|---|---|
k / m / g / t |
KB / MB / GB / TB | — |
s / m / h / d / w |
— | seconds / minutes / hours / days / weeks |
check_memory "warn=free < 4g"
check_uptime "warn=uptime < 1d"
Past times use negative values — -1h means "less than an hour ago":
check_eventlog scan-range=-24h "crit=written > -1h"
4. Thresholds — Choosing What's a Problem
warn and crit use the same expression language as filters — same operators, same keywords,
same and/or/not. The difference is what the expression decides:
| Expression | Per-item question | Effect when true |
|---|---|---|
filter |
Should I include this item at all? | Item is kept (or dropped) |
warn |
Is this item a warning? | Overall status becomes WARNING |
crit |
Is this item critical? | Overall status becomes CRITICAL |
Aggregation rules:
- Any item matches
crit→ result is CRITICAL - Any item matches
warn(and none matchedcrit) → WARNING - Otherwise → OK
check_cpu "warn=load > 80" "crit=load > 90"
check_memory "warn=free < 20%" "crit=free < 10%"
check_drivesize "warn=free < 15%" "crit=free < 5%"
Disabling a threshold
check_cpu warning=none
check_cpu critical=none
Composite thresholds
# Warn on high kernel time OR high load
check_cpu filter=none "warn=kernel > 10 or load > 80" "crit=load > 90"
# Warn only on big machines that are tight on memory
check_memory "warn=free < 4g and size > 16g"
Aggregate thresholds
The count family of keywords lets you alert on totals rather than individual items:
# Alert if more than 3 services are stopped
check_service "crit=problem_count > 3"
# Alert if any matching event was written in the last hour
check_eventlog scan-range=-1w "crit=written > -1h"
empty-state — when nothing matches
What status to return when the filter selects no items:
| Value | Meaning |
|---|---|
ok |
Return OK (default for most checks) |
warning |
Return WARNING |
critical |
Return CRITICAL |
ignored |
Suppress the result |
check_service "filter=name = 'NonExistentService'" empty-state=ok
5. Output Syntax — Choosing the Message Text
Three options shape the message. They affect only the human-readable text — never the status or perfdata.
| Option | When applied | Default purpose |
|---|---|---|
top-syntax |
Always — the whole message | Status + list of problems |
detail-syntax |
Per item, inside the list | Per-item values |
ok-syntax |
When status is OK | Brief "all ok" message |
check_cpu defaults (still using the legacy ${...} form for top-syntax / detail-syntax; written
the recommended way they'd be top-syntax=%(status): %(problem_list) and
detail-syntax=%(time): %(load)%):
"top-syntax=${status}: ${problem_list}"
"ok-syntax=%(status): CPU load is ok."
"detail-syntax=${time}: ${load}%"
Template variables
NSClient++ has two placeholder forms. Use %(name) — it's the modern, more capable syntax.
| Form | Status | When it makes sense |
|---|---|---|
%(name) |
Preferred | Always. Survives shells/NRPE, supports nested parens and function calls. |
${name} |
Legacy | Still works for plain variable references; kept for backwards compatibility. |
Differences in practice:
- Plain variable references — both forms work and produce identical output.
- Function calls (see section 6) — only
%(...)works. The${...}form stops at the first}and can't capture nested parentheses. - Shells and NRPE —
${...}is often eaten by Bash and similar shells before it reaches NSClient++.%(...)passes through untouched.
The rest of this document uses %(...) in every example. ${...} is documented only where it
appears in defaults so you can recognise it in older configs.
Common variables (every check):
| Variable | Meaning |
|---|---|
%(status) |
OK / WARNING / CRITICAL / UNKNOWN |
%(list) |
All filtered items, joined with detail-syntax |
%(problem_list) |
Only warning/critical items |
%(ok_list) / %(warn_list) / %(crit_list) |
Items in each state |
%(count) / %(problem_count) |
Item counts |
Per-item variables (in detail-syntax) depend on the check — <check> help lists them.
Recipes
# Show all values, not just problems
check_cpu show-all
# equivalent to:
check_cpu "top-syntax=%(status): %(list)"
# Custom CPU message
check_cpu time=5m \
"top-syntax=%(status): Cpu usage is %(list)" "detail-syntax=%(load)%"
OK: Cpu usage is 26%
# Custom memory message
check_memory "top-syntax=%(list)" \
"detail-syntax=%(type) free: %(free) used: %(used) size: %(size)"
page free: 16G used: 7.98G size: 24G, physical free: 4.18G used: 7.8G size: 12G
# Service: name and state for each
check_service "top-syntax=%(list)" "detail-syntax=%(name): %(state)"
6. Functions — Transforming Values
Sometimes a raw value isn't what you want to look at. A counter that returns bytes per second is
unfriendly to read as 4194304, and a threshold expressed in MB is easier to maintain than one
expressed in 1024-scaled bytes. NSClient++ supports functions for these jobs, and the same
function works in both contexts:
- inside
detail-syntax/top-syntaxto format a value for display - inside
warn/crit/filterto derive a value for comparison
Calling a function
A function call looks like a normal function: name(arg1, arg2, …). Arguments can mix variables,
string literals, and numbers. Always wrap the call in %(...) when using it inside a syntax
template:
detail-syntax = "Used: %(format_bytes(used))"
warning = "convert_bytes(used, 'MB') > 500"
filter = "scale(rate, 1000000) > 100"
Function calls require %(...)
The legacy ${...} placeholder stops at the first } and can't capture nested parentheses, so
${format_bytes(used)} won't parse. Use %(...) for function calls (and prefer it everywhere
else — see section 5).
Built-in functions
These are available wherever the check exposes them — check_pdh ships them today; other checks
add their own. Run <check> help to see the list for a given check.
| Function | Returns | Purpose |
|---|---|---|
format_bytes(value) |
string | Auto-scaled human-readable bytes — 4194304 → "4MB" (1024-based) |
format_bytes(value, 'MB') |
string | Fixed unit. Units: B, K/KB, M/MB, G/GB, T/TB |
convert_bytes(value, 'MB') |
float | Numeric value in the named unit — use in thresholds |
scale(value, divisor) |
float | Divide by an arbitrary divisor — for decimal units (Mbps, etc.) |
Recipes
# Show raw bytes as MB/GB/etc. in the message
check_pdh counter=disk_bytes \
"detail-syntax=%(alias) is %(format_bytes(value))"
# Threshold in MB, display in human-friendly units
check_pdh counter=memory_bytes \
"warning=convert_bytes(value, 'MB') > 500" \
"detail-syntax=%(alias) = %(format_bytes(value))"
# Network rates — Mbps is decimal (10⁶), use scale()
check_pdh counter=bytes_per_sec \
"detail-syntax=Speed = %(scale(value, 1000000)) Mbps"
# Combine functions and plain variables in one template
check_pdh counter=disk_writes \
"detail-syntax=%(counter): %(format_bytes(value)) (%(value) raw bytes)"
How it composes with the rest of the language
Functions are first-class values, so the result composes with operators, and/or, and not:
# Warn if used MB AND free MB are both at risk
check_disk \
"warn=convert_bytes(used, 'MB') > 800 and convert_bytes(free, 'MB') < 100"
# Critical when ANY of two derived values is over budget
check_pdh ... \
"crit=scale(read_rate, 1000000) > 50 or scale(write_rate, 1000000) > 50"
The function result has a regular type (string or float), so it slots into comparisons exactly like
a plain variable reference. The arguments themselves can be variables, literals, or — for nested
calls — other function calls: %(format_bytes(scale(value, 1024))) is valid.
Variable-style shortcuts
Some checks expose pre-scaled "view variables" for the most common cases — check_pdh provides
value_human, value_mb, value_gb, etc. These are syntactic sugar for the corresponding function
calls:
# Variable-style shortcut
check_pdh counter=mem "detail-syntax=Mem: %(value_human)"
check_pdh counter=mem "warning=value_mb > 500"
# Equivalent function calls
check_pdh counter=mem "detail-syntax=Mem: %(format_bytes(value))"
check_pdh counter=mem "warning=convert_bytes(value, 'MB') > 500"
Reach for variables when one of the prebuilt units fits; reach for functions when you need a custom unit, a custom divisor, or composition with other expressions.
7. Performance Data
Performance data is the machine-readable metrics used for graphing, in the standard Nagios format:
'metric_name'=value[unit];[warn];[crit];[min];[max]
Example:
'total 5m'=2%;80;90 'total 1m'=5%;80;90 'total 5s'=11%;80;90
Perfdata needs thresholds
Without warn/crit, perfdata values are emitted but empty:
check_cpu warning=none critical=none
'total 5m'= 'total 1m'= 'total 5s'=
To get values without alerts, use perf-config to mark metrics as "extra":
check_cpu warning=none critical=none "perf-config=extra(load)"
Customising with perf-config
perf-config works like a tiny stylesheet — selectors target metrics, keys transform them.
"perf-config=selector(key:value; key:value) selector2(key:value)"
| Key | Effect |
|---|---|
unit |
Force a unit (G, M, K, %, ms, …) |
ignored |
true → drop this metric |
prefix / suffix |
Rename parts of the metric name |
minimum / min |
Force the perfdata min field (use min as a shorthand) |
maximum / max |
Force the perfdata max field — useful for graphing systems that auto-fit |
Selectors match in order of specificity: prefix.object.suffix → prefix.object → object.suffix → prefix →
suffix → object. The * selector matches everything.
Recipes
# Lock memory metrics to GB (avoids graph jumps when auto-scaling switches units)
check_memory "perf-config=*(unit:G)"
# Lock disk metrics to GB
check_drivesize "perf-config=*(unit:G)"
# Drop the percent metrics from check_drivesize, keep the absolute ones
check_drivesize "perf-config=used %(ignored:true)"
# Rename: drop suffix label, force GB
check_drivesize "perf-config=used.used(unit:G;suffix:'') used %(ignored:true)"
'C:\'=213G;178;201;0;223 'D:\'=400G;372;419;0;465
# Force min/max bounds on a counter that doesn't know its own range
# (e.g. a raw PDH counter exposed by check_pdh). The graphing system can
# then auto-scale to the declared range instead of guessing from history.
check_pdh "counter=\\Processor(_Total)\\% Processor Time" \
"perf-config=*(minimum:0;maximum:100)"
# Same idea on a custom queue-depth counter, with `min`/`max` shorthand.
check_pdh counter=queue_depth "perf-config=queue_depth(min:0;max:12345)"
Inspecting performance data
render_perf remove-perf command=check_drivesize
OK: OK:
C:\ used 213.605 GB 178.777 201.124 223.471 0
C:\ used % 95 % 79 89 100 0
perf-syntax — naming metrics
perf-syntax controls the metric name (not its value), using the same template variables as detail-syntax:
check_cpu "perf-syntax=%(core) %(time)" # 'total 5m'=...
check_cpu "perf-syntax=%(core)_%(time)" # 'total_5m'=...
Useful when your graphing system is picky about names.
8. Putting It Together
Pick a check, run show-default, identify the option you want to change, change just that one. Repeat.
# Default behaviour
check_drivesize
# Step 1 — only fixed disks
check_drivesize "filter=type = 'fixed'"
# Step 2 — tighter thresholds
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5"
# Step 3 — clean message with human-readable sizes (functions from section 6)
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5" \
"top-syntax=%(status): %(list)" \
"detail-syntax=%(drive_or_id) %(format_bytes(free)) free of %(format_bytes(size))"
# Step 4 — graph-friendly perfdata
check_drivesize "filter=type = 'fixed'" \
"warn=free_pct < 15" "crit=free_pct < 5" \
"perf-config=*(unit:G) used %(ignored:true)"