Alerts & Notifications documentation #21540
kanelatechnical wants to merge 273 commits into netdata:master
Conversation
Addressing cubic's comment
…via-config-files.md Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Chapters added with proper subdirectory structure:
- 4: controlling-alerts-noise (5 files)
- 5: receiving-notifications (5 files)
- 6: alert-examples (6 files)
- 7: troubleshooting-alerts (6 files)
- 8: advanced-techniques (6 files)
- 9: apis-alerts-events (6 files)
- 10: cloud-alert-features (5 files)
- 11: built-in-alerts.md
- 12: best-practices.md
- 13: architecture.md

Update map.csv with new entries.
…chapter 1 pattern)
…ers now have subdirectories)
…tices, architecture)
…tching chapters 4-10 pattern)
…13 now 5 sections each)
…practices)
- Add numbered subsections (11.1.x, 11.2.x, 12.1.x, etc.)
- Add :::note boxes for important context
- Add tables for state transitions and comparison data
- Add Related Sections with internal doc links
- Match styling pattern from earlier chapters

- Add numbered subsections (11.1.1, 11.2.1, etc.)
- Add :::note boxes with collector requirements
- Add Related Sections with internal doc links
- Match styling pattern from earlier chapters

- Add numbered subsections (11.3.1, 11.3.2, etc.)
- Add :::note box with collector requirements
- Add Related Sections with internal doc links
- Match styling pattern from other built-in-alerts files
Fixes context names in all built-in-alerts documentation files by verifying against actual source code in src/health/health.d/ and collector integration docs:
- system-resource-alerts.md: Fixed encoding error (disk利用率 -> disk.util)
- container-alerts.md: Fixed Docker contexts (docker.container_state, docker.container_health_status) and Kubernetes contexts (k8s_state.*, k8s.cgroup.*)
- network-alerts.md: Fixed contexts (ping.host_rtt, ping.host_packet_loss, portcheck.status, x509check.*, dns_query.*, httpcheck.*)
- hardware-alerts.md: Fixed contexts (adaptecraid.*, smartctl.*, sensors.*, apcupsd.*, ipmi.sensor_state)
- application-alerts.md: Fixed database and web server contexts

All contexts verified against actual collector documentation and stock health configuration files.
…ation
- alert-types-alarm-vs-template.md: Fix example syntax (disk space -> disk.space, dimensions: -> chart labels:, remove non-existent from/to options)
- disabling-alerts.md: Fix comparison operators (= -> ==)
- core-system-alerts.md: Fix network errors context (net.net -> net.errors)
…nced alerts
- custom-actions.md: Fix non-existent health.service context (use systemd.service_unit_state). Fix exec documentation: arguments are positional, not environment variables. Document all 34 positional arguments with their meanings.
- hysteresis.md: Fix status constants missing $ prefix (WARNING -> $WARNING)
Add required blank lines after opening ::: markers and before closing ::: for proper Markdown rendering.
- Fix disk.chart syntax: change disk.space./ to disk_space./ (chart IDs use underscore, contexts use dot)
- Fix alarm vs template: use template: for context-based rules, alarm: for chart-specific rules
- Add index pages for built-in-alerts, best-practices, and architecture

Ref: netdata#21333

- 1-system-resource: fix invalid disk context (was disk, now disk.backlog)
- 2-container: remove non-existent k8s_state contexts (k8s_state.pod_condition, k8s_state.pod_container_restarts, k8s_state.node_condition)
- 3-application: remove non-existent mysql.innodb contexts
- 4-network: remove non-existent dns_query.query_time, fix httpcheck.response_time to use httpcheck.status dimension
- 5-hardware: remove non-existent smartctl and sensors contexts, remove ups_input_voltage context

All contexts verified against source code in src/health/health.d/

- Remove undocumented 'health check' terminology
- Add RAISED status to documented statuses (RRDCALC_STATUS_RAISED = 2 in source)
- Updated status counts from 6 to 7
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="docs/alerts/understanding-alerts/1-what-is-a-netdata-alert.md">
<violation number="1" location="docs/alerts/understanding-alerts/1-what-is-a-netdata-alert.md:102">
P3: This change makes the chapter inconsistent with its README, which still says alerts have only five statuses. Update the README (and any other summary text) to align with the new seven-status definition, or adjust the list here if seven is not intended.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
## Key Takeaways

- A Netdata alert is a **rule that monitors metrics from charts** and assigns a status
- Alerts have **seven possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `RAISED`, `WARNING`, `CRITICAL`, `REMOVED`
P3: This change makes the chapter inconsistent with its README, which still says alerts have only five statuses. Update the README (and any other summary text) to align with the new seven-status definition, or adjust the list here if seven is not intended.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/alerts/understanding-alerts/1-what-is-a-netdata-alert.md, line 102:
<file context>
@@ -98,7 +99,7 @@ For a given alert definition applied to a given chart instance, there is one **a
- A Netdata alert is a **rule that monitors metrics from charts** and assigns a status
-- Alerts have **six possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `WARNING`, `CRITICAL`, `REMOVED`
+- Alerts have **seven possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `RAISED`, `WARNING`, `CRITICAL`, `REMOVED`
- Status transitions become **alert events** visible locally and in Netdata Cloud
- Alerts can be **chart-specific (alarms)** or **context-based (templates)**
</file context>
- Remove false YAML-format claims from alert type documentation (no YAML-based unified alert format exists in source code)
- Fix incorrect loading order in alert precedence documentation (custom config loads before stock, not vice versa)
- Clarify file-shadowing behavior (entire stock file skipped when same-named user file exists, not per-alert override)

Verified against src/libnetdata/paths/paths.c (config loading logic), src/health/health_prototypes.c (dyncfg override behavior), and RRDCALC_STATUS enum in src/health/rrdcalc.h.
RRDCALC_STATUS enum defines seven status values: REMOVED(-2), UNDEFINED(-1), UNINITIALIZED(0), CLEAR(1), RAISED(2), WARNING(3), CRITICAL(4). The docs previously listed only five, omitting REMOVED and UNINITIALIZED.
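For reviewers cross-checking the status tables in the docs, the values listed in this commit message can be mirrored in a small Python enum. This is only an illustrative sketch: the `AlertStatus` class and `status_names_by_value` helper are hypothetical names, and the numbers are copied from the commit message above rather than read from src/health/rrdcalc.h.

```python
from enum import IntEnum


class AlertStatus(IntEnum):
    """Hypothetical Python mirror of the seven values quoted in the commit message."""

    REMOVED = -2
    UNDEFINED = -1
    UNINITIALIZED = 0
    CLEAR = 1
    RAISED = 2
    WARNING = 3
    CRITICAL = 4


def status_names_by_value() -> list[str]:
    """Return status names ordered by their numeric value, lowest first."""
    return [status.name for status in sorted(AlertStatus)]


if __name__ == "__main__":
    # Print one status per line so the list can be compared with a docs table.
    for name in status_names_by_value():
        print(f"{AlertStatus[name].value:>2}  {name}")
```

A docs check could compare the length of this enum against the status count stated in each chapter summary.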
Removed assertions that lacked source code backing:
- "Both will eventually be consolidated into single unified alert type"
- "The alarm syntax is essentially a subset of template functionality"
- "For new alert definitions, use template exclusively"

Both alarm and template syntax are actively maintained with equal support in source code. Future roadmap claims require source evidence.
- Replace non-functional `families:` directive with working `chart labels:` directive (`families` is recognized but silently ignored by health engine per health_config.c)
- Fix `host:` to correct `host labels:` directive (with space)
- Add usage examples to clarify filtering syntax

Source verified via src/health/reference.md showing correct directives:
- `host labels: room = server`
- `chart labels: mount_point=/mnt/disk1`
…g guide
- Replace `families:` with `chart labels:` for targeting specific instances (`families:` is recognized but never used per health_config.c:842-846)
- Fix inverted loading order: custom first, stock second (not vice versa). Verified via src/libnetdata/paths/paths.c:262 and :306
- Clarify file shadowing behavior (entire file skipped, not per-alert override)
The ok: line is not a valid Netdata health configuration directive. Source code confirms only warn: and crit: are recognized expression lines. This corrects inaccurate documentation that could mislead users.
The ok: line is not a valid Netdata health configuration directive. Only calc:, warn:, and crit: are valid expression lines.
The enabled: no directive is only valid globally in netdata.conf [health]. It does NOT work at the individual alert/template level. Correct methods for disabling specific alerts: pattern matching in netdata.conf (enabled alarms = !pattern), or leaving out warn:/crit: so the alert never triggers.
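The pattern-based approach mentioned above can be illustrated with a rough Python model of how a negated pattern excludes matching alerts. This is a sketch under assumptions, not Netdata's matcher: it assumes patterns are space-separated, evaluated left to right with the first match deciding, and that a name matching nothing is treated as not enabled. `alert_enabled` is a hypothetical helper, and `fnmatchcase` only approximates the `*` wildcard (it also interprets `?` and `[...]`).

```python
from fnmatch import fnmatchcase


def alert_enabled(alert_name: str, pattern_list: str) -> bool:
    """Approximate left-to-right pattern evaluation; the first match decides."""
    for pattern in pattern_list.split():
        negative = pattern.startswith("!")
        glob = pattern[1:] if negative else pattern
        if fnmatchcase(alert_name, glob):
            # A negated pattern disables the alert, a plain one enables it.
            return not negative
    # Assumption: a name that matches no pattern is treated as not enabled.
    return False


if __name__ == "__main__":
    patterns = "!mysql_* *"  # exclude mysql_* alerts, keep everything else
    for name in ("mysql_10s_slow_queries", "disk_space_usage"):
        print(name, alert_enabled(name, patterns))
```

Under this model, a lone `!mysql_*` with no trailing `*` would leave every alert unmatched, which is why the example list ends with a catch-all.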
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="docs/alerts/controlling-alerts-noise/1-disabling-alerts.md">
<violation number="1" location="docs/alerts/controlling-alerts-noise/1-disabling-alerts.md:35">
P2: The “disable a specific alert” example no longer disables evaluation; it defines `on/lookup/calc` and only omits thresholds, which is closer to “keep alert loaded but never trigger.” This conflicts with the section’s definition of disabling (stop evaluation entirely) and can mislead readers. Restore the explicit `enabled: no` override (or otherwise remove evaluation inputs) to show a true disable.</violation>
</file>
# Disable stock alert that doesn't apply to our environment
# Use pattern matching in netdata.conf: enabled alarms = !mysql_*

# Or create an override that provides no evaluation logic:
P2: The “disable a specific alert” example no longer disables evaluation; it defines on/lookup/calc and only omits thresholds, which is closer to “keep alert loaded but never trigger.” This conflicts with the section’s definition of disabling (stop evaluation entirely) and can mislead readers. Restore the explicit enabled: no override (or otherwise remove evaluation inputs) to show a true disable.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/alerts/controlling-alerts-noise/1-disabling-alerts.md, line 35:
<file context>
@@ -30,11 +30,14 @@ Add the alert you want to disable:
+# Use pattern matching in netdata.conf: enabled alarms = !mysql_*
-# Disable by setting enabled to no
+# Or create an override that provides no evaluation logic:
template: mysql_10s_slow_queries
- enabled: no
</file context>
…ables Section 8.4 documents positional arguments ($1-$33), not env vars. Clarifies that exec: line triggers custom scripts with positional params.
The path /var/log/netdata/health.log does not exist. Health-related logs go to /var/log/netdata/error.log.
Documentation claimed existence of $this(-1h) and $this(-5m) syntax for accessing past values of $this. This syntax does not exist in the codebase and was never implemented in the EVAL lexer/parser. Removed the misleading examples for: - Memory Leak Detection ($this - $this(-1h)) - Network Traffic Rate of Change (abs($this - $this(-5m))) Users who need time-based comparison should use the dual-template pattern shown in the Disk Days Remaining example, or explicit time-window lookups.
The doc incorrectly suggested checking health.log for notification failures.
Notification delivery errors are logged to NDLS_DAEMON facility (daemon.log),
not NDLS_HEALTH (health.log). Updated to reference daemon.log or error.log.
Also removed hardcoded path in favor of ${NETDATA_LOG_DIR} variable placeholder.
- Removed fake 'ok:' configuration line from hysteresis example
  * 'ok:' is not a valid Netdata health directive (only 'warn' and 'crit' exist)
  * Alarms automatically clear when warn/crit conditions are no longer true
- Fixed JSON quoting in PagerDuty example: escaped double quotes in shell double-quoted strings are problematic - changed to use proper nesting
Changed INFO to CLEAR in severity table - INFO is not an alert status in Netdata, only CLEAR, WARNING, CRITICAL exist as triggering statuses
Changed 'typically every second' to default ~10 seconds based on source code (hardware.c:22 run_at_least_every_seconds = 10). Individual alerts use 'every' line for frequency, not a global 1-second tick.
Native Netdata uses role-based routing without internal escalation. PagerDuty, OpsGenie and similar services provide secondary recipient routing after timeouts.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="docs/alerts/architecture/3-notification-dispatch.md">
<violation number="1" location="docs/alerts/architecture/3-notification-dispatch.md:47">
P2: The phrase "acknowledged timeouts" reverses the escalation condition. Escalation services route to secondary recipients after **unacknowledged** timeouts, so this wording is misleading.</violation>
</file>
Notification routing determines which recipients receive which alerts. Routing rules can filter by alert name, chart, host, or severity.

Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
P2: The phrase "acknowledged timeouts" reverses the escalation condition. Escalation services route to secondary recipients after unacknowledged timeouts, so this wording is misleading.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/alerts/architecture/3-notification-dispatch.md, line 47:
<file context>
@@ -44,7 +44,7 @@ Critical notification delivery should use redundant paths. Configure multiple no
Notification routing determines which recipients receive which alerts. Routing rules can filter by alert name, chart, host, or severity.
-Escalation policies route unacknowledged alerts to secondary recipients after timeout periods.
+Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
| Routing Factor | Description |
</file context>
-Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
+Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after unacknowledged timeouts.
Source code eval-evaluate.c and unit tests prove logical operators (&&, ||) have precedence 2, comparisons have precedence 3. Original docs incorrectly stated comparisons bind tighter.
Variable does not exist in source code health_variable.c. Replaced with guidance on how to aggregate dimensions using explicit calc expressions.
Change lowercase 'disable'/'disable_all' to UPPERCASE 'DISABLE'/'DISABLE ALL' per source code in health_silencers.c which validates commands case-sensitively. Also corrected authentication header from X-Auth-Token to Authorization: Bearer since the management API uses a dedicated API key from /var/lib/netdata/netdata.api.key.
Bare 'repeat: 6h' is not valid per health_config.c parser which only accepts keyword prefixes: 'warning DURATION' or 'critical DURATION'. Added syntax clarification note and corrected example.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="docs/alerts/controlling-alerts-noise/4-reducing-flapping.md">
<violation number="1" location="docs/alerts/controlling-alerts-noise/4-reducing-flapping.md:83">
P3: The note introduces `warning:`/`critical:` with colons, which conflicts with the documented `warning DURATION`/`critical DURATION` syntax and the examples. This could mislead readers into using an invalid format.</violation>
</file>
:::note

Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
P3: The note introduces warning:/critical: with colons, which conflicts with the documented warning DURATION/critical DURATION syntax and the examples. This could mislead readers into using an invalid format.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/alerts/controlling-alerts-noise/4-reducing-flapping.md, line 83:
<file context>
@@ -78,6 +78,12 @@ repeat: [off] [warning DURATION] [critical DURATION]
+:::note
+
+Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
+
+:::
</file context>
-Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
+Each duration requires its keyword prefix (`warning` or `critical`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
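The rule discussed in this thread (each duration needs a bare `warning` or `critical` keyword, with no colon) can be expressed as a small checker for doc examples. This is a sketch, not the health_config.c parser: `parse_repeat` is a hypothetical helper, the duration regex is a simplified assumption (digits with an optional `s`, `m`, `h`, or `d` suffix), and `off` is only accepted on its own.

```python
import re

# Assumption: simplified duration format, e.g. "30", "10m", "6h", "1d".
_DURATION = re.compile(r"^\d+[smhd]?$")
_KEYWORDS = ("warning", "critical")


def parse_repeat(value: str) -> dict:
    """Parse the value part of a `repeat:` line into {keyword: duration}."""
    tokens = value.split()
    if tokens == ["off"]:
        return {"off": True}
    result = {}
    i = 0
    while i < len(tokens):
        keyword = tokens[i]
        if keyword not in _KEYWORDS:
            # Catches a bare duration ("6h") and colon forms ("warning:").
            raise ValueError(f"expected 'warning' or 'critical', got {keyword!r}")
        if i + 1 >= len(tokens) or not _DURATION.match(tokens[i + 1]):
            raise ValueError(f"missing or invalid duration after {keyword!r}")
        if keyword in result:
            raise ValueError(f"duplicate keyword {keyword!r}")
        result[keyword] = tokens[i + 1]
        i += 2
    if not result:
        raise ValueError("empty repeat value")
    return result


if __name__ == "__main__":
    print(parse_repeat("warning 6h critical 1h"))
```

Running such a check over the `repeat:` examples in the docs would flag both the bare `6h` form and the colon-suffixed keywords raised in the review.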
Removed nonexistent SMTP_SERVER, SMTP_USER, SMTP_PASSWORD, FROM_ADDRESS vars from email configuration section. Actual config uses EMAIL_SENDER and sendmail command. Removed nonexistent PD_SERVICE_KEY var from PagerDuty section. Service key belongs in DEFAULT_RECIPIENT_PD value per official pagerduty/README.
Replaced reference to /var/log/netdata/error.log with /var/log/netdata/health.log - the only health-related log file. Source verification: src/libnetdata/log/nd_log-internals.c:346-384 defines log files as collector.log, daemon.log, health.log, debug.log. No error.log file exists.
Summary by cubic
Adds a complete, structured Alerts & Notifications docs set with 12 chapters, plus Cloud features and APIs. Updates the site map and fixes incorrect contexts, examples, formatting, API endpoints, and log paths across the new docs.
New Features
Refactors
Written for commit 8930b0e. Summary will update on new commits.