Alerts & Notifications documentation #21540

Draft
kanelatechnical wants to merge 273 commits into netdata:master from kanelatechnical:pr-21333

@kanelatechnical kanelatechnical commented Jan 11, 2026

Summary by cubic

Adds a complete, structured Alerts & Notifications docs set with 12 chapters, plus Cloud features and APIs. Updates the site map and fixes incorrect contexts, examples, formatting, API endpoints, and log paths across the new docs.

  • New Features

    • Full configuration reference; creation workflows (files and Cloud UI).
    • Control/noise reduction, notification models, recipient routing, and troubleshooting.
    • Examples and patterns; best practices; architecture.
    • APIs and Cloud features (events feed, deduplication, rooms, silencing).
  • Refactors

    • Restructured into chapter subdirectories with index pages and cross-links; removed duplicates.
    • Standardized formatting and note block rendering; corrected alert syntax and contexts; replaced fabricated alerts with real stock alerts and aligned the examples.
    • Fixed API endpoints and parameters (alarms, alarm_log, alarm_variables, manage health); standardized delay/repeat syntax; corrected log paths.
    • Updated docs/.map/map.csv and integration metadata with new pages and navigation; updated links across docs and code to point to the new alerts docs.

Written for commit 8930b0e. Summary will update on new commits.

kanelatechnical and others added 18 commits January 12, 2026 19:35
Addressing cubic's comment
…via-config-files.md

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Chapters added with proper subdirectory structure:
- 4: controlling-alerts-noise (5 files)
- 5: receiving-notifications (5 files)
- 6: alert-examples (6 files)
- 7: troubleshooting-alerts (6 files)
- 8: advanced-techniques (6 files)
- 9: apis-alerts-events (6 files)
- 10: cloud-alert-features (5 files)
- 11: built-in-alerts.md
- 12: best-practices.md
- 13: architecture.md

Update map.csv with new entries.
…practices)

- Add numbered subsections (11.1.x, 11.2.x, 12.1.x, etc.)
- Add :::note boxes for important context
- Add tables for state transitions and comparison data
- Add Related Sections with internal doc links
- Match styling pattern from earlier chapters
- Add numbered subsections (11.1.1, 11.2.1, etc.)
- Add :::note boxes with collector requirements
- Add Related Sections with internal doc links
- Match styling pattern from earlier chapters
- Add numbered subsections (11.3.1, 11.3.2, etc.)
- Add :::note box with collector requirements
- Add Related Sections with internal doc links
- Match styling pattern from other built-in-alerts files
Fixes context names in all built-in-alerts documentation files by
verifying against actual source code in src/health/health.d/ and
collector integration docs:

- system-resource-alerts.md: Fixed encoding error (disk利用率 -> disk.util)
- container-alerts.md: Fixed Docker contexts (docker.container_state,
  docker.container_health_status) and Kubernetes contexts (k8s_state.*,
  k8s.cgroup.*)
- network-alerts.md: Fixed contexts (ping.host_rtt, ping.host_packet_loss,
  portcheck.status, x509check.*, dns_query.*, httpcheck.*)
- hardware-alerts.md: Fixed contexts (adaptecraid.*, smartctl.*, sensors.*,
  apcupsd.*, ipmi.sensor_state)
- application-alerts.md: Fixed database and web server contexts

All contexts verified against actual collector documentation and stock
health configuration files.
ilyam8 changed the title from "Alerts_" to "Alerts & Notifications documentation" on Jan 12, 2026
ilyam8 and others added 7 commits January 12, 2026 21:30
…ation

- alert-types-alarm-vs-template.md: Fix example syntax (disk space -> disk.space,
  dimensions: -> chart labels:, remove non-existent from/to options)
- disabling-alerts.md: Fix comparison operators (= -> ==)
- core-system-alerts.md: Fix network errors context (net.net -> net.errors)
…nced alerts

- custom-actions.md: Fix non-existent health.service context (use systemd.service_unit_state)
  Fix exec documentation - arguments are positional ($1, $2, ...), not environment variables
  Document all 34 positional arguments with their meanings
- hysteresis.md: Fix status constants missing $ prefix (WARNING -> $WARNING)
Add required blank lines after opening ::: markers and before closing ::: for proper Markdown rendering.
- Fix disk.chart syntax: change disk.space./ to disk_space./
  (chart IDs use underscore, contexts use dot)
- Fix alarm vs template: use template: for context-based rules,
  alarm: for chart-specific rules
- Add index pages for built-in-alerts, best-practices, and architecture

Ref: netdata#21333
- 1-system-resource: fix invalid disk context (was disk, now disk.backlog)
- 2-container: remove non-existent k8s_state contexts (k8s_state.pod_condition,
  k8s_state.pod_container_restarts, k8s_state.node_condition)
- 3-application: remove non-existent mysql.innodb contexts
- 4-network: remove non-existent dns_query.query_time, fix httpcheck.response_time
  to use httpcheck.status dimension
- 5-hardware: remove non-existent smartctl and sensors contexts, remove
  ups_input_voltage context

All contexts verified against source code in src/health/health.d/
- Remove undocumented 'health check' terminology
- Add RAISED status to documented statuses (RRDCALC_STATUS_RAISED = 2 in source)
- Updated status counts from 6 to 7
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/alerts/understanding-alerts/1-what-is-a-netdata-alert.md">

<violation number="1" location="docs/alerts/understanding-alerts/1-what-is-a-netdata-alert.md:102">
P3: This change makes the chapter inconsistent with its README, which still says alerts have only five statuses. Update the README (and any other summary text) to align with the new seven-status definition, or adjust the list here if seven is not intended.</violation>
</file>

## Key Takeaways

- A Netdata alert is a **rule that monitors metrics from charts** and assigns a status
- Alerts have **seven possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `RAISED`, `WARNING`, `CRITICAL`, `REMOVED`
@cubic-dev-ai cubic-dev-ai bot Feb 3, 2026

P3: This change makes the chapter inconsistent with its README, which still says alerts have only five statuses. Update the README (and any other summary text) to align with the new seven-status definition, or adjust the list here if seven is not intended.

<file context>
@@ -98,7 +99,7 @@ For a given alert definition applied to a given chart instance, there is one **a
 
 - A Netdata alert is a **rule that monitors metrics from charts** and assigns a status
-- Alerts have **six possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `WARNING`, `CRITICAL`, `REMOVED`
+- Alerts have **seven possible statuses**: `UNINITIALIZED`, `UNDEFINED`, `CLEAR`, `RAISED`, `WARNING`, `CRITICAL`, `REMOVED`
 - Status transitions become **alert events** visible locally and in Netdata Cloud
 - Alerts can be **chart-specific (alarms)** or **context-based (templates)**
</file context>

- Remove false YAML-format claims from alert type documentation
  (no YAML-based unified alert format exists in source code)
- Fix incorrect loading order in alert precedence documentation
  (custom config loads before stock, not vice versa)
- Clarify file-shadowing behavior (entire stock file skipped when
  same-named user file exists, not per-alert override)

Verified against src/libnetdata/paths/paths.c (config loading logic),
src/health/health_prototypes.c (dyncfg override behavior), and
RRDCALC_STATUS enum in src/health/rrdcalc.h.
RRDCALC_STATUS enum defines seven status values:
REMOVED(-2), UNDEFINED(-1), UNINITIALIZED(0), CLEAR(1),
RAISED(2), WARNING(3), CRITICAL(4)

Previously documented only five, omitting REMOVED and UNINITIALIZED.
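
The seven values listed in this commit message can be mirrored in a small Python enum for scripts that consume alert statuses. This is an illustrative sketch, not code from the Netdata source: the class name `AlertStatus` and the helper `is_warning_or_worse` are hypothetical, and only the status names and numbers come from the message above.

```python
from enum import IntEnum


class AlertStatus(IntEnum):
    """Hypothetical mirror of the RRDCALC_STATUS values quoted above."""

    REMOVED = -2
    UNDEFINED = -1
    UNINITIALIZED = 0
    CLEAR = 1
    RAISED = 2
    WARNING = 3
    CRITICAL = 4


def is_warning_or_worse(status: AlertStatus) -> bool:
    """Illustrative helper: true for WARNING and CRITICAL only."""
    return status >= AlertStatus.WARNING


if __name__ == "__main__":
    # Print every status with its numeric value, in definition order.
    for status in AlertStatus:
        print(f"{status.name:<14} {status.value:>2}  warning_or_worse={is_warning_or_worse(status)}")
```

Because the values are ordered, a numeric comparison such as `status >= AlertStatus.WARNING` is enough to separate the two highest severities from the rest.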
Removed assertions that lacked source code backing:
- "Both will eventually be consolidated into single unified alert type"
- "The alarm syntax is essentially a subset of template functionality"
- "For new alert definitions, use template exclusively"

Both alarm and template syntax are actively maintained with equal
support in source code. Future roadmap claims require source evidence.
- Replace non-functional `families:` directive with working `chart labels:` directive
  (`families` is recognized but silently ignored by health engine per health_config.c)
- Fix `host:` to correct `host labels:` directive (with space)
- Add usage examples to clarify filtering syntax

Source verified via src/health/reference.md showing correct directives:
- `host labels: room = server`
- `chart labels: mount_point=/mnt/disk1`
…g guide

- Replace `families:` with `chart labels:` for targeting specific instances
  (`families:` is recognized but never used per health_config.c:842-846)
- Fix inverted loading order: custom first, stock second (not vice versa)
  Verified via src/libnetdata/paths/paths.c:262 and :306
- Clarify file shadowing behavior (entire file skipped, not per-alert override)
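
The loading order and file-shadowing rule described above can be sketched as follows. This is a simplified model under stated assumptions, not the logic in `paths.c`: the function name `resolve_health_files`, the `*.conf` filter, and the directory layout are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def resolve_health_files(user_dir: Path, stock_dir: Path) -> list[Path]:
    """Return the config files to load, user files first.

    A stock file is skipped entirely when a file with the same name
    exists in the user directory (file-level shadowing, not a
    per-alert override).
    """
    user_files = sorted(user_dir.glob("*.conf"))
    shadowed = {path.name for path in user_files}
    stock_files = [
        path for path in sorted(stock_dir.glob("*.conf"))
        if path.name not in shadowed
    ]
    return user_files + stock_files


if __name__ == "__main__":
    # Build a throwaway layout: cpu.conf exists in both directories.
    with tempfile.TemporaryDirectory() as tmp:
        user, stock = Path(tmp, "user"), Path(tmp, "stock")
        user.mkdir()
        stock.mkdir()
        (user / "cpu.conf").write_text("# custom\n")
        (stock / "cpu.conf").write_text("# stock\n")
        (stock / "disks.conf").write_text("# stock\n")
        for path in resolve_health_files(user, stock):
            print(path.relative_to(tmp))
```

In this model, every alert in the stock `cpu.conf` disappears once a user `cpu.conf` exists, which is why a user file that shadows a stock file needs to carry every alert that should stay active.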
The ok: line is not a valid Netdata health configuration directive.
Source code confirms only warn: and crit: are recognized expression lines.

This corrects inaccurate documentation that could mislead users.
The ok: line is not a valid Netdata health configuration directive.
Only calc:, warn:, and crit: are valid expression lines.
The enabled: no directive is only valid globally in netdata.conf [health].
It does NOT work at the individual alert/template level.

Correct methods for disabling specific alerts: pattern matching in
netdata.conf (enabled alarms = !pattern), or leaving out warn:/crit:
so the alert never triggers.
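
To illustrate how a pattern list such as the `enabled alarms` value could select alerts, here is a minimal sketch. It assumes the list is space-separated, that `*` is the only wildcard, that a leading `!` marks a negative pattern, that the first matching pattern wins, and that a name matching nothing is treated as disabled. These semantics and the function name `alert_enabled` are assumptions for illustration, not taken from the Netdata source.

```python
import re


def _matches(name: str, glob: str) -> bool:
    """Match `name` against `glob`, where '*' is the only wildcard."""
    regex = ".*".join(re.escape(part) for part in glob.split("*"))
    return re.fullmatch(regex, name) is not None


def alert_enabled(name: str, pattern_list: str) -> bool:
    """Evaluate a space-separated pattern list; the first match wins."""
    for pattern in pattern_list.split():
        negative = pattern.startswith("!")
        glob = pattern[1:] if negative else pattern
        if _matches(name, glob):
            return not negative
    # Assumption: names that match no pattern are not enabled.
    return False


if __name__ == "__main__":
    patterns = "!mysql_* *"
    for alert in ("mysql_10s_slow_queries", "disk_space_usage"):
        print(alert, alert_enabled(alert, patterns))
```

Under these assumed semantics, order matters: the negative pattern has to come before the catch-all `*`, otherwise the catch-all matches first and the alert stays enabled.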
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/alerts/controlling-alerts-noise/1-disabling-alerts.md">

<violation number="1" location="docs/alerts/controlling-alerts-noise/1-disabling-alerts.md:35">
P2: The “disable a specific alert” example no longer disables evaluation; it defines `on/lookup/calc` and only omits thresholds, which is closer to “keep alert loaded but never trigger.” This conflicts with the section’s definition of disabling (stop evaluation entirely) and can mislead readers. Restore the explicit `enabled: no` override (or otherwise remove evaluation inputs) to show a true disable.</violation>
</file>

# Disable stock alert that doesn't apply to our environment
# Use pattern matching in netdata.conf: enabled alarms = !mysql_*

# Or create an override that provides no evaluation logic:
@cubic-dev-ai cubic-dev-ai bot Feb 3, 2026

P2: The “disable a specific alert” example no longer disables evaluation; it defines on/lookup/calc and only omits thresholds, which is closer to “keep alert loaded but never trigger.” This conflicts with the section’s definition of disabling (stop evaluation entirely) and can mislead readers. Restore the explicit enabled: no override (or otherwise remove evaluation inputs) to show a true disable.

<file context>
@@ -30,11 +30,14 @@ Add the alert you want to disable:
+# Use pattern matching in netdata.conf: enabled alarms = !mysql_*
 
-# Disable by setting enabled to no
+# Or create an override that provides no evaluation logic:
 template: mysql_10s_slow_queries
-   enabled: no
</file context>

…ables

Section 8.4 documents positional arguments ($1-$33), not env vars.
Clarifies that exec: line triggers custom scripts with positional params.
The path /var/log/netdata/health.log does not exist.
Health-related logs go to /var/log/netdata/error.log.
Documentation claimed existence of $this(-1h) and $this(-5m) syntax for
accessing past values of $this. This syntax does not exist in the codebase
and was never implemented in the EVAL lexer/parser.

Removed the misleading examples for:
- Memory Leak Detection ($this - $this(-1h))
- Network Traffic Rate of Change (abs($this - $this(-5m)))

Users who need time-based comparison should use the dual-template pattern
shown in the Disk Days Remaining example, or explicit time-window lookups.
The doc incorrectly suggested checking health.log for notification failures.
Notification delivery errors are logged to NDLS_DAEMON facility (daemon.log),
not NDLS_HEALTH (health.log). Updated to reference daemon.log or error.log.

Also removed hardcoded path in favor of ${NETDATA_LOG_DIR} variable placeholder.
- Removed fake 'ok:' configuration line from hysteresis example
  * 'ok:' is not a valid Netdata health directive (only 'warn' and 'crit' exist)
  * Alarms automatically clear when warn/crit conditions are no longer true

- Fixed JSON quoting in PagerDuty example: escaped double quotes in shell
  double-quoted strings are problematic - changed to use proper nesting
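
The point about alerts clearing on their own, without an `ok:` line, can be modeled with a small state function. This sketch assumes a hysteresis rule in which the warning threshold depends on the current status (raise above 85, clear at or below 75). The thresholds and the names `next_status` and `run` are hypothetical, and the numeric status values are the ones quoted elsewhere in this PR.

```python
CLEAR = 1    # numeric value quoted from the status enum in this PR
WARNING = 3  # numeric value quoted from the status enum in this PR


def next_status(value: float, current: int,
                raise_at: float = 85.0, clear_at: float = 75.0) -> int:
    """Return the new status for one evaluation.

    While the alert is already in WARNING, the lower threshold applies,
    so the alert clears only after the value drops to `clear_at` or below.
    No separate "ok" condition is needed: when the warning condition is
    false, the status falls back to CLEAR.
    """
    threshold = clear_at if current >= WARNING else raise_at
    return WARNING if value > threshold else CLEAR


def run(values: list[float]) -> list[int]:
    """Feed a series of values through the rule, starting from CLEAR."""
    status = CLEAR
    history = []
    for value in values:
        status = next_status(value, status)
        history.append(status)
    return history


if __name__ == "__main__":
    print(run([80, 86, 80, 74]))
```

In the sample series, the value 80 leaves the alert CLEAR the first time and keeps it in WARNING the second time, which is the gap between the two thresholds doing its job.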
Changed INFO to CLEAR in severity table - INFO is not an alert status
in Netdata, only CLEAR, WARNING, CRITICAL exist as triggering statuses
Changed 'typically every second' to default ~10 seconds based on source code
(hardware.c:22 run_at_least_every_seconds = 10). Individual alerts use 'every'
line for frequency, not a global 1-second tick.
Native Netdata uses role-based routing without internal escalation. PagerDuty,
OpsGenie and similar services provide secondary recipient routing after timeouts.
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/alerts/architecture/3-notification-dispatch.md">

<violation number="1" location="docs/alerts/architecture/3-notification-dispatch.md:47">
P2: The phrase "acknowledged timeouts" reverses the escalation condition. Escalation services route to secondary recipients after **unacknowledged** timeouts, so this wording is misleading.</violation>
</file>


Notification routing determines which recipients receive which alerts. Routing rules can filter by alert name, chart, host, or severity.

Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
@cubic-dev-ai cubic-dev-ai bot Feb 3, 2026

P2: The phrase "acknowledged timeouts" reverses the escalation condition. Escalation services route to secondary recipients after unacknowledged timeouts, so this wording is misleading.

<file context>
@@ -44,7 +44,7 @@ Critical notification delivery should use redundant paths. Configure multiple no
 Notification routing determines which recipients receive which alerts. Routing rules can filter by alert name, chart, host, or severity.
 
-Escalation policies route unacknowledged alerts to secondary recipients after timeout periods.
+Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
 
 | Routing Factor | Description |
</file context>
Suggested change
-Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after acknowledged timeouts.
+Native Netdata routing uses roles and filters but does not implement escalation timelines. External escalation services like PagerDuty or OpsGenie provide secondary recipient routing after unacknowledged timeouts.

Source code eval-evaluate.c and unit tests prove logical operators
(&&, ||) have precedence 2, comparisons have precedence 3.
Original docs incorrectly stated comparisons bind tighter.
Variable does not exist in source code health_variable.c.
Replaced with guidance on how to aggregate dimensions using explicit calc expressions.
Change lowercase 'disable'/'disable_all' to UPPERCASE 'DISABLE'/'DISABLE ALL'
per source code in health_silencers.c which validates commands case-sensitively.

Also corrected authentication header from X-Auth-Token to Authorization: Bearer
since the management API uses a dedicated API key from /var/lib/netdata/netdata.api.key.
Bare 'repeat: 6h' is not valid per health_config.c parser which only accepts
keyword prefixes: 'warning DURATION' or 'critical DURATION'.
Added syntax clarification note and corrected example.
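
The `repeat` rule described above (each duration needs a `warning` or `critical` keyword in front of it) can be checked with a small validator. This is a sketch under stated assumptions, not the parser in `health_config.c`: the function name `parse_repeat` is hypothetical, the duration format (digits plus an optional `s`, `m`, `h`, or `d` unit) is assumed, and `off` is only handled when it appears alone.

```python
import re

# Assumed duration format: digits followed by an optional unit letter.
DURATION = re.compile(r"\d+[smhd]?")
KEYWORDS = ("warning", "critical")


def parse_repeat(value: str) -> dict[str, str]:
    """Parse the value of a `repeat:` line into {keyword: duration}.

    Returns an empty dict for `off`. Raises ValueError for anything
    that is not a sequence of `warning DURATION` / `critical DURATION`.
    """
    tokens = value.split()
    if not tokens:
        raise ValueError("empty repeat value")
    if tokens == ["off"]:
        return {}
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected keyword/duration pairs, got: {value!r}")
    result: dict[str, str] = {}
    for keyword, duration in zip(tokens[0::2], tokens[1::2]):
        if keyword not in KEYWORDS:
            raise ValueError(f"unknown keyword: {keyword!r}")
        if keyword in result:
            raise ValueError(f"duplicate keyword: {keyword!r}")
        if not DURATION.fullmatch(duration):
            raise ValueError(f"invalid duration: {duration!r}")
        result[keyword] = duration
    return result


if __name__ == "__main__":
    for candidate in ("warning 6h", "warning 1h critical 10m", "6h", "warning: 6h"):
        try:
            print(candidate, "->", parse_repeat(candidate))
        except ValueError as error:
            print(candidate, "-> rejected:", error)
```

Both the bare `6h` and the colon form `warning: 6h` are rejected by this validator, which matches the two mistakes the commit and the later review comment call out.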
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/alerts/controlling-alerts-noise/4-reducing-flapping.md">

<violation number="1" location="docs/alerts/controlling-alerts-noise/4-reducing-flapping.md:83">
P3: The note introduces `warning:`/`critical:` with colons, which conflicts with the documented `warning DURATION`/`critical DURATION` syntax and the examples. This could mislead readers into using an invalid format.</violation>
</file>


:::note

Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
@cubic-dev-ai cubic-dev-ai bot Feb 4, 2026

P3: The note introduces warning:/critical: with colons, which conflicts with the documented warning DURATION/critical DURATION syntax and the examples. This could mislead readers into using an invalid format.

<file context>
@@ -78,6 +78,12 @@ repeat: [off] [warning DURATION] [critical DURATION]
 
+:::note
+
+Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
+
+:::
</file context>
Suggested change
-Each duration requires its keyword prefix (`warning:` or `critical:`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.
+Each duration requires its keyword prefix (`warning` or `critical`). A lone duration like `repeat: 6h` is not valid—specify `repeat: warning 6h` and/or `repeat: critical 6h` explicitly.

Removed nonexistent SMTP_SERVER, SMTP_USER, SMTP_PASSWORD, FROM_ADDRESS vars from email
configuration section. Actual config uses EMAIL_SENDER and sendmail command.

Removed nonexistent PD_SERVICE_KEY var from PagerDuty section. Service key belongs
in DEFAULT_RECIPIENT_PD value per official pagerduty/README.
Replaced reference to /var/log/netdata/error.log with
/var/log/netdata/health.log - the only health-related log file.

Source verification: src/libnetdata/log/nd_log-internals.c:346-384
defines log files as collector.log, daemon.log, health.log, debug.log.
No error.log file exists.
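
A troubleshooting script could guard against the nonexistent `error.log` by resolving log paths from the four file names listed above. This is a minimal sketch: the helper name `netdata_log_path` is hypothetical, and both the `NETDATA_LOG_DIR` environment lookup and the `/var/log/netdata` fallback are assumptions based on the paths mentioned in this PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# File names taken from the commit message above.
KNOWN_LOG_FILES = ("collector.log", "daemon.log", "health.log", "debug.log")


def netdata_log_path(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the path of a known log file, or raise for unknown names."""
    if name not in KNOWN_LOG_FILES:
        raise ValueError(f"{name!r} is not one of {KNOWN_LOG_FILES}")
    env = os.environ if env is None else env
    # Assumption: NETDATA_LOG_DIR overrides the default directory when set.
    log_dir = env.get("NETDATA_LOG_DIR", "/var/log/netdata")
    return Path(log_dir) / name


if __name__ == "__main__":
    print(netdata_log_path("health.log"))
    try:
        netdata_log_path("error.log")
    except ValueError as error:
        print("rejected:", error)
```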

5 participants