Appearance
Compute Node Alerts
Compute node alerts monitor the health and stability of individual compute node pods in the Yellowbrick cluster. These alerts help detect unexpected crashes, normal exits, and diagnostic events such as minidump generation.
Purpose
These alerts are useful for identifying:
- Unexpected failures, such as crashes or minidumps, which often warrant immediate investigation.
- Normal exits, which can be useful for tracking lifecycle operations (e.g. restarts or rolling upgrades), even if they don't require alerting action.
Minidump alerts provide early signals of crashes that may generate diagnostic data. In contrast, info-level alerts like Compute Node Exit with exit code 0 are logged for observability but not intended to trigger notification policies or automated response actions.
Alert Definitions
| Alert | Severity | Trigger After | Description | Threshold Ref |
|---|---|---|---|---|
Compute Node Exited | VARIES | - | A compute node has exited. Severity varies by exit code: INFO for graceful exit (code 0), CRITICAL for hardware or software crashes. Reason is included in the alert description at runtime. | |
Compute Node Minidump | CRITICAL | - | Minidump detected on the compute node pod. |