Compute Node Alerts

Compute node alerts monitor the health and stability of individual compute node pods in the Yellowbrick cluster. These alerts help detect unexpected crashes, normal exits, and diagnostic events such as minidump generation.

Purpose

These alerts are useful for identifying:

  • Unexpected failures, such as crashes or minidumps, which often warrant immediate investigation.
  • Normal exits, which can be useful for tracking lifecycle operations (e.g. restarts or rolling upgrades), even if they don't require alerting action.

Minidump alerts provide an early signal of a crash for which diagnostic data may have been generated. In contrast, info-level alerts such as Compute Node Exited with exit code 0 are logged for observability but are not intended to trigger notification policies or automated response actions.
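The distinction between info-level and actionable alerts can be sketched as a simple routing filter. This is only an illustration: the `Alert` record, its field names, and the `should_notify` helper are hypothetical and do not reflect a Yellowbrick schema or API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record; field names are illustrative assumptions.
    name: str
    severity: str  # e.g. "INFO" or "CRITICAL"


def should_notify(alert: Alert) -> bool:
    """Return True when the alert should reach a notification policy."""
    # INFO-level alerts are kept for observability only and are not routed.
    return alert.severity.upper() != "INFO"


alerts = [
    Alert("Compute Node Exited", "INFO"),       # graceful exit, code 0
    Alert("Compute Node Exited", "CRITICAL"),   # crash
    Alert("Compute Node Minidump", "CRITICAL"),
]

# Only the two CRITICAL alerts remain after filtering.
to_notify = [a for a in alerts if should_notify(a)]
```

A filter of this kind keeps graceful exits visible in logs and dashboards while leaving notification channels reserved for crashes and minidumps.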

Alert Definitions

| Alert | Severity | Trigger After | Description | Threshold Ref |
| --- | --- | --- | --- | --- |
| Compute Node Exited | VARIES | - | A compute node has exited. Severity varies by exit code: INFO for graceful exit (code 0), CRITICAL for hardware or software crashes. Reason is included in the alert description at runtime. | |
| Compute Node Minidump | CRITICAL | - | Minidump detected on the compute node pod. | |
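
As a rough illustration of how the Compute Node Exited severity follows the exit code, the mapping could be sketched as below. The function name is hypothetical, and treating every non-zero exit code as a crash is an assumption made for this sketch; the source states only that code 0 is INFO and that hardware or software crashes are CRITICAL.

```python
def exited_alert_severity(exit_code: int) -> str:
    """Map a compute node exit code to an alert severity (illustrative only)."""
    if exit_code == 0:
        # Graceful exit, e.g. a restart or rolling upgrade.
        return "INFO"
    # Assumption: any non-zero exit code is treated as a crash.
    return "CRITICAL"


graceful = exited_alert_severity(0)
# 139 is the conventional exit status for a process killed by SIGSEGV (128 + 11).
crashed = exited_alert_severity(139)
```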