Skip to content

Cluster Alerts

Cluster alerts notify about the operational health of the Yellowbrick cluster as a whole. These alerts help detect and respond to states where the system is either partially or entirely degraded, or explicitly suspended.

Purpose

These alerts are triggered based on cluster-level availability and control plane signals. They serve to identify when:

  • The cluster is offline, such as when too many compute nodes are unreachable.
  • The cluster is degraded, with some compute nodes unavailable but not enough to cause a full outage.
  • The cluster is in a suspended state, which is expected during administrative operations.

Values such as [missing_compute_nodes] and [reason] are dynamic and inserted at runtime to explain the nature of the issue.

Alert Definitions

AlertSeverityTrigger AfterDescriptionThreshold Ref
Cluster StateCRITICAL2mThe cluster is offline: [reason].
Cluster StateCRITICAL2mThe cluster is offline: too many compute nodes ([missing_compute_nodes]) are unavailable.
Cluster StateMAJOR5mThe cluster is degraded: [reason].
Cluster StateMAJOR5mThe cluster is degraded: missing compute nodes ([missing_compute_nodes]).
Cluster StateINFO0mThe cluster was suspended by an administrator.