Appearance
Cluster Alerts
Cluster alerts notify about the operational health of the Yellowbrick cluster as a whole. These alerts help detect and respond to states where the system is either partially or entirely degraded, or explicitly suspended.
Purpose
These alerts are triggered based on cluster-level availability and control plane signals. They serve to identify when:
- The cluster is offline, such as when too many compute nodes are unreachable.
- The cluster is degraded, with some compute nodes unavailable but not enough to cause a full outage.
- The cluster is in a suspended state, which is expected during administrative operations.
Values such as [missing_compute_nodes] and [reason] are dynamic and inserted at runtime to explain the nature of the issue.
Alert Definitions
| Alert | Severity | Trigger After | Description | Threshold Ref |
|---|---|---|---|---|
Cluster State | CRITICAL | 2m | The cluster is offline: [reason]. | |
Cluster State | CRITICAL | 2m | The cluster is offline: too many compute nodes ([missing_compute_nodes]) are unavailable. | |
Cluster State | MAJOR | 5m | The cluster is degraded: [reason]. | |
Cluster State | MAJOR | 5m | The cluster is degraded: missing compute nodes ([missing_compute_nodes]). | |
Cluster State | INFO | 0m | The cluster was suspended by an administrator. |