Alerting
Yellowbrick supports integration with Slack and Opsgenie for alerting. When unexpected issues are detected, an alert will be sent to one or both of these channels.
Alerting endpoints are configured with kubectl commands. In a subsequent release, a friendly user interface will be added to Yellowbrick Manager.
Yellowbrick currently alerts on the following exceptional conditions:
- Disk space low or exhausted
- Unexpected crashes or process exits
- Issues with background tasks
- File system consistency issues
- Quota exhaustion
- Row store volume exhaustion
For a detailed list of the alerts Yellowbrick currently includes, see Observability Alerts.
The workload manager can also generate rule-based custom alerts for conditions such as queries running too long or users consuming excessive resources; these alerts are dispatched through the same mechanism. See the workload manager rule actions for more information.
Step 1: Collect Integration Information
Alerts can be sent to Slack, Opsgenie, or both tools. To configure alerting for Slack, you need the webhook URL and channel name. For Opsgenie, you need an API key and, optionally, an API URL.
To find your Slack webhook URL, follow the instructions here. Make sure that the target Slack channel (its name beginning with a #) has been created in advance. To create an Opsgenie API key, and to determine which API URL applies, follow the instructions here.
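Optionally, you can sanity-check these values before wiring them into Yellowbrick. The following is a minimal sketch using the public Slack incoming-webhook and Opsgenie REST endpoints; the webhook URL and API key shown are the placeholder values from the examples below and must be replaced with your own.
bash
# Post a test message to the Slack incoming webhook (replace the placeholder URL).
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Yellowbrick alerting test"}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
# List alerts with the Opsgenie REST API to confirm the API key is accepted
# (replace the placeholder key; use your regional API URL if not on the global instance).
curl -H "Authorization: GenieKey 12345-abcde-67890-fghij-12345" \
  https://api.opsgenie.com/v2/alerts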
Step 2: Create a JSON Configuration File
To configure alerting, the Slack configuration, the Opsgenie configuration, or both must be specified in a JSON document and uploaded to a Kubernetes secret. To do so, create a document called alert.json as follows:
bash
echo '{
"slackChannel": <slackChannelName>,
"slackUrl": <slackURL>,
"opsGenieKey": <opsGenieKey>
"opsGenieUrl": <opsGenieURL>
}' > alert.jsonFor Slack configuration, both slackChannel and slackURL must be specified. The channel must be prefixed by a # character and must be created in advance.
The opsGenieUrl is an optional parameter, defaulting to the global https://api.opsgenie.com if omitted.
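For example, an Opsgenie-only configuration that targets Atlassian's EU instance (assumed here to be https://api.eu.opsgenie.com; confirm the correct URL for your account) would set opsGenieUrl explicitly:
bash
echo '{
"opsGenieKey": "12345-abcde-67890-fghij-12345",
"opsGenieUrl": "https://api.eu.opsgenie.com"
}' > alert.json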
A fully formed example JSON file might look something like:
bash
echo '{
"slackChannel": "#alerts",
"slackUrl": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
"opsGenieKey": "12345-abcde-67890-fghij-12345"
}' > alert.json
Step 3: Install the JSON Configuration File
The JSON document must be installed into a Kubernetes secret called yb-monitoring-secret. To do so, use the following kubectl command:
bash
kubectl create secret generic yb-monitoring-secret --from-file=state=alert.json -n monitoring
The JSON document must be well formed and the endpoints specified correctly, with egress to those endpoints permitted if running in a fully private configuration.
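As a quick check (a sketch, assuming jq is installed), you can validate the document before creating the secret and confirm afterwards that the secret holds the expected content. Note that kubectl create will fail if the secret already exists; in that case delete it first and re-run the command above.
bash
# Validate that alert.json is well-formed JSON (python3 -m json.tool also works).
jq . alert.json
# Confirm the secret exists and inspect the stored document (the data key is "state").
kubectl get secret yb-monitoring-secret -n monitoring -o jsonpath='{.data.state}' | base64 --decode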
Step 4: Generate a Test Alert
Alerts are based on Prometheus expressions or direct calls to Alertmanager. Below, we show how to generate a test alert using each method.
Direct Call to Alertmanager
The following example sets up port-forwarding to the Kubernetes Alertmanager service and then uses curl to send an alert that will expire after 5 minutes. It can take a minute before the alert is displayed in Opsgenie or Slack, but it is immediately visible when querying the Alertmanager API directly. If you do not want an automatic resolve message sent to Opsgenie and/or Slack after the 5 minutes have elapsed, set send_resolved to false.
bash
# In one terminal, execute the following to set up port-forwarding
kubectl port-forward -n monitoring svc/loki-prometheus-alertmanager 9093:9093
# In a second terminal POST the alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d "[
{
\"labels\": {
\"alert_name\": \"Yellowbrick Test Alert\",
\"alert_type\": \"observability\",
\"send_resolved\": \"true\",
\"severity\": \"CRITICAL\",
\"namespace\": \"test_namespace\"
},
\"annotations\": {
\"summary\": \"This is a summary\",
\"description\": \"This is a description\"
},
\"startsAt\": \"$(date -u +"%Y-%m-%dT%H:%M:%SZ")\",
\"endsAt\": \"$(date -u -d '+5 minutes' +"%Y-%m-%dT%H:%M:%SZ")\"
}
]"
# Optionally query the Alertmanager API directly to see the alert.
curl http://localhost:9093/api/v2/alerts
# Which will produce
[
{
"annotations": {
"description": "This is a description",
"summary": "This is a summary"
},
"endsAt": "...",
"fingerprint": "...",
"receivers": [
{
"name": "team-pager"
},
{
"name": "yb-observability-slack-receiver-nosendresolved"
}
],
"startsAt": "...",
"status": {
"inhibitedBy": [],
"silencedBy": [],
"state": "active"
},
"updatedAt": "...",
"labels": {
"alert_name": "Yellowbrick Test Alert",
"alert_type": "observability",
"namespace": "test_namespace",
"send_resolved": "false",
"severity": "CRITICAL"
}
},
...
]
If you query the Alertmanager API directly, you will also see the Watchdog alert from the Always Firing Alerts.
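If many alerts are active, the Alertmanager v2 API also accepts a filter parameter of label matchers. The following sketch (using curl -G with --data-urlencode to handle the quoting) returns just the test alert posted above:
bash
curl -G http://localhost:9093/api/v2/alerts \
  --data-urlencode 'filter=alert_name="Yellowbrick Test Alert"'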
Prometheus-Based Alert
Prometheus-based alerts are specified in config maps stored in the monitoring namespace. In order for a rule to be picked up by Prometheus (and its alerts sent to Alertmanager), the config map must have the label alert-rule set to true. The YAML file defined in the config map is mounted in the Prometheus server pod, where it is read by Prometheus automatically. The syntax and configuration for the alert follow standard Prometheus alerting rules.
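After you apply a rule config map (such as the example that follows), you can confirm that Prometheus has loaded it by querying the Prometheus rules API. This is a sketch only: the service name loki-prometheus-server and port 80 are assumptions based on the Alertmanager service name used above, so adjust them to match your deployment.
bash
# Assumed service name and port; adjust to your deployment.
kubectl port-forward -n monitoring svc/loki-prometheus-server 9090:80
# In a second terminal, look for the rule group defined in the config map.
curl -s http://localhost:9090/api/v1/rules | grep example-alerts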
The following is a trivial alert that will fire when any instance has been up for less than 600 seconds (10 minutes); suspending and resuming an instance will reset the yb_system_uptime_seconds measurement for that instance. Like the previous example, the alert will be shown in Opsgenie and/or Slack if they are configured, but can also be seen by querying the Alertmanager API directly. See Observability Metrics for other exposed metrics to base alerts on.
bash
cat <<'EOF' > alert-600-second-uptime.yaml
apiVersion: v1
data:
  alert-system-up-less-than-600-seconds.yaml: |
    groups:
    - name: example-alerts
      rules:
      - alert: "System Up Less Than 600 seconds"
        expr: yb_system_uptime_seconds < 600
        labels:
          severity: CRITICAL
          alert_type: observability
          send_resolved: true
          namespace: '{{ $labels.namespace }}'
          yellowbrick_io_instance_id: '{{ $labels.yellowbrick_io_instance_id }}'
          yellowbrick_io_instance_name: '{{ $labels.yellowbrick_io_instance_name }}'
        annotations:
          summary: "The system has been up less than 600 seconds"
          description: "The system has only been up {{ $value }} seconds."
kind: ConfigMap
metadata:
  name: alert-system-up-less-than-600-seconds
  namespace: monitoring
  labels:
    alert-rule: "true"
EOF
kubectl -n monitoring apply -f ./alert-600-second-uptime.yaml
To remove the alert, delete the config map:
bash
kubectl delete -n monitoring configmap alert-system-up-less-than-600-seconds
Diagnosing Problems
In the case of a malformed JSON document, or missing or malformed keys in the document, errors will be posted to the Yellowbrick Operator logs. To inspect the Operator logs, use the following command:
bash
kubectl logs -l app=yb-operator -n <operator_namespace> -f | grep yb-monitoring-secret
An example of an error due to malformed JSON looks something like this:
txt
2024-01-02T03:04:05Z ERROR Secret.Monitoring Invalid alerting configuration: unable to deserialize the configuration, it might not be a valid json {"namespace": "monitoring", "name": "yb-monitoring-secret", "error": "unexpected end of JSON input"}
In the case of issues sending an alert, errors will be posted to the Alertmanager logs. To inspect the Alertmanager logs, first find the pod name and then retrieve the logs as follows:
bash
kubectl get pod -l component=alertmanager -n monitoring
kubectl logs <pod_name_from_above_command> -n monitoring -f prometheus-alertmanager
Examples of errors sending alerts to Slack and Opsgenie, respectively, look something like this:
txt
ts=2024-08-22T03:43:18.613Z caller=dispatch.go:353 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default-slack-receiver/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"#alerts\": unexpected status code 404: no_team"
ts=2024-08-22T03:43:18.692Z caller=dispatch.go:353 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="team-pager/opsgenie[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 422: {\"message\":\"Key format is not valid!\",\"took\":0.001,\"requestId\":\"d36de5df-7e94-40cb-b09a-d274e14aad48\"}"
Disabling Alerting
To disable alerting completely, delete the secret. To do so, use the following command:
bash
kubectl delete secret yb-monitoring-secret -n monitoring
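To re-enable alerting later, re-create the secret from your saved alert.json, using the same command as in Step 3:
bash
kubectl create secret generic yb-monitoring-secret --from-file=state=alert.json -n monitoring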