Alerting
Yellowbrick supports integration with Slack and Opsgenie for alerting. When unexpected issues are detected, an alert will be sent to one or both of these channels.
Alerting endpoints are configured with kubectl commands. In a subsequent release, a friendly user interface will be added to Yellowbrick Manager.
Yellowbrick currently alerts on the following exceptional conditions:
- Disk space low or exhausted
- Unexpected crashes or process exits
- Issues with background tasks
- File system consistency issues
- Quota exhaustion
- Row store volume exhaustion
For a detailed list of the alerts Yellowbrick currently includes, see Observability Alerts.
The workload manager can also generate rule-based custom alerts for conditions such as queries running too long or users consuming excessive resources; these alerts are dispatched through the same mechanism. See the workload manager rule actions for more information.
Step 1: Collect Integration Information
Alerts can be sent to Slack, Opsgenie, or both tools. To configure alerting for Slack, you need the webhook URL and channel name. For Opsgenie, you need an API key and, optionally, an API URL.
To find your Slack webhook URL, follow the instructions here. Make sure that the target Slack channel (its name beginning with a #) has been created in advance. To create an Opsgenie API key, and to determine which API URL applies, follow the instructions here.
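Optionally, you can sanity-check these values before wiring them into Yellowbrick. The following is a minimal sketch using the public Slack incoming-webhook and Opsgenie REST endpoints; the webhook URL and API key shown are the placeholder values from the examples below and must be replaced with your own.
bash
# Post a test message to the Slack incoming webhook (replace the placeholder URL).
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Yellowbrick alerting test"}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
# List alerts with the Opsgenie REST API to confirm the API key is accepted
# (replace the placeholder key; use your regional API URL if not on the global instance).
curl -H "Authorization: GenieKey 12345-abcde-67890-fghij-12345" \
  https://api.opsgenie.com/v2/alerts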
Step 2: Create a JSON Configuration File
To configure alerting, the Slack configuration, the Opsgenie configuration, or both must be specified in a JSON document and uploaded to a Kubernetes secret. To do so, create a document called alert.json as follows:
bash
echo '{
"slackChannel": <slackChannelName>,
"slackUrl": <slackURL>,
"opsGenieKey": <opsGenieKey>
"opsGenieUrl": <opsGenieURL>
}' > alert.jsonFor Slack configuration, both slackChannel and slackURL must be specified. The channel must be prefixed by a # character and must be created in advance.
The opsGenieUrl is an optional parameter, defaulting to the global https://api.opsgenie.com if omitted.
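For example, an Opsgenie-only configuration that targets Atlassian's EU instance (assumed here to be https://api.eu.opsgenie.com; confirm the correct URL for your account) would set opsGenieUrl explicitly:
bash
echo '{
"opsGenieKey": "12345-abcde-67890-fghij-12345",
"opsGenieUrl": "https://api.eu.opsgenie.com"
}' > alert.json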
A fully formed example JSON file might look something like:
bash
echo '{
"slackChannel": "#alerts",
"slackUrl": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
"opsGenieKey": "12345-abcde-67890-fghij-12345"
}' > alert.json
Step 3: Install the JSON Configuration File
The JSON document must be installed into a Kubernetes secret called yb-monitoring-secret. To do so, use the following kubectl command:
bash
kubectl create secret generic yb-monitoring-secret --from-file=state=alert.json -n monitoring
The JSON document must be well formed and the endpoints specified correctly, with egress to those endpoints permitted if running in a fully private configuration.
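As a quick check (a sketch, assuming jq is installed), you can validate the document before creating the secret and confirm afterwards that the secret holds the expected content. Note that kubectl create will fail if the secret already exists; in that case delete it first and re-run the command above.
bash
# Validate that alert.json is well-formed JSON (python3 -m json.tool also works).
jq . alert.json
# Confirm the secret exists and inspect the stored document (the data key is "state").
kubectl get secret yb-monitoring-secret -n monitoring -o jsonpath='{.data.state}' | base64 --decode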
Step 4: Generate a Test Alert
Alerts are based on Prometheus expressions or direct calls to Alertmanager. Below, we show how to generate a test alert using each method.
Direct Call to Alertmanager
The following example sets up port-forwarding to the Kubernetes Alertmanager service and then uses curl to send an alert that will expire after 5 minutes. It can take a minute before the alert is displayed in Opsgenie or Slack, but it is immediately visible when querying the Alertmanager API directly. If you do not want an automatic resolve message sent to Opsgenie and/or Slack after the 5 minutes have elapsed, set send_resolved to false.
bash
# In one terminal, execute the following to set up port-forwarding
kubectl port-forward -n monitoring svc/loki-prometheus-alertmanager 9093:9093
# In a second terminal POST the alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d "[
{
\"labels\": {
\"alert_name\": \"Yellowbrick Test Alert\",
\"alert_type\": \"observability\",
\"send_resolved\": \"true\",
\"severity\": \"CRITICAL\",
\"namespace\": \"test_namespace\"
},
\"annotations\": {
\"summary\": \"This is a summary\",
\"description\": \"This is a description\"
},
\"startsAt\": \"$(date -u +"%Y-%m-%dT%H:%M:%SZ")\",
\"endsAt\": \"$(date -u -d '+5 minutes' +"%Y-%m-%dT%H:%M:%SZ")\"
}
]"
# Optionally query the Alertmanager API directly to see the alert.
curl http://localhost:9093/api/v2/alerts
# Which will produce
[
{
"annotations": {
"description": "This is a description",
"summary": "This is a summary"
},
"endsAt": "...",
"fingerprint": "...",
"receivers": [
{
"name": "team-pager"
},
{
"name": "yb-observability-slack-receiver-nosendresolved"
}
],
"startsAt": "...",
"status": {
"inhibitedBy": [],
"silencedBy": [],
"state": "active"
},
"updatedAt": "...",
"labels": {
"alert_name": "Yellowbrick Test Alert",
"alert_type": "observability",
"namespace": "test_namespace",
"send_resolved": "false",
"severity": "CRITICAL"
}
},
...
]
If you query the Alertmanager API directly, you will also see the Watchdog alert from the Always Firing Alerts.
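If many alerts are active, the Alertmanager v2 API also accepts a filter parameter of label matchers. The following sketch (using curl -G with --data-urlencode to handle the quoting) returns just the test alert posted above:
bash
curl -G http://localhost:9093/api/v2/alerts \
  --data-urlencode 'filter=alert_name="Yellowbrick Test Alert"'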
Prometheus-Based Alert
Prometheus-based alerts are specified in config maps stored in the monitoring namespace. In order for a rule to be picked up by Prometheus (and its alerts sent to Alertmanager), the config map must have the label alert-rule set to true. The YAML file defined in the config map is mounted in the Prometheus server pod, where it is read by Prometheus automatically. The syntax and configuration for the alert follow standard Prometheus alerting rules.
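After you apply a rule config map (such as the example that follows), you can confirm that Prometheus has loaded it by querying the Prometheus rules API. This is a sketch only: the service name loki-prometheus-server and port 80 are assumptions based on the Alertmanager service name used above, so adjust them to match your deployment.
bash
# Assumed service name and port; adjust to your deployment.
kubectl port-forward -n monitoring svc/loki-prometheus-server 9090:80
# In a second terminal, look for the rule group defined in the config map.
curl -s http://localhost:9090/api/v1/rules | grep example-alerts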
The following is a trivial alert that will fire when any instance has been up for less than 600 seconds (10 minutes); suspending and resuming an instance will reset the yb_system_uptime_seconds measurement for that instance. Like the previous example, the alert will be shown in Opsgenie and/or Slack if they are configured, but can also be seen by querying the Alertmanager API directly. See Observability Metrics for other exposed metrics to base alerts on.
bash
cat <<'EOF' > alert-600-second-uptime.yaml
apiVersion: v1
data:
  alert-system-up-less-than-600-seconds.yaml: |
    groups:
    - name: example-alerts
      rules:
      - alert: "System Up Less Than 600 seconds"
        expr: yb_system_uptime_seconds < 600
        labels:
          severity: CRITICAL
          alert_type: observability
          send_resolved: true
          namespace: '{{ $labels.namespace }}'
          yellowbrick_io_instance_id: '{{ $labels.yellowbrick_io_instance_id }}'
          yellowbrick_io_instance_name: '{{ $labels.yellowbrick_io_instance_name }}'
        annotations:
          summary: "The system has been up less than 600 seconds"
          description: "The system has only been up {{ $value }} seconds."
kind: ConfigMap
metadata:
  name: alert-system-up-less-than-600-seconds
  namespace: monitoring
  labels:
    alert-rule: "true"
EOF
kubectl -n monitoring apply -f ./alert-600-second-uptime.yaml
To remove the alert, delete the config map:
bash
kubectl delete -n monitoring configmap alert-system-up-less-than-600-seconds
Diagnosing Problems
In the case of a malformed JSON document, or missing or malformed keys in the document, errors will be posted to the Yellowbrick Operator logs. To inspect the Operator logs, use the following command:
bash
kubectl logs -l app=yb-operator -n <operator_namespace> -f | grep yb-monitoring-secret
An example of an error due to malformed JSON looks something like this:
txt
2024-01-02T03:04:05Z ERROR Secret.Monitoring Invalid alerting configuration: unable to deserialize the configuration, it might not be a valid json {"namespace": "monitoring", "name": "yb-monitoring-secret", "error": "unexpected end of JSON input"}
In the case of issues sending an alert, errors will be posted to the Alertmanager logs. To inspect the Alertmanager logs, first find the pod name and then retrieve the logs as follows:
bash
kubectl get pod -l component=alertmanager -n monitoring
kubectl logs <pod_name_from_above_command> -n monitoring -f prometheus-alertmanager
Examples of errors sending alerts to Slack and Opsgenie, respectively, look something like this:
txt
ts=2024-08-22T03:43:18.613Z caller=dispatch.go:353 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="default-slack-receiver/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"#alerts\": unexpected status code 404: no_team"
ts=2024-08-22T03:43:18.692Z caller=dispatch.go:353 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="team-pager/opsgenie[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 422: {\"message\":\"Key format is not valid!\",\"took\":0.001,\"requestId\":\"d36de5df-7e94-40cb-b09a-d274e14aad48\"}"
Disabling Alerting
To disable alerting completely, delete the secret. To do so, use the following command:
bash
kubectl delete secret yb-monitoring-secret -n monitoring
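To re-enable alerting later, re-create the secret from your saved alert.json, using the same command as in Step 3:
bash
kubectl create secret generic yb-monitoring-secret --from-file=state=alert.json -n monitoring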