Observability Overview

This section provides a comprehensive reference for Prometheus metrics and alerting rules used to monitor system health, performance, and reliability across cloud platform components.

Metrics

The Metrics Documentation lists every Prometheus metric emitted by the platform components. Each entry gives the metric's type, collection frequency, labels, and a description. Use it for:

  • Building dashboards
  • Analyzing component behavior
  • Understanding what instrumentation is available

Alerts

The Alerts Documentation describes all alert rules configured in our Prometheus setup. Alerts are grouped by component and include severity levels, trigger conditions, and human-readable descriptions.

Use the Alerts Documentation to:

  • Understand why an alert fired
  • Debug active incidents
  • Tune alert sensitivity or thresholds
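To illustrate the fields mentioned above (component grouping, severity, trigger condition, and description), a rule might look roughly like the sketch below. It uses the standard Prometheus rule-file layout, but the group name, alert name, metric names, and the 5% threshold are all hypothetical and are not taken from the Alerts Documentation.

```yaml
# Illustrative sketch only: all names and values below are hypothetical.
groups:
  - name: example-component            # alerts are grouped by component
    rules:
      - alert: ExampleHighErrorRate
        # Trigger condition: share of failed requests, computed over 5-minute rates
        expr: |
          sum(rate(example_requests_failed_total[5m]))
            / sum(rate(example_requests_total[5m])) > 0.05
        for: 10m                       # condition must hold for 10 minutes before firing
        labels:
          severity: warning            # severity level
          component: example-component
        annotations:
          summary: "High error rate on example-component"
          description: "More than 5% of requests have been failing for at least 10 minutes."
```

When tuning sensitivity, the two usual levers in a rule of this shape are the comparison value in `expr` and the `for` duration.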

Threshold Reference

Some alerts reference templated threshold values from our Helm charts. These are documented separately in the Threshold Reference page.
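As a rough sketch of what a templated threshold can look like, the fragment below shows a Helm values entry and a rule template that reads it. The key `alerts.exampleErrorRateThreshold`, the file names, and the rule itself are assumptions made for illustration; the actual keys are the ones listed on the Threshold Reference page.

```yaml
# values.yaml (hypothetical key name)
alerts:
  exampleErrorRateThreshold: 0.05
---
# templates/example-rules.yaml (hypothetical file)
groups:
  - name: example-component
    rules:
      - alert: ExampleHighErrorRate
        # Helm substitutes the threshold when the chart is rendered
        expr: |
          sum(rate(example_requests_failed_total[5m]))
            / sum(rate(example_requests_total[5m])) > {{ .Values.alerts.exampleErrorRateThreshold }}
        for: 10m
        labels:
          severity: warning
```

With a layout like this, changing the threshold means overriding the value at install or upgrade time rather than editing the rule expression.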