Compute Node Metrics

This page documents Prometheus metrics emitted by Compute Nodes, which are the processes that execute distributed query fragments and manage data movement in Yellowbrick's architecture.

Purpose

These metrics provide detailed visibility into the health, stability, and behavior of individual compute nodes. They are used to:

Monitor compute node uptime and crash patterns
Diagnose issues with heartbeat timing and object store I/O
Detect exit codes and termination reasons (e.g., out-of-memory, unrecoverable signals)
Track YRD traffic, loader cache usage, and time synchronization skew

These insights are vital for debugging compute node instability, verifying cluster coordination, and building high-reliability monitoring dashboards.
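As a rough illustration of tracking crash patterns, the sketch below counts restarts from successive samples of `yb_worker_uptime_sec` for a single worker. It assumes, which this page does not state, that the uptime value drops back toward zero when a worker process restarts. The function name and the `(timestamp, uptime)` sample layout are hypothetical and not part of any Yellowbrick API.

```python
from typing import List, Tuple


def count_restarts(samples: List[Tuple[float, float]]) -> int:
    """Count restarts in a time-ordered list of (timestamp, uptime_sec) samples.

    Assumption: uptime only decreases between two scrapes when the worker
    process restarted in between.
    """
    restarts = 0
    # Compare each sample with the one that follows it.
    for (_, prev_uptime), (_, cur_uptime) in zip(samples, samples[1:]):
        if cur_uptime < prev_uptime:
            restarts += 1
    return restarts


if __name__ == "__main__":
    # Hypothetical samples taken at the 10s collection interval.
    history = [(0.0, 100.0), (10.0, 110.0), (20.0, 5.0), (30.0, 15.0)]
    print(count_restarts(history))
```

A restart that begins and ends between two scrapes is still counted, but several restarts within one interval would be counted as one.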

Metrics

Name	Type	Freq	Labels	Description
`yb_heartbeat_recv_ms_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Milliseconds from when a Heartbeat was sent until response
`yb_heartbeat_send_ms_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Milliseconds from when a Heartbeat was scheduled until sent
`yb_lime_heartbeat_elapsed_time_percent`	histogram	10s	cluster, worker_id	Elapsed time in percent of the heartbeat timeout
`yb_lime_heartbeat_error_total`	counter	10s	cluster, worker_id	Total number of heartbeat errors
`yb_lime_loader_cache_available_size`	gauge	10s	cluster	Estimated minimum available loader cache space across all workers in the compute cluster
`yb_obj_store_recv`	gauge	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Object store HTTP bytes received
`yb_obj_store_recv_fail_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Object store socket failed recv calls
`yb_obj_store_send`	gauge	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Object store HTTP bytes sent
`yb_obj_store_send_fail_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Object store socket failed send calls
`yb_timeout_queue_late_ms_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	Milliseconds the Timeout Queue was late
`yb_tsc_skew1000_percent_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	1000x Percent Skew between TSC and Timespec
`yb_worker_exit_code_cluster_reset`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to CLUSTER_RESET
`yb_worker_exit_code_configure_not_quiesced`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to CONFIGURE_NOT_QUIESCED
`yb_worker_exit_code_general_error`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to GENERAL_ERROR
`yb_worker_exit_code_ib_connection_down`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to IB_CONNECTION_DOWN
`yb_worker_exit_code_minidump_exception`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to MINIDUMP_EXCEPTION
`yb_worker_exit_code_minidump_repeated`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to MINIDUMP_REPEATED
`yb_worker_exit_code_numa_oom`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to NUMA_OUT_OF_MEMORY
`yb_worker_exit_code_other_reason`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to OTHER_REASON
`yb_worker_exit_code_other_signal`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to some other signal
`yb_worker_exit_code_recopy_worker`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to RECOPY_WORKER
`yb_worker_exit_code_sigabrt`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGABRT
`yb_worker_exit_code_sigalrm`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGALRM
`yb_worker_exit_code_sigbus`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGBUS
`yb_worker_exit_code_sigfpe`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGFPE
`yb_worker_exit_code_sighup`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGHUP
`yb_worker_exit_code_sigill`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGILL
`yb_worker_exit_code_sigint`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGINT
`yb_worker_exit_code_sigkill`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGKILL
`yb_worker_exit_code_sigpipe`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGPIPE
`yb_worker_exit_code_sigsegv`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGSEGV (segmentation fault)
`yb_worker_exit_code_sigtrap`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to signal SIGTRAP
`yb_worker_exit_code_success`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to SUCCESS
`yb_worker_exit_code_unknown`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to unknown
`yb_worker_exit_code_ybd_assert`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	exit due to YBD_ASSERT
`yb_worker_last_exit_code`	gauge	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	The last exit code or -1
`yb_worker_uptime_sec`	counter	10s	instance_uuid, cluster, worker_logical_id, worker_uuid	Worker uptime in seconds
`yb_yrd_re_tx_bytes_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	YRD Bytes re-transmitted
`yb_yrd_tx_bytes_total`	counter	10s	index, instance_uuid, cluster, worker_logical_id, worker_uuid	YRD Bytes transmitted
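
The counters in this table are cumulative, so they are usually read as increases between two scrapes rather than as absolute values. The following minimal sketch compares the growth of `yb_yrd_re_tx_bytes_total` with the growth of `yb_yrd_tx_bytes_total` over one interval. It assumes that both counters are sampled at the same scrape times and that a decrease in a counter means the worker restarted and the counter reset. The function names are illustrative only. This page does not say whether transmitted bytes include re-transmitted bytes, so the result is best treated as a relative indicator.

```python
def counter_increase(prev: float, cur: float) -> float:
    """Increase of a cumulative counter between two scrapes.

    Assumption: if the value went down, the counter was reset (for example
    by a process restart), so the current value is taken as the increase.
    """
    return cur if cur < prev else cur - prev


def retransmit_ratio(tx_prev: float, tx_cur: float,
                     re_prev: float, re_cur: float) -> float:
    """Ratio of re-transmitted to transmitted YRD bytes over one interval."""
    tx = counter_increase(tx_prev, tx_cur)
    re_tx = counter_increase(re_prev, re_cur)
    # Avoid dividing by zero when nothing was transmitted in the interval.
    return re_tx / tx if tx > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical values from two consecutive scrapes of one worker.
    ratio = retransmit_ratio(tx_prev=1000, tx_cur=3000, re_prev=10, re_cur=110)
    print(f"{ratio:.2%}")
```

The same increase calculation applies to the other `_total` counters, such as `yb_obj_store_recv_fail_total` or `yb_lime_heartbeat_error_total`.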
