Compute Node Metrics
This page documents Prometheus metrics emitted by Compute Nodes, which are the processes that execute distributed query fragments and manage data movement in Yellowbrick's architecture.
Purpose
These metrics provide detailed visibility into the health, stability, and behavior of individual compute nodes. They are used to:
- Monitor compute node uptime and crash patterns
- Diagnose issues with heartbeat timing and object store I/O
- Detect exit codes and termination reasons (e.g., out-of-memory, unrecoverable signals)
- Track YRD traffic, loader cache usage, and time synchronization skew
These insights are vital for debugging compute node instability, verifying cluster coordination, and building high-reliability monitoring dashboards.
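As an illustration of detecting termination reasons, the following minimal sketch scans Prometheus text-format output for `yb_worker_exit_code_*` counters and reports the non-success reasons per worker. The sample scrape text, its label values, and the helper name `abnormal_exits` are hypothetical and only serve the example; a real deployment would more likely query Prometheus directly.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: label values and counts are invented for illustration.
SAMPLE = '''\
yb_worker_exit_code_success{cluster="c1",worker_logical_id="0"} 3
yb_worker_exit_code_numa_oom{cluster="c1",worker_logical_id="0"} 1
yb_worker_exit_code_sigsegv{cluster="c1",worker_logical_id="1"} 2
yb_worker_uptime_sec{cluster="c1",worker_logical_id="1"} 120
'''

# Matches only the per-reason exit counters and captures reason, labels, and value.
LINE_RE = re.compile(r'^yb_worker_exit_code_(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def abnormal_exits(text):
    """Return {worker_logical_id: {reason: count}} for non-success exits seen at least once."""
    result = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an exit-code counter
        reason = match.group(1)
        labels = dict(LABEL_RE.findall(match.group(2)))
        value = float(match.group(3))
        if reason == "success" or value <= 0:
            continue
        worker = labels.get("worker_logical_id", "?")
        result[worker][reason] = value
    return dict(result)


if __name__ == "__main__":
    for worker, reasons in abnormal_exits(SAMPLE).items():
        print(worker, reasons)
```

Grouping by `worker_logical_id` is one possible choice; `worker_uuid` or `instance_uuid` would work the same way if a different granularity is needed.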
Metrics
| Name | Type | Freq | Labels | Description |
|---|---|---|---|---|
| yb_heartbeat_recv_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was sent until response |
| yb_heartbeat_send_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was scheduled until sent |
| yb_lime_heartbeat_elapsed_time_percent | histogram | 10s | cluster, worker_id | Elapsed time in percent of the heartbeat timeout |
| yb_lime_heartbeat_error_total | counter | 10s | cluster, worker_id | Total number of heartbeat errors |
| yb_lime_loader_cache_available_size | gauge | 10s | cluster | Estimated minimum available loader cache space across all workers in the compute cluster |
| yb_obj_store_recv | gauge | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received |
| yb_obj_store_recv_fail_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store socket failed recv calls |
| yb_obj_store_send | gauge | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent |
| yb_obj_store_send_fail_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store socket failed send calls |
| yb_timeout_queue_late_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds the Timeout Queue was late |
| yb_tsc_skew1000_percent_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | 1000x Percent Skew between TSC and Timespec |
| yb_worker_exit_code_cluster_reset | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to CLUSTER_RESET |
| yb_worker_exit_code_configure_not_quiesced | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to CONFIGURE_NOT_QUIESCED |
| yb_worker_exit_code_general_error | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to GENERAL_ERROR |
| yb_worker_exit_code_ib_connection_down | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to IB_CONNECTION_DOWN |
| yb_worker_exit_code_minidump_exception | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_EXCEPTION |
| yb_worker_exit_code_minidump_repeated | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_REPEATED |
| yb_worker_exit_code_numa_oom | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to NUMA_OUT_OF_MEMORY |
| yb_worker_exit_code_other_reason | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to OTHER_REASON |
| yb_worker_exit_code_other_signal | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to some other signal |
| yb_worker_exit_code_recopy_worker | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to RECOPY_WORKER |
| yb_worker_exit_code_sigabrt | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGABRT |
| yb_worker_exit_code_sigalrm | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGALRM |
| yb_worker_exit_code_sigbus | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGBUS |
| yb_worker_exit_code_sigfpe | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGFPE |
| yb_worker_exit_code_sighup | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGHUP |
| yb_worker_exit_code_sigill | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGILL |
| yb_worker_exit_code_sigint | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGINT |
| yb_worker_exit_code_sigkill | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGKILL |
| yb_worker_exit_code_sigpipe | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGPIPE |
| yb_worker_exit_code_sigsegv | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGSEGV (segmentation fault) |
| yb_worker_exit_code_sigtrap | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGTRAP |
| yb_worker_exit_code_success | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to SUCCESS |
| yb_worker_exit_code_unknown | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to unknown |
| yb_worker_exit_code_ybd_assert | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to YBD_ASSERT |
| yb_worker_last_exit_code | gauge | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | The last exit code or -1 |
| yb_worker_uptime_sec | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | Worker uptime in seconds |
| yb_yrd_re_tx_bytes_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | YRD Bytes re-transmitted |
| yb_yrd_tx_bytes_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | YRD Bytes transmitted |
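Because most of the metrics above are cumulative counters, they are usually read as increases over a window rather than as raw values. The sketch below shows one way to do that for the YRD traffic counters, computing a re-transmit ratio from successive samples of `yb_yrd_tx_bytes_total` and `yb_yrd_re_tx_bytes_total`. It assumes standard Prometheus counter behavior (a value that drops is treated as a restart from zero), and it assumes that dividing re-transmitted bytes by transmitted bytes is a meaningful ratio; this page does not state whether re-transmitted bytes are also counted in the transmitted total. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across successive counter samples.

    A sample lower than its predecessor is treated as a counter reset
    (for example after a worker restart), so the new value counts in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def retransmit_ratio(tx_samples, re_tx_samples):
    """Re-transmitted bytes divided by transmitted bytes over the sampled window.

    Returns None when nothing was transmitted, to avoid dividing by zero.
    """
    tx = counter_increase(tx_samples)
    if tx == 0:
        return None
    return counter_increase(re_tx_samples) / tx


if __name__ == "__main__":
    # Hypothetical samples taken at the 10s frequency listed in the table;
    # the drop in the third sample simulates a worker restart.
    tx = [1000, 3000, 500, 2500]
    re_tx = [10, 30, 5, 25]
    print(retransmit_ratio(tx, re_tx))
```

The same `counter_increase` helper applies to the other `_total` counters, for example to turn `yb_heartbeat_recv_ms_total` into milliseconds accumulated per interval.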