Compute Node Metrics

This page documents Prometheus metrics emitted by Compute Nodes, which are the processes that execute distributed query fragments and manage data movement in Yellowbrick's architecture.

Purpose

These metrics provide detailed visibility into the health, stability, and behavior of individual compute nodes. They are used to:

  • Monitor compute node uptime and crash patterns
  • Diagnose issues with heartbeat timing and object store I/O
  • Detect exit codes and termination reasons (e.g., out-of-memory, unrecoverable signals)
  • Track YRD traffic, loader cache usage, and time synchronization skew

These insights are vital for debugging compute node instability, verifying cluster coordination, and building high-reliability monitoring dashboards.
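As an illustration of detecting termination reasons, the sketch below compares two scrapes of the `yb_worker_exit_code_*` counters and reports which exit reasons increased. It assumes each scrape has already been reduced to a plain dictionary of metric name to value for a single worker. The helper name `new_exits` and the sample values are hypothetical and not part of any Yellowbrick tooling.

```python
# Prefix shared by the per-reason exit counters documented on this page.
EXIT_PREFIX = "yb_worker_exit_code_"


def new_exits(previous, current):
    """Return {reason: increase} for exit counters that grew between two scrapes.

    Both arguments map metric names to values for one worker.
    (Assumed input shape; adapt to however your scraper stores samples.)
    """
    increases = {}
    for name, value in current.items():
        if not name.startswith(EXIT_PREFIX):
            continue  # ignore uptime, heartbeat, and other metrics
        delta = value - previous.get(name, 0.0)
        if delta < 0:
            # Assumption: a lower value means the counter restarted from zero,
            # so the current value is the whole increase.
            delta = value
        if delta > 0:
            increases[name[len(EXIT_PREFIX):]] = delta
    return increases


# Made-up sample values for illustration only.
before = {"yb_worker_exit_code_sigsegv": 1, "yb_worker_exit_code_numa_oom": 0}
after = {"yb_worker_exit_code_sigsegv": 1, "yb_worker_exit_code_numa_oom": 2}
print(new_exits(before, after))
```

In a Prometheus deployment the same comparison would more commonly be written as a query over the counters; the Python form is shown only to make the logic explicit.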

Metrics

| Name | Type | Freq | Labels | Version Introduced | Version Deprecated | Description |
|---|---|---|---|---|---|---|
| yb_heartbeat_recv_ms_count | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was sent until response (count of samples since last worker start) |
| yb_heartbeat_recv_ms_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was sent until response (maximum sample since last worker start) |
| yb_heartbeat_recv_ms_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was sent until response (minimum sample since last worker start) |
| yb_heartbeat_recv_ms_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was sent until response (sum since last worker start) |
| yb_heartbeat_send_ms_count | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was scheduled until sent (count of samples since last worker start) |
| yb_heartbeat_send_ms_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was scheduled until sent (maximum sample since last worker start) |
| yb_heartbeat_send_ms_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was scheduled until sent (minimum sample since last worker start) |
| yb_heartbeat_send_ms_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds from when a Heartbeat was scheduled until sent (sum since last worker start) |
| yb_lime_heartbeat_elapsed_time_percent | histogram | 10s | cluster, cluster_name, worker_id | 7.3.0 | - | Elapsed time in percent of the heartbeat timeout |
| yb_lime_heartbeat_error_total | counter | 10s | cluster, cluster_name, worker_id | 7.3.0 | - | Total number of heartbeat errors |
| yb_lime_loader_cache_available_size | gauge | 10s | cluster, cluster_name | 7.3.0 | - | Estimated minimum available loader cache space across all workers in the compute cluster |
| yb_obj_store_recv_count_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes received (count of samples since last worker start) |
| yb_obj_store_recv_fail_total | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | Object store socket failed receive calls |
| yb_obj_store_recv_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes received (maximum sample since last worker start) |
| yb_obj_store_recv_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes received (minimum sample since last worker start) |
| yb_obj_store_recv_sum_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes received (sum since last worker start) |
| yb_obj_store_send_count_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes sent (count of samples since last worker start) |
| yb_obj_store_send_fail_total | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | Object store socket failed send calls |
| yb_obj_store_send_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes sent (maximum sample since last worker start) |
| yb_obj_store_send_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes sent (minimum sample since last worker start) |
| yb_obj_store_send_sum_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Object store HTTP bytes sent (sum since last worker start) |
| yb_timeout_queue_late_ms_count | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds the timeout queue was late (count of samples since last worker start) |
| yb_timeout_queue_late_ms_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds the timeout queue was late (maximum sample since last worker start) |
| yb_timeout_queue_late_ms_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds the timeout queue was late (minimum sample since last worker start) |
| yb_timeout_queue_late_ms_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | Milliseconds the timeout queue was late (sum since last worker start) |
| yb_tsc_skew1000_percent_count | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | 1000x Percent Skew between TSC and Timespec (count of samples since last worker start) |
| yb_tsc_skew1000_percent_max | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | 1000x Percent Skew between TSC and Timespec (maximum sample since last worker start) |
| yb_tsc_skew1000_percent_min | gauge | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | 1000x Percent Skew between TSC and Timespec (minimum sample since last worker start) |
| yb_tsc_skew1000_percent_total | counter | 10s | cluser, cluster_name, worker_logical_id, worker_uuid | 7.3.0 | - | 1000x Percent Skew between TSC and Timespec (sum since last worker start) |
| yb_worker_exit_code_cluster_reset | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to CLUSTER_RESET |
| yb_worker_exit_code_configure_not_quiesced | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to CONFIGURE_NOT_QUIESCED |
| yb_worker_exit_code_general_error | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to GENERAL_ERROR |
| yb_worker_exit_code_ib_connection_down | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to IB_CONNECTION_DOWN |
| yb_worker_exit_code_minidump_exception | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to MINIDUMP_EXCEPTION |
| yb_worker_exit_code_minidump_repeated | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to MINIDUMP_REPEATED |
| yb_worker_exit_code_numa_oom | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to NUMA_OUT_OF_MEMORY |
| yb_worker_exit_code_other_reason | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to OTHER_REASON |
| yb_worker_exit_code_other_signal | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to some other signal |
| yb_worker_exit_code_recopy_worker | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to RECOPY_WORKER |
| yb_worker_exit_code_sigabrt | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGABRT |
| yb_worker_exit_code_sigalrm | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGALRM |
| yb_worker_exit_code_sigbus | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGBUS |
| yb_worker_exit_code_sigfpe | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGFPE |
| yb_worker_exit_code_sighup | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGHUP |
| yb_worker_exit_code_sigill | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGILL |
| yb_worker_exit_code_sigint | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGINT |
| yb_worker_exit_code_sigkill | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGKILL |
| yb_worker_exit_code_sigpipe | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGPIPE |
| yb_worker_exit_code_sigsegv | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGSEGV (segmentation fault) |
| yb_worker_exit_code_sigtrap | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to signal SIGTRAP |
| yb_worker_exit_code_success | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to SUCCESS |
| yb_worker_exit_code_unknown | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to unknown |
| yb_worker_exit_code_ybd_assert | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | exit due to YBD_ASSERT |
| yb_worker_last_exit_code | gauge | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | The last exit code or -1 |
| yb_worker_uptime_sec | counter | 10s | cluster, worker_logical_id, worker_uuid | 7.3.0 | - | Worker uptime in seconds |
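
Several metrics above are published as a sum and a sample count, both accumulated since the last worker start (for example `yb_heartbeat_recv_ms_total` and `yb_heartbeat_recv_ms_count`). A minimal sketch of turning two consecutive scrapes of such a pair into a mean for that interval follows. The function names and sample numbers are hypothetical. The `skew_percent` helper reflects one reading of the "1000x Percent Skew" description, namely that the raw value is the percentage multiplied by 1000; treat that as an assumption.

```python
def interval_mean(prev_total, prev_count, cur_total, cur_count):
    """Mean sample value between two scrapes of a sum/count pair.

    Returns None when no new samples were recorded in the interval.
    """
    d_total = cur_total - prev_total
    d_count = cur_count - prev_count
    if d_total < 0 or d_count < 0:
        # Both series are "since last worker start", so a drop suggests the
        # worker restarted; use the current values as the interval's data.
        d_total, d_count = cur_total, cur_count
    if d_count <= 0:
        return None
    return d_total / d_count


def skew_percent(raw_skew1000):
    """Convert a yb_tsc_skew1000_percent_* sample to a percentage.

    Assumption: the raw value is the percent skew scaled by 1000.
    """
    return raw_skew1000 / 1000.0


# Made-up scrape values: 20 new heartbeat samples totalling 300 ms.
print(interval_mean(1200.0, 100, 1500.0, 120))
```

Averaging over the interval between scrapes, rather than dividing the lifetime sum by the lifetime count, keeps an old latency spike from masking current behavior.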