Compute Node Metrics

This page documents Prometheus metrics emitted by Compute Nodes, which are the processes that execute distributed query fragments and manage data movement in Yellowbrick's architecture.

Purpose

These metrics provide detailed visibility into the health, stability, and behavior of individual compute nodes. They are used to:

  • Monitor compute node uptime and crash patterns
  • Diagnose issues with heartbeat timing and object store I/O
  • Detect exit codes and termination reasons (e.g., out-of-memory, unrecoverable signals)
  • Track YRD traffic, loader cache usage, and time synchronization skew

These insights are vital for debugging compute node instability, verifying cluster coordination, and building high-reliability monitoring dashboards.
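As an illustration of crash-pattern monitoring, the sketch below compares two scrapes of the `yb_worker_exit_code_*` counters listed in the Metrics table and reports which abnormal exit reasons increased. The snapshot dictionaries and the helper name `abnormal_exit_deltas` are hypothetical, and treating only `success` as benign is an assumption; adapt both to how your monitoring stack collects samples.

```python
EXIT_PREFIX = "yb_worker_exit_code_"
# Assumption: only the SUCCESS exit counter is considered benign.
BENIGN = {"yb_worker_exit_code_success"}


def abnormal_exit_deltas(prev, curr):
    """Return {reason: increase} for exit-code counters that grew between two scrapes.

    `prev` and `curr` are hypothetical snapshots mapping metric name -> counter value
    for a single worker.
    """
    deltas = {}
    for name, value in curr.items():
        if not name.startswith(EXIT_PREFIX) or name in BENIGN:
            continue
        before = prev.get(name, 0)
        # A counter that went down was reset, so count its new value from zero.
        increase = value - before if value >= before else value
        if increase > 0:
            deltas[name[len(EXIT_PREFIX):]] = increase
    return deltas


prev = {
    "yb_worker_exit_code_success": 3,
    "yb_worker_exit_code_sigsegv": 0,
    "yb_worker_exit_code_numa_oom": 1,
}
curr = {
    "yb_worker_exit_code_success": 4,
    "yb_worker_exit_code_sigsegv": 1,
    "yb_worker_exit_code_numa_oom": 1,
}
print(abnormal_exit_deltas(prev, curr))  # {'sigsegv': 1}
```

In a Prometheus deployment the same idea is usually expressed as an alert rule over counter increases; the Python form is only meant to make the comparison logic explicit.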

Metrics

| Name | Type | Freq | Labels | Description |
|---|---|---|---|---|
| yb_heartbeat_recv_ms_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was sent until response |
| yb_heartbeat_send_ms_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was scheduled until sent |
| yb_lime_heartbeat_elapsed_time_percent | histogram | 10s | cluster, cluster_name, worker_id | Elapsed time in percent of the heartbeat timeout |
| yb_lime_heartbeat_error_total | counter | 10s | cluster, cluster_name, worker_id | Total number of heartbeat errors |
| yb_lime_loader_cache_available_size | gauge | 10s | cluster, cluster_name | Estimated minimum available loader cache space across all workers in the compute cluster |
| yb_obj_store_recv_count_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received (count of samples since last worker start) |
| yb_obj_store_recv_fail_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store socket failed recv calls |
| yb_obj_store_recv_max | gauge | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received (maximum sample since last worker start) |
| yb_obj_store_recv_min | gauge | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received (minimum sample since last worker start) |
| yb_obj_store_recv_sum_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received (sum since last worker start) |
| yb_obj_store_send_count_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent (count of samples since last worker start) |
| yb_obj_store_send_fail_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store socket failed send calls |
| yb_obj_store_send_max | gauge | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent (maximum sample since last worker start) |
| yb_obj_store_send_min | gauge | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent (minimum sample since last worker start) |
| yb_obj_store_send_sum_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent (sum since last worker start) |
| yb_timeout_queue_late_ms_total | counter | 10s | cluster, worker_logical_id, worker_uuid | Milliseconds the Timeout Queue was late |
| yb_tsc_skew1000_percent_total | counter | 10s | cluster, worker_logical_id, worker_uuid | 1000x Percent Skew between TSC and Timespec |
| yb_worker_exit_code_cluster_reset | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to CLUSTER_RESET |
| yb_worker_exit_code_configure_not_quiesced | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to CONFIGURE_NOT_QUIESCED |
| yb_worker_exit_code_general_error | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to GENERAL_ERROR |
| yb_worker_exit_code_ib_connection_down | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to IB_CONNECTION_DOWN |
| yb_worker_exit_code_minidump_exception | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_EXCEPTION |
| yb_worker_exit_code_minidump_repeated | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_REPEATED |
| yb_worker_exit_code_numa_oom | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to NUMA_OUT_OF_MEMORY |
| yb_worker_exit_code_other_reason | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to OTHER_REASON |
| yb_worker_exit_code_other_signal | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to some other signal |
| yb_worker_exit_code_recopy_worker | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to RECOPY_WORKER |
| yb_worker_exit_code_sigabrt | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGABRT |
| yb_worker_exit_code_sigalrm | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGALRM |
| yb_worker_exit_code_sigbus | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGBUS |
| yb_worker_exit_code_sigfpe | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGFPE |
| yb_worker_exit_code_sighup | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGHUP |
| yb_worker_exit_code_sigill | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGILL |
| yb_worker_exit_code_sigint | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGINT |
| yb_worker_exit_code_sigkill | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGKILL |
| yb_worker_exit_code_sigpipe | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGPIPE |
| yb_worker_exit_code_sigsegv | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGSEGV (segmentation fault) |
| yb_worker_exit_code_sigtrap | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to signal SIGTRAP |
| yb_worker_exit_code_success | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to SUCCESS |
| yb_worker_exit_code_unknown | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to unknown |
| yb_worker_exit_code_ybd_assert | counter | 10s | cluster, worker_logical_id, worker_uuid | exit due to YBD_ASSERT |
| yb_worker_last_exit_code | gauge | 10s | cluster, worker_logical_id, worker_uuid | The last exit code or -1 |
| yb_worker_uptime_sec | counter | 10s | cluster, worker_logical_id, worker_uuid | Worker uptime in seconds |
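
The object store metrics come in sum and count pairs (for example `yb_obj_store_recv_sum_total` and `yb_obj_store_recv_count_total`), so an average transfer size per sample over an interval can be derived from two scrapes. The sketch below shows that arithmetic. The function name and the input values are hypothetical, and it assumes both counters are read for the same worker at the same two points in time.

```python
def mean_bytes_per_sample(sum_prev, count_prev, sum_curr, count_curr):
    """Average bytes per sample between two scrapes of a sum/count counter pair.

    Returns None when no new samples arrived or when the counters appear to
    have been reset (for example, after a worker restart).
    """
    d_count = count_curr - count_prev
    d_sum = sum_curr - sum_prev
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


# Hypothetical values: 600,000 additional bytes over 50 additional samples.
print(mean_bytes_per_sample(1_000_000, 100, 1_600_000, 150))  # 12000.0
```

Returning `None` rather than zero keeps a restart or an idle interval from being mistaken for a genuine average of zero bytes.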