Compute Node Metrics

This page documents Prometheus metrics emitted by Compute Nodes, which are the processes that execute distributed query fragments and manage data movement in Yellowbrick's architecture.

Purpose

These metrics provide detailed visibility into the health, stability, and behavior of individual compute nodes. They are used to:

  • Monitor compute node uptime and crash patterns
  • Diagnose issues with heartbeat timing and object store I/O
  • Detect exit codes and termination reasons (e.g., out-of-memory, unrecoverable signals)
  • Track YRD traffic, loader cache usage, and time synchronization skew

These insights are vital for debugging compute node instability, verifying cluster coordination, and building high-reliability monitoring dashboards.
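As an illustration of the crash-pattern monitoring described above, here is a minimal Python sketch that compares two scrapes of one worker's `yb_worker_exit_code_*` counters (listed under Metrics) and reports which exit reasons increased. The function name `exit_reason_deltas` and the plain-dictionary input format are assumptions made for this example, not part of any Yellowbrick or Prometheus API. The sketch also assumes that a counter which decreases between scrapes was reset and restarted from zero.

```python
from typing import Dict

# Common prefix of the per-reason exit counters in the Metrics table.
EXIT_PREFIX = "yb_worker_exit_code_"


def exit_reason_deltas(prev: Dict[str, float], curr: Dict[str, float]) -> Dict[str, float]:
    """Return the exit reasons whose counters increased between two scrapes.

    `prev` and `curr` map metric names to sample values for one worker
    (hypothetical input format chosen for this sketch).
    """
    deltas: Dict[str, float] = {}
    for name, value in curr.items():
        if not name.startswith(EXIT_PREFIX):
            continue  # ignore metrics that are not exit-code counters
        before = prev.get(name, 0.0)
        # Assumption: a counter that went down was reset, so count from zero.
        increase = value - before if value >= before else value
        if increase > 0:
            deltas[name[len(EXIT_PREFIX):]] = increase
    return deltas


if __name__ == "__main__":
    previous = {"yb_worker_exit_code_sigsegv": 1, "yb_worker_exit_code_success": 4}
    current = {"yb_worker_exit_code_sigsegv": 2, "yb_worker_exit_code_success": 4}
    print(exit_reason_deltas(previous, current))
```

In a dashboard or alert rule the same comparison would normally be expressed in the monitoring system's query language; the sketch only makes the counter arithmetic explicit.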

Metrics

| Name | Type | Freq | Labels | Description |
| --- | --- | --- | --- | --- |
| yb_heartbeat_recv_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was sent until response |
| yb_heartbeat_send_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds from when a Heartbeat was scheduled until sent |
| yb_lime_heartbeat_elapsed_time_percent | histogram | 10s | cluster, worker_id | Elapsed time in percent of the heartbeat timeout |
| yb_lime_heartbeat_error_total | counter | 10s | cluster, worker_id | Total number of heartbeat errors |
| yb_lime_loader_cache_available_size | gauge | 10s | cluster | Estimated minimum available loader cache space across all workers in the compute cluster |
| yb_obj_store_recv | gauge | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store HTTP bytes received |
| yb_obj_store_recv_fail_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store socket failed recv calls |
| yb_obj_store_send | gauge | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store HTTP bytes sent |
| yb_obj_store_send_fail_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Object store socket failed send calls |
| yb_timeout_queue_late_ms_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | Milliseconds the Timeout Queue was late |
| yb_tsc_skew1000_percent_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | 1000x Percent Skew between TSC and Timespec |
| yb_worker_exit_code_cluster_reset | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to CLUSTER_RESET |
| yb_worker_exit_code_configure_not_quiesced | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to CONFIGURE_NOT_QUIESCED |
| yb_worker_exit_code_general_error | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to GENERAL_ERROR |
| yb_worker_exit_code_ib_connection_down | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to IB_CONNECTION_DOWN |
| yb_worker_exit_code_minidump_exception | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_EXCEPTION |
| yb_worker_exit_code_minidump_repeated | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to MINIDUMP_REPEATED |
| yb_worker_exit_code_numa_oom | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to NUMA_OUT_OF_MEMORY |
| yb_worker_exit_code_other_reason | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to OTHER_REASON |
| yb_worker_exit_code_other_signal | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to some other signal |
| yb_worker_exit_code_recopy_worker | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to RECOPY_WORKER |
| yb_worker_exit_code_sigabrt | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGABRT |
| yb_worker_exit_code_sigalrm | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGALRM |
| yb_worker_exit_code_sigbus | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGBUS |
| yb_worker_exit_code_sigfpe | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGFPE |
| yb_worker_exit_code_sighup | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGHUP |
| yb_worker_exit_code_sigill | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGILL |
| yb_worker_exit_code_sigint | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGINT |
| yb_worker_exit_code_sigkill | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGKILL |
| yb_worker_exit_code_sigpipe | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGPIPE |
| yb_worker_exit_code_sigsegv | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGSEGV (segmentation fault) |
| yb_worker_exit_code_sigtrap | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to signal SIGTRAP |
| yb_worker_exit_code_success | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to SUCCESS |
| yb_worker_exit_code_unknown | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to unknown |
| yb_worker_exit_code_ybd_assert | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | exit due to YBD_ASSERT |
| yb_worker_last_exit_code | gauge | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | The last exit code or -1 |
| yb_worker_uptime_sec | counter | 10s | instance_uuid, cluster, worker_logical_id, worker_uuid | Worker uptime in seconds |
| yb_yrd_re_tx_bytes_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | YRD Bytes re-transmitted |
| yb_yrd_tx_bytes_total | counter | 10s | index, instance_uuid, cluster, worker_logical_id, worker_uuid | YRD Bytes transmitted |
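
One way to use the two YRD counters together is to derive a retransmission ratio over a scrape interval. The Python sketch below is illustrative only: the helper names `counter_increase` and `yrd_retransmit_ratio` are hypothetical, and it assumes that dividing the increase in `yb_yrd_re_tx_bytes_total` by the increase in `yb_yrd_tx_bytes_total` gives a meaningful indicator, which this page does not state. It also assumes a counter that decreases between scrapes was reset to zero, for example after a worker restart.

```python
from typing import Optional


def counter_increase(prev: float, curr: float) -> float:
    """Increase of a monotonic counter between two scrapes.

    Assumption: if the value went down, the counter was reset and
    restarted from zero, so the current value is the whole increase.
    """
    return curr - prev if curr >= prev else curr


def yrd_retransmit_ratio(prev_tx: float, curr_tx: float,
                         prev_re_tx: float, curr_re_tx: float) -> Optional[float]:
    """Re-transmitted bytes divided by transmitted bytes for one interval.

    Returns None when no bytes were transmitted, to avoid dividing by zero.
    """
    tx = counter_increase(prev_tx, curr_tx)
    re_tx = counter_increase(prev_re_tx, curr_re_tx)
    if tx == 0:
        return None
    return re_tx / tx


if __name__ == "__main__":
    # Two consecutive samples of each counter, taken 10s apart (made-up values).
    print(yrd_retransmit_ratio(1000, 3000, 10, 30))
```

Because both metrics carry the same labels (index, instance_uuid, cluster, worker_logical_id, worker_uuid), the ratio is only meaningful when the samples being compared come from the same label set.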