Data Distribution

Introduction to Data Distribution in MPP Data Warehouses

In Massively Parallel Processing (MPP) data warehouses like Yellowbrick, data distribution is a core architectural concept. It determines how rows of a table are spread across multiple compute nodes in the cluster.

Each compute node in an MPP system operates independently and processes its portion of the data in parallel with others. Effective data distribution ensures balanced workloads, minimal data movement during query execution, and fast, scalable performance.

MPP databases typically distribute data using one of the following strategies:

Hash Distribution

Hash distribution assigns rows to nodes based on the hash of one column. This is the most common and effective distribution method for large fact tables.

Benefits:

Enables co-location of related rows across tables when the same distribution key is used.
Minimizes runtime data movement during joins and aggregations.
Distributes data evenly when the key has high cardinality and low skew.

Example:

sql

CREATE TABLE sales (
  customer_id INT,
  order_id INT,
  amount DECIMAL(10,2)
)
DISTRIBUTE ON (customer_id);

Random (Round-Robin) Distribution

Random distribution assigns rows to nodes arbitrarily, typically in a round-robin fashion.

When to use:

When there’s no obvious column with high cardinality and low skew.
For small dimension tables where even distribution is more important than co-location.

Tradeoffs:

Even data distribution, but less efficient joins, since related rows may be on different nodes.
May lead to redistribution at runtime if joining to a hash-distributed table.

Replicated Distribution

Replicated distribution copies the entire table to every compute node.

Best for:

Small lookup or reference tables that are frequently joined to large fact tables.
Avoiding data movement during joins, since each node has local access.

Important considerations:

Should only be used for small tables, as replication adds memory and network overhead.
Inefficient and potentially harmful for large datasets.

INFO

There's an interesting edge-case in Yellowbrick when every table involved in a query is replicated. In this instance, the query execution is not parallelized across nodes. Instead, execution takes place on a single node only. For workloads involving small tables and requiring high query throughput, the use of all-replicated tables may increase query concurrency.

Data Distribution in Kubernetes Deployments

Data distribution is subtly different in Kubernetes deployments, where compute is separated from storage, compared to appliance deployments. In Kubernetes deployments the shard files are stored centrally in object storage and are copied to the local NVMe caches on each compute node at query execution time. Which compute node receives which shard is determined when the data is loaded into Yellowbrick in the case of a hash-distributed table, and is related to the number of nodes in the cluster. This means that if the cluster size changes after data is loaded, the data becomes mis-distributed with respect to the new cluster configuration, and extra overhead may be incurred during query. We recommend that data is always loaded and queried by clusters containing the same number of nodes.

Why Distribution Strategy Matters

Choosing the right distribution strategy impacts:

Query performance – Co-located joins are faster than distributed joins.
Workload balance – Poor distribution can lead to execution skew where one node does more work.
Storage efficiency – At-rest data skew can cause uneven disk usage and maintenance challenges.
Network overhead – Mismatched distribution keys result in runtime redistribution, which slows down execution.

In summary, thoughtful table distribution is essential for optimizing performance and resource utilization in Yellowbrick and other MPP systems. Always consider the size, join patterns, and cardinality of your data when selecting a distribution method.

Read the Distributing Data How-To Guide for practical guidance and examples.

Workload Management

Distributing Data

Bulk Loading Tables

Bulk Load Examples

Running a Bulk Load

Loading Tables from Parquet Files

ybload Command

Load Data with SQL

Loading Data from Object Storage

Loading from Amazon S3

Loading from Azure Blob Storage

Loading Tables with Spark

Setting up and Running a Spark Job

Setting Up the ybrelay Service

Trickle Loading Data via JDBC

Unloading Data to Object Storage

Unloading Data to Parquet Files

ybunload Command

Installing ybtools

Setting Up a Database Connection

Configuring SSL/TLS for Tools and Drivers

Secure Connections for ODBC/JDBC Clients and ybsql

Appliance

Appliance: Disk Encryption

Setting Up Encrypted Drives

Remote Diagnostics

System Alerts

Creating an Alert Endpoint

Using the System Management Console

ybcli Reference

ybcli: config

Cloud

Configuring

Vanity DNS

Yellowbrick Manager

Installing

CLI Install Instructions

Public Install Instructions

Private Install Instructions

Self-Managed Install Instructions

Permissions

Kubernetes Guides

Observability

Observability Alerts

Observability Metrics

Databases

Backup & Restore

Overview

ybbackup Commands

ybbackupctl Commands

ybrestore Commands

Database Replication

Managing Replication

Setting Up Replication

Encrypting Sensitive Data

LDAP Integration

LDAP Authentication

Synchronizing Users and Groups

Metering

System Views

sys.lock

Workload Management

Creating WLM Resource Pools

Creating WLM Rules

Compatibility Parameters

Data Processing and Formatting

Feature Enablement

General

Tuning

Yellowbrick Row Store (YRS) Alerting Parameters

ybsql \copy Command

ybsql Properties and Variables

SQL Commands

CREATE EXTERNAL FORMAT

CREATE EXTERNAL TABLE

CREATE TABLE

GRANT

Plan Hinting

SELECT

FROM Clause