Skip to content

Data Distribution

Introduction to Data Distribution in MPP Data Warehouses

In Massively Parallel Processing (MPP) data warehouses like Yellowbrick, data distribution is a core architectural concept. It determines how rows of a table are spread across multiple compute nodes in the cluster.

Each compute node in an MPP system operates independently and processes its portion of the data in parallel with others. Effective data distribution ensures balanced workloads, minimal data movement during query execution, and fast, scalable performance.

MPP databases typically distribute data using one of the following strategies:

Hash Distribution

Hash distribution assigns rows to nodes based on the hash of one column. This is the most common and effective distribution method for large fact tables.

Benefits:

  • Enables co-location of related rows across tables when the same distribution key is used.
  • Minimizes runtime data movement during joins and aggregations.
  • Distributes data evenly when the key has high cardinality and low skew.

Example:

sql
CREATE TABLE sales (
  customer_id INT,
  order_id INT,
  amount DECIMAL(10,2)
)
DISTRIBUTE ON (customer_id);

Random (Round-Robin) Distribution

Random distribution assigns rows to nodes arbitrarily, typically in a round-robin fashion.

When to use:

  • When there’s no obvious column with high cardinality and low skew.
  • For small dimension tables where even distribution is more important than co-location.

Tradeoffs:

  • Even data distribution, but less efficient joins, since related rows may be on different nodes.
  • May lead to redistribution at runtime if joining to a hash-distributed table.

Replicated Distribution

Replicated distribution copies the entire table to every compute node.

Best for:

  • Small lookup or reference tables that are frequently joined to large fact tables.
  • Avoiding data movement during joins, since each node has local access.

Important considerations:

  • Should only be used for small tables, as replication adds memory and network overhead.
  • Inefficient and potentially harmful for large datasets.

INFO

There's an interesting edge-case in Yellowbrick when every table involved in a query is replicated. In this instance, the query execution is not parallelized across nodes. Instead, execution takes place on a single node only. For workloads involving small tables and requiring high query throughput, the use of all-replicated tables may increase query concurrency.

Data Distribution in Kubernetes Deployments

Data distribution is subtly different in Kubernetes deployments, where compute is separated from storage, compared to appliance deployments. In Kubernetes deployments the shard files are stored centrally in object storage and are copied to the local NVMe caches on each compute node at query execution time. Which compute node receives which shard is determined when the data is loaded into Yellowbrick in the case of a hash-distributed table, and is related to the number of nodes in the cluster. This means that if the cluster size changes after data is loaded, the data becomes mis-distributed with respect to the new cluster configuration, and extra overhead may be incurred during query. We recommend that data is always loaded and queried by clusters containing the same number of nodes.

Why Distribution Strategy Matters

Choosing the right distribution strategy impacts:

  • Query performance – Co-located joins are faster than distributed joins.
  • Workload balance – Poor distribution can lead to execution skew where one node does more work.
  • Storage efficiency – At-rest data skew can cause uneven disk usage and maintenance challenges.
  • Network overhead – Mismatched distribution keys result in runtime redistribution, which slows down execution.

In summary, thoughtful table distribution is essential for optimizing performance and resource utilization in Yellowbrick and other MPP systems. Always consider the size, join patterns, and cardinality of your data when selecting a distribution method.

Read the Distributing Data How-To Guide for practical guidance and examples.