Data Distribution Best Practices
Introduction
This document provides a guide to designing effective table distributions in Yellowbrick. By following these examples, you can address common challenges such as data skew and workload imbalances, optimizing query execution and resource utilization. The focus is on aligning table distribution with workload patterns to reduce runtime data redistribution and enhance performance.
Read Data Distribution Concepts to understand why data distribution matters in Yellowbrick.
Before You Begin
Effective planning is essential when designing table distributions in Yellowbrick. Before selecting a distribution key, it is important to address common misconceptions and evaluate workload characteristics.
INFO
If a table is already loaded in Yellowbrick and requires design optimization, use tools like YbEasyCli to query table and column metadata. These tools provide insights into key attributes, including distribution patterns and column characteristics, to help refine table distribution strategies.
Avoid Common Misconceptions
Proper understanding of table distribution in Yellowbrick is critical to avoid common misconceptions that can result in suboptimal performance. The following sections address two key misunderstandings:
Misconception 1: Even Distribution of Data Is the Most Important Criterion
SQL workload characteristics determine the effectiveness of a distribution strategy, not just the uniformity of data distribution.
| Characteristics | Description |
|---|---|
| Common Join and Grouping Columns | Distribution should align with frequently joined or grouped columns to minimize redistribution. |
| Cardinality of Key Columns | High cardinality columns are preferred to ensure balanced data distribution. |
| Filtering Predicates | Predicates in queries should align with the distribution key to improve data locality. |
Uniformly distributed data alone does not guarantee optimal performance. Redistribution during query execution may negate the benefits of uniform storage if the distribution scheme does not align with workload patterns.
Misconception 2: Distribution Only Matters for Large Tables
Small tables are often overlooked during table design, but in reality, distribution on small tables can have a significant performance impact on join order and redistribution in execution plans.
Distribution should be explicitly set for both large and small tables.
Evaluate Your Workload
When determining a table’s distribution, the goal is to achieve balanced data and workload distribution across compute nodes while minimizing data redistribution during query execution. A well-planned distribution enhances both performance and resource utilization.
To optimize table distribution, focus on the following principles:
| Principle | Description |
|---|---|
| Avoid Execution Skew | Avoid situations causing “execution skew,” where some compute nodes process disproportionately more data than others. A query step is only as fast as the slowest compute node executing it. |
| Avoid At-Rest Data Skew | Prevent “at-rest data skew,” where one or more compute nodes store significantly more data than others. In an MPP system, storage is limited by the compute node with the least available space. |
| Minimize Runtime Redistribution | Reduce movement of data between compute nodes during statement execution to minimize I/O and performance costs. At-rest distribution is the most common source of runtime redistribution. |
Gather the following details about your tables to design an effective distribution strategy:
| Consideration | Guidance |
|---|---|
| Table Type | Determine if the table is a fact table or a dimension table. |
| Size | Assess the number of rows and the uncompressed size of the table to gauge its impact on storage and performance. |
| For Fact Tables | 1. Identify frequent join columns, focusing on those commonly used in joins with other fact tables or large dimension tables. 2. Focus on primary and foreign key relationships in the data. 3. Evaluate the cardinality and nullability of key columns to avoid skew caused by low cardinality or NULL values. For a detailed discussion, refer to the sections on distributing fact tables below. |
Distributing Small and Frequently Used Dimension Tables with DISTRIBUTE REPLICATE
In the use cases below, the DISTRIBUTE REPLICATE method ensures that all compute nodes maintain a complete copy of the table, eliminating the need for redistribution during query execution.
Use Cases
| Use Case | Description |
|---|---|
| Small Dimension Tables | Suitable for tables under 1 GB (uncompressed) that are used in joins or GROUP BY operations. |
| Larger Dimension Tables | Applicable for tables larger than 1 GB (uncompressed) but frequently joined on different columns. This method should be used cautiously for larger tables, as it increases storage requirements and may impact scalability. |
Advantages
- Minimizes Data Redistribution: DISTRIBUTE REPLICATE eliminates the need for data movement across compute nodes, reducing query overhead by ensuring all nodes have a complete copy of the table.
- Efficient for Filtering and Non-Filtering Joins: By avoiding runtime redistributions, replicated tables enable low-cost execution of both filtering joins and non-filtering joins, as all required data is already present on each node.
Disadvantages
- Increased Storage Requirements: Each compute node must store a full copy of the table, potentially impacting system scalability, especially for larger datasets.
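The sketch below shows a minimal replicated dimension table definition; the table and column names are illustrative and not taken from this guide.

```sql
-- Minimal sketch: replicate a small, frequently joined dimension table
-- so every compute node keeps a full copy (illustrative names).
CREATE TABLE dim_region (
    region_id   INTEGER NOT NULL,
    region_name VARCHAR(100)
)
DISTRIBUTE REPLICATE;
```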
Distributing Fact Tables with Common Join Columns
For fact tables that are frequently joined to other fact or large dimension tables, selecting the appropriate distribution column is critical for achieving optimal performance.
Below are guidelines to help determine the best approach for your scenario:
| Guideline | Description |
|---|---|
| Focus on Common Joins | Select the column most frequently used in joins with other fact or dimension tables. Tip: Identify primary key and foreign key relationships in your source data. |
| Choose Based on the Filtered Data Set | Select the distribution column considering the size of the data set after filtering, rather than the total size of the table. Example: If a large dimension table is heavily filtered before a join, focus on the rows that remain after filtering. |
| Use High Cardinality Columns | Choose a distribution column with high cardinality in the filtered result set. Example: A sales table distributed on a date column may seem balanced, but frequent filtering for narrow date ranges can cause skew. |
| Consider Surrogate Keys for Multi-Column Joins | For multi-column joins, consider a surrogate (synthetic) distribution key that captures join patterns effectively. Example: If low-cardinality columns like customer_id and product_id are part of the join, a synthetic key (e.g., combining these columns) can improve distribution balance. |
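As a sketch of these guidelines, the example below distributes a hypothetical sales fact table on its most common join column, assuming customer_id has high cardinality and appears in most joins.

```sql
-- Illustrative sketch: distribute a fact table on its most common,
-- high-cardinality join column (table and columns are hypothetical).
CREATE TABLE sales (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTRIBUTE ON (customer_id);
```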
Distributing Fact Tables with Common Grouping or Partition Columns
In scenarios where a fact table is not frequently joined with other fact or large dimension tables but is heavily used in GROUP BY queries or partitioned window functions, distributing the table on common grouping or partition columns can enhance performance significantly.
The table below demonstrates two possible approaches for this distribution type:
| Consideration | Guidance |
|---|---|
| Choose the Most Common Grouping or Partition Column | Select the column most frequently used in GROUP BY or window function partitioning. Ensure the chosen column has high cardinality to avoid data skew. |
| Use a Surrogate (Synthetic) Key for Multi-Column Grouping | Create a synthetic key for tables requiring grouping or partitioning on multiple columns. This approach helps mitigate issues of low cardinality and high skew in individual columns by combining them into a balanced distribution. Example: Combine customer_id and region_id into a synthetic key: ALTER TABLE sales ADD COLUMN customer_region_key BIGINT; UPDATE sales SET customer_region_key = customer_id * 1000 + region_id; |
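Note that ALTER TABLE adds the synthetic key column but does not change the table's distribution. To distribute on the new key, rebuild the table using the rename-and-recreate pattern shown later in this guide. The sketch below reuses the sales example above and assumes that CREATE TABLE ... AS accepts a DISTRIBUTE ON clause; verify the exact clause placement against the CREATE TABLE reference.

```sql
-- Hedged sketch: rebuild the table so it is distributed on the synthetic
-- grouping key (clause placement assumed; verify for your release).
ALTER TABLE sales RENAME TO sales_old;

CREATE TABLE sales
DISTRIBUTE ON (customer_region_key)
AS
SELECT *, customer_id * 1000 + region_id AS customer_region_key
FROM sales_old;
```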
Using DISTRIBUTE RANDOM as a Last Option
DISTRIBUTE RANDOM is a fallback option for tables that are neither commonly joined nor aggregated on a specific column. This method evenly distributes rows across all compute nodes without considering logical groupings.
When to Use
| DISTRIBUTE RANDOM | Description |
|---|---|
| Uncommon Use Cases | Applicable for tables that lack a suitable column for DISTRIBUTE ON and are not part of frequent joins or aggregations. |
| Last Resort | Use only if distributing on any column would lead to high skew. |
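When DISTRIBUTE RANDOM is genuinely the only option, the definition itself is simple; the staging table and columns below are hypothetical.

```sql
-- Illustrative sketch: round-robin distribution for a table with no
-- reasonable distribution key (hypothetical table and columns).
CREATE TABLE raw_events (
    event_payload VARCHAR(4000),
    loaded_at     TIMESTAMP
)
DISTRIBUTE RANDOM;
```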
Alternative: Using Surrogate (Synthetic) Keys
Before resorting to DISTRIBUTE RANDOM, consider creating a surrogate (synthetic) key for distribution. This approach can better align with query patterns while avoiding skew by combining multiple columns or transforming data to achieve a balanced distribution.
Example:
```sql
ALTER TABLE example_table ADD COLUMN synthetic_key BIGINT;
UPDATE example_table SET synthetic_key = col1 * 1000 + col2;
```

Key Takeaways
- Use DISTRIBUTE REPLICATE strategically for dimension tables smaller than 1 GB (uncompressed).
- Distribute tables based on common join, grouping, or partitioning patterns to minimize data movement and optimize query execution. For multi-column groupings, synthetic keys are often the most effective choice for achieving even distribution.
- Avoid DISTRIBUTE RANDOM. Instead, surrogate keys may align better with query patterns and help avoid skew.
Solutions to Common DISTRIBUTE ON Problems
Here are practical solutions to common challenges faced when using DISTRIBUTE ON.
Materialize Joins and Grouping Expressions on the Distribution Column
Joins and groupings on expressions always result in a redistribution, even when the expression is applied to the distribution column and the rows would land on the same node. Execution becomes costly because of both the data redistribution itself and the query planner's inability to accurately estimate the cardinality of the expression.
Example Query
```sql
-- ! WARNING !
-- Avoid this query structure as it forces a
-- full redistribution of both tables during execution.
SELECT *
FROM t1
JOIN t2
  ON UPPER(t1.c1) = UPPER(t2.c1);
```

Possible Issues
| Issues | Description |
|---|---|
| Data Redistribution | Rows are unnecessarily moved between compute nodes. |
| Query Planner Limitations | The query planner cannot estimate the cardinality of the UPPER() expression, resulting in suboptimal execution plans. |
Solution
If a join or grouping on an expression is common, the preferred approach is to materialize the expression into a new column and distribute the table using that column.
Steps For Small Tables
| Step | Description | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data. | ALTER TABLE t1 RENAME TO t1_old; |
| 2 | Create a new table with the materialized expression. | CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old; |
| 3 | Use the materialized column in joins and groupings. | SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc; |
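The CTAS in step 2 is shown without an explicit distribution clause. To actually distribute the new table on the materialized column, a sketch along these lines applies, assuming CREATE TABLE ... AS accepts a DISTRIBUTE ON clause (verify the exact placement against the CREATE TABLE reference):

```sql
-- Hedged sketch: materialize the expression and distribute on it.
ALTER TABLE t1 RENAME TO t1_old;

CREATE TABLE t1
DISTRIBUTE ON (c1_uc)
AS
SELECT *, UPPER(c1) AS c1_uc
FROM t1_old;

-- Joins on the materialized column no longer force a redistribution,
-- provided t2 is also distributed on its c1_uc column.
SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc;
```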
Steps For Large Tables
| Step | Description | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data. | ALTER TABLE t1 RENAME TO t1_old; |
| 2 | Create a new table without data. | CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old WITH NO DATA; |
| 3 | Insert data into the new table in chunks. | INSERT INTO t1 SELECT *, UPPER(c1) AS c1_uc FROM t1_old; |
| 4 | Use the materialized column in joins and groupings. | SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc; |
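Step 3 describes loading in chunks but shows a single INSERT. A hedged sketch of chunking by a date column (sale_date is a hypothetical column on t1_old) could look like this:

```sql
-- Hedged sketch: copy rows range by range to keep each load small
-- (the chunking column is hypothetical).
INSERT INTO t1
SELECT *, UPPER(c1) AS c1_uc
FROM t1_old
WHERE sale_date >= DATE '2024-01-01'
  AND sale_date <  DATE '2024-04-01';

-- Repeat with the next date range until all rows are copied.
```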
Use a Surrogate Distribution Key for Multi-Column Distribution
For tables with frequent multi-column joins, groupings, or aggregations, using a surrogate distribution key can improve performance by reducing skew and optimizing data placement. This approach is particularly useful when individual columns have low cardinality or high skew but their combination results in a better distribution.
Example Queries
| Query Type | Example |
|---|---|
| Aggregation | SELECT prod_id, loc_id, SUM(price) AS price FROM line_items GROUP BY 1, 2 ORDER BY 1, 2 LIMIT 10; |
| Join | SELECT prod_id, loc_id, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_id, loc_id) GROUP BY 1, 2; |
Possible Issues
| Issue | Description |
|---|---|
| Low Cardinality or High Skew | Individual columns like prod_id and loc_id may result in uneven data distribution. |
| Complex Distribution for Multi-Column Joins | Low cardinality or high skew in individual columns (such as prod_id and loc_id from the examples above) can result in uneven data distribution across compute nodes. This imbalance affects the efficiency of joins and aggregations. |
Solution
Create a surrogate distribution key by combining columns into a composite key.
Steps
| Step | Action | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data | ALTER TABLE line_items RENAME TO line_items_old; |
| 2 | Create a new table with the surrogate key | CREATE TABLE line_items AS SELECT *, (prod_id * 2^31 + loc_id) AS prod_loc FROM line_items_old; |
| 3 | Update queries to use the surrogate key for aggregation | SELECT prod_loc, SUM(price) AS price FROM line_items GROUP BY prod_loc ORDER BY prod_loc LIMIT 10; |
| 4 | Update queries to use the surrogate key for joins | SELECT prod_loc, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_loc) GROUP BY prod_loc; |
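For the USING (prod_loc) join in step 4 to work, top_prods must carry the same surrogate key. A hedged sketch, reusing the rename-and-rebuild pattern and assuming CREATE TABLE ... AS accepts a DISTRIBUTE ON clause, is:

```sql
-- Hedged sketch: add the same surrogate key to the joined table and
-- distribute both tables on it so the join stays co-located.
ALTER TABLE top_prods RENAME TO top_prods_old;

CREATE TABLE top_prods
DISTRIBUTE ON (prod_loc)
AS
SELECT *, (prod_id * 2^31 + loc_id) AS prod_loc
FROM top_prods_old;
```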
Benefits of Surrogate Keys
| Benefit | Description |
|---|---|
| Reduced Skew | Combines columns to distribute data more evenly across compute nodes. |
| Optimized Query Performance | Avoids redistributions during joins and groupings. |
| Simplified Distribution Design | Provides a single key that aligns with query patterns, simplifying data management and optimization. |
Use “Fake” Values for High NULL Skew
Columns with a high percentage of NULL values can lead to significant at-rest data skew when used as distribution keys. This imbalance occurs because NULL values are treated as a single value, causing a large portion of rows to be stored on a single compute node.
Example Query
```sql
SELECT *
FROM line_items li
JOIN location l
  ON li.loc_id = l.loc_id;
```

Possible Issues
| Issue | Description |
|---|---|
| High NULL Skew | Columns with many NULL values cause uneven data distribution, with most rows stored on one compute node. |
| At-Rest Skew | Rows with NULL values are disproportionately assigned to a single compute node, leading to imbalance. |
Solution
Replace NULL values with a unique “fake” value to distribute data evenly.
Steps
| Step | Description | Example |
|---|---|---|
| 1 | Identify a unique “fake” value that does not match any existing data in the related table. | Example values: -999999, negative timestamps, or constants. |
| 2 | Replace NULL values in the column with the fake value. | UPDATE line_items SET loc_id = -999999 WHERE loc_id IS NULL; |
Benefits of Using “Fake” Values
| Benefit | Description |
|---|---|
| Eliminates At-Rest Skew | Distributes rows with NULL values evenly across compute nodes. |
| Maintains Query Integrity | Ensures queries that exclude NULL values remain accurate without additional adjustments. |
| Avoids Runtime Skew | Prevents performance issues during execution caused by unbalanced data distribution. |
Summary
Effective table distribution design is a cornerstone of performance optimization in Yellowbrick’s MPP system. By adhering to these best practices, users can address challenges such as data skew, balance storage and workload distribution, and ensure efficient query execution.
Key Takeaways
| Takeaway | Description |
|---|---|
| Analyze Workloads | Define distribution strategies based on common query patterns, join operations, and data characteristics. |
| Use High-Cardinality Keys | Minimize skew by selecting high-cardinality distribution keys or surrogate keys for multi-column joins. |
| Manage NULL Skew | Replace NULL values with unique placeholders to avoid skew where appropriate. |
| Apply DISTRIBUTE REPLICATE | Use for small dimension tables to eliminate redistributions. |
| Avoid DISTRIBUTE RANDOM | Use only when no suitable distribution key exists, as it can lead to inefficiencies in query execution. |
By designing table distributions that align with workload requirements and revisiting these strategies over time, users can achieve balanced workloads, reduced data movement, and sustained system performance.