Data Distribution Best Practices

Introduction

This document provides a guide to designing effective table distributions in Yellowbrick. By following these examples, you can address common challenges such as data skew and workload imbalances, optimizing query execution and resource utilization. The focus is on aligning table distribution with workload patterns to reduce runtime data redistribution and enhance performance.

Read Data Distribution Concepts to understand why data distribution matters in Yellowbrick.

Before You Begin

Effective planning is essential when designing table distributions in Yellowbrick. Before selecting a distribution key, it is important to address common misconceptions and evaluate workload characteristics.

INFO

If a table is already loaded in Yellowbrick and requires design optimization, use tools like YbEasyCli to query table and column metadata. These tools provide insights into key attributes, including distribution patterns and column characteristics, to help refine table distribution strategies.

Avoid Common Misconceptions

Proper understanding of table distribution in Yellowbrick is critical to avoid common misconceptions that can result in suboptimal performance. The following sections address two key misunderstandings:

Misconception 1: Even Distribution of Data Is the Most Important Criterion

SQL workload characteristics determine the effectiveness of a distribution strategy, not just the uniformity of data distribution.

| Characteristic | Description |
| --- | --- |
| Common Join and Grouping Columns | Distribution should align with frequently joined or grouped columns to minimize redistribution. |
| Cardinality of Key Columns | High-cardinality columns are preferred to ensure balanced data distribution. |
| Filtering Predicates | Predicates in queries should align with the distribution key to improve data locality. |

Uniformly distributed data alone does not guarantee optimal performance. Redistribution during query execution may negate the benefits of uniform storage if the distribution scheme does not align with workload patterns.

Misconception 2: Distribution Only Matters for Large Tables

Small tables are often overlooked during table design, but their distribution can significantly affect join order and runtime redistribution in execution plans.

Distribution should be explicitly set for both large and small tables.

Evaluate Your Workload

When determining a table’s distribution, the goal is to achieve balanced data and workload distribution across compute nodes while minimizing data redistribution during query execution. A well-planned distribution enhances both performance and resource utilization.

To optimize table distribution, focus on the following principles:

| Principle | Description |
| --- | --- |
| Avoid Execution Skew | Avoid situations causing "execution skew," where some compute nodes process disproportionately more data than others. A query step is only as fast as the slowest compute node executing it. |
| Avoid At-Rest Data Skew | Prevent "at-rest data skew," where one or more compute nodes store significantly more data than others. In an MPP system, usable storage is limited by the compute node with the least available space. |
| Minimize Runtime Redistribution | Reduce movement of data between compute nodes during statement execution to minimize I/O and network costs. A poorly chosen at-rest distribution is the most common cause of runtime redistribution. |
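One way to check the last principle is to inspect a query plan before committing to a key. A minimal sketch, using hypothetical orders and customers tables; redistribution steps in the EXPLAIN output (their exact labels vary by version) indicate runtime data movement that a better distribution could avoid:

```sql
-- Hypothetical fact table (orders) joined to a dimension (customers).
-- Redistribution steps in the plan reveal data moving between
-- compute nodes at runtime.
EXPLAIN
SELECT c.region, SUM(o.total) AS region_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```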

Gather the following details about your tables to design an effective distribution strategy:

| Consideration | Guidance |
| --- | --- |
| Table Type | Determine whether the table is a fact table or a dimension table. |
| Size | Assess the number of rows and the uncompressed size of the table to gauge its impact on storage and performance. |
| For Fact Tables | 1. Identify frequent join columns, focusing on those commonly used in joins with other fact tables or large dimension tables. 2. Focus on primary-key and foreign-key relationships in the data. 3. Evaluate the cardinality and nullability of key columns to avoid skew caused by low cardinality or NULL values. For a detailed discussion, refer to the section Distributing Fact Tables with Common Join Columns. |

Distributing Small and Frequently Used Dimension Tables with DISTRIBUTE REPLICATE

The DISTRIBUTE REPLICATE method, in the use cases below, ensures that all compute nodes maintain a complete copy of the table, eliminating the need for redistributions during query execution.

Use Cases

| Use Case | Description |
| --- | --- |
| Small Dimension Tables | Suitable for tables under 1 GB (uncompressed) that are used in joins or GROUP BY operations. |
| Larger Dimension Tables | Applicable for tables larger than 1 GB (uncompressed) that are frequently joined on different columns. Use cautiously for larger tables, as replication increases storage requirements and may impact scalability. |
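As a minimal sketch, a replicated dimension table might be declared as follows. The table and columns are hypothetical; the point is the DISTRIBUTE REPLICATE clause:

```sql
-- Hypothetical small dimension table (< 1 GB uncompressed).
-- Every compute node stores a full copy, so joins and GROUP BY
-- operations against it never trigger redistribution.
CREATE TABLE dim_region (
    region_id   INT NOT NULL,
    region_name VARCHAR(100)
)
DISTRIBUTE REPLICATE;
```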

Advantages

  • Minimizes Data Redistribution: DISTRIBUTE REPLICATE eliminates the need for data movement across compute nodes, reducing query overhead by ensuring all nodes have a complete copy of the table.

  • Efficient for Filtering and Non-Filtering Joins: By avoiding runtime redistributions, replicated tables enable low-cost execution of both filtering joins and non-filtering joins, as all required data is already present on each node.

Disadvantages

  • Increased Storage Requirements: Each compute node must store a full copy of the table, potentially impacting system scalability, especially for larger datasets.

Distributing Fact Tables with Common Join Columns

For fact tables that are frequently joined to other fact or large dimension tables, selecting the appropriate distribution column is critical for achieving optimal performance.

Below are guidelines to help determine the best approach for your scenario:

| Guideline | Description |
| --- | --- |
| Focus on Common Joins | Select the column most frequently used in joins with other fact or dimension tables. Tip: identify primary-key and foreign-key relationships in your source data. |
| Choose Based on the Filtered Data Set | Select the distribution column considering the size of the data set after filtering, rather than the total size of the table. Example: if a large dimension table is heavily filtered before a join, focus on the rows that remain after filtering. |
| Use High-Cardinality Columns | Choose a distribution column with high cardinality in the filtered result set. Example: a sales table distributed on a date column may seem balanced, but frequent filtering for narrow date ranges can cause skew. |
| Consider Surrogate Keys for Multi-Column Joins | For multi-column joins, consider a surrogate (synthetic) distribution key that captures join patterns effectively. Example: if low-cardinality columns like customer_id and product_id are part of the join, a synthetic key combining these columns can improve distribution balance. |
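Following these guidelines, a fact table distributed on its most common join column could be sketched as below (table and columns are hypothetical):

```sql
-- Hypothetical fact table distributed on its most common join key.
-- Tables distributed on the same key store matching rows on the
-- same compute node, so the join runs without data movement.
CREATE TABLE sales_fact (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,   -- high-cardinality join column
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTRIBUTE ON (customer_id);
```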

Distributing Fact Tables with Common Grouping or Partition Columns

In scenarios where a fact table is not frequently joined with other fact or large dimension tables but is heavily used in GROUP BY queries or partitioned window functions, distributing the table on common grouping or partition columns can enhance performance significantly.

The table below demonstrates two possible approaches for this distribution type:

| Consideration | Guidance |
| --- | --- |
| Choose the Most Common Grouping or Partition Column | Select the column most frequently used in GROUP BY or window-function partitioning. Ensure the chosen column has high cardinality to avoid data skew. |
| Use a Surrogate (Synthetic) Key for Multi-Column Grouping | Create a synthetic key for tables requiring grouping or partitioning on multiple columns. Combining the columns mitigates the low cardinality and high skew of each individual column. Example (assumes region_id values stay below 1000, so keys remain unique): `ALTER TABLE sales ADD COLUMN customer_region_key BIGINT; UPDATE sales SET customer_region_key = customer_id * 1000 + region_id;` |
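A minimal sketch of the first approach, with a hypothetical events table: distributing on the common grouping column lets each node aggregate its own rows locally.

```sql
-- Hypothetical table distributed on its most common grouping column.
CREATE TABLE events (
    event_id   BIGINT NOT NULL,
    session_id BIGINT NOT NULL,   -- high-cardinality grouping column
    event_ts   TIMESTAMP
)
DISTRIBUTE ON (session_id);

-- Grouping on the distribution column aggregates locally on each
-- node; no rows move between nodes during execution.
SELECT session_id, COUNT(*) AS event_count
FROM events
GROUP BY session_id;
```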

Using DISTRIBUTE RANDOM as a Last Resort

DISTRIBUTE RANDOM is a fallback option for tables that are neither commonly joined nor aggregated on a specific column. This method evenly distributes rows across all compute nodes without considering logical groupings.

When to Use

| Use Case | Description |
| --- | --- |
| Uncommon Use Cases | Applicable for tables that lack a suitable column for DISTRIBUTE ON and are not part of frequent joins or aggregations. |
| Last Resort | Use only if distributing on any column would lead to high skew. |
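As a sketch (the staging table is hypothetical), the clause itself is simple; rows are spread evenly across compute nodes with no logical placement:

```sql
-- Hypothetical staging table with no usable distribution column.
-- Rows are spread evenly across compute nodes, so storage stays
-- balanced, but any join or GROUP BY redistributes at runtime.
CREATE TABLE raw_staging (
    line_no BIGINT,
    payload VARCHAR(60000)
)
DISTRIBUTE RANDOM;
```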

Alternative: Using Surrogate (Synthetic) Keys

Before resorting to DISTRIBUTE RANDOM, consider creating a surrogate (synthetic) key for distribution. This approach can better align with query patterns while avoiding skew by combining multiple columns or transforming data to achieve a balanced distribution.
Example:

```sql
ALTER TABLE example_table ADD COLUMN synthetic_key BIGINT;
-- Assumes col2 values stay below 1000 so combined keys are unique.
UPDATE example_table SET synthetic_key = col1 * 1000 + col2;
```

Key Takeaways

  • Use DISTRIBUTE REPLICATE strategically for dimension tables with data size less than 1 GB.
  • Distribute tables strategically based on common join, grouping, or partitioning patterns to minimize data movement and optimize query execution. For multi-column groupings, synthetic keys are often the most effective choice for achieving even distribution.
  • DISTRIBUTE RANDOM should be avoided. Instead, using surrogate keys may align better with query patterns and help avoid skew.

Solutions to Common DISTRIBUTE ON Problems

Here are practical solutions to common challenges faced when using DISTRIBUTE ON.

Materialize Joins and Grouping Expressions on the Distribution Column

Joins and groupings on expressions always force a redistribution, even when the expression is applied to the distribution column and the rows would land on the node they already occupy. Execution is costly both because of the redistribution itself and because the query planner cannot accurately estimate the cardinality of the expression.

Example Query

```sql
-- ! WARNING !
-- Avoid this query structure: it forces a full
-- redistribution of both tables during execution.
SELECT *
FROM t1
JOIN t2
  ON UPPER(t1.c1) = UPPER(t2.c1);
```

Possible Issues

| Issue | Description |
| --- | --- |
| Data Redistribution | Rows are unnecessarily moved between compute nodes. |
| Query Planner Limitations | The query planner cannot estimate the cardinality of the UPPER() expression, resulting in suboptimal execution plans. |

Solution

If a join or grouping on an expression is common, the preferred approach is to materialize the expression into a new column and distribute the table using that column.

Steps For Small Tables

| Step | Description | SQL |
| --- | --- | --- |
| 1 | Rename the original table to retain the data. | `ALTER TABLE t1 RENAME TO t1_old;` |
| 2 | Create a new table with the materialized expression. | `CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old;` |
| 3 | Use the materialized column in joins and groupings. | `SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc;` |

Steps For Large Tables

| Step | Description | SQL |
| --- | --- | --- |
| 1 | Rename the original table to retain the data. | `ALTER TABLE t1 RENAME TO t1_old;` |
| 2 | Create a new, empty table with the materialized column. | `CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old WITH NO DATA;` |
| 3 | Insert data into the new table, in chunks for very large tables (for example, one date range per INSERT). | `INSERT INTO t1 SELECT *, UPPER(c1) AS c1_uc FROM t1_old;` |
| 4 | Use the materialized column in joins and groupings. | `SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc;` |
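The steps above materialize the column; for the join to avoid redistribution, the new table should also be distributed on it. A sketch of the create step with an explicit distribution follows; the placement of the DISTRIBUTE clause in CREATE TABLE ... AS is an assumption to verify against your product version:

```sql
-- Sketch: materialize UPPER(c1) and distribute on it, so the
-- subsequent join on c1_uc needs no runtime redistribution.
-- Clause placement is an assumption; check your version's syntax.
CREATE TABLE t1 AS (
    SELECT *, UPPER(c1) AS c1_uc
    FROM t1_old
) DISTRIBUTE ON (c1_uc);
```

Both sides of the join must carry the key, so t2 would be rebuilt the same way.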

Use a Surrogate Distribution Key for Multi-Column Distribution

For tables with frequent multi-column joins, groupings, or aggregations, using a surrogate distribution key can improve performance by reducing skew and optimizing data placement. This approach is particularly useful when individual columns have low cardinality or high skew but their combination results in a better distribution.

Example Queries

| Query Type | Example |
| --- | --- |
| Aggregation | `SELECT prod_id, loc_id, SUM(price) AS price FROM line_items GROUP BY 1, 2 ORDER BY 1, 2 LIMIT 10;` |
| Join | `SELECT prod_id, loc_id, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_id, loc_id) GROUP BY 1, 2;` |

Possible Issues

| Issue | Description |
| --- | --- |
| Low Cardinality or High Skew | Individual columns like prod_id and loc_id may result in uneven data distribution across compute nodes. |
| Complex Distribution for Multi-Column Joins | No single column captures the (prod_id, loc_id) join pattern, so joins and aggregations on the pair are inefficient: they either redistribute at runtime or inherit the skew of whichever column was chosen. |

Solution

Create a surrogate distribution key by combining columns into a composite key.

Steps

| Step | Action | SQL |
| --- | --- | --- |
| 1 | Rename the original table to retain the data. | `ALTER TABLE line_items RENAME TO line_items_old;` |
| 2 | Create a new table with the surrogate key (2147483648 = 2^31; the explicit BIGINT arithmetic keeps the key exact rather than a floating-point result). | `CREATE TABLE line_items AS SELECT *, (prod_id::BIGINT * 2147483648 + loc_id) AS prod_loc FROM line_items_old;` |
| 3 | Update aggregation queries to use the surrogate key. | `SELECT prod_loc, SUM(price) AS price FROM line_items GROUP BY prod_loc ORDER BY prod_loc LIMIT 10;` |
| 4 | Update join queries to use the surrogate key (top_prods must carry the same prod_loc column). | `SELECT prod_loc, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_loc) GROUP BY prod_loc;` |
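Both sides of the join need the same key, so top_prods is rebuilt the same way. A sketch with an explicit distribution on the surrogate key; the DISTRIBUTE clause placement in CREATE TABLE ... AS is an assumption to verify against your version:

```sql
-- Sketch: rebuild top_prods around the same surrogate key so the
-- join with line_items is co-located on every compute node.
-- 2147483648 = 2^31, computed in exact BIGINT arithmetic.
CREATE TABLE top_prods_new AS (
    SELECT *, (prod_id::BIGINT * 2147483648 + loc_id) AS prod_loc
    FROM top_prods
) DISTRIBUTE ON (prod_loc);
```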

Benefits of Surrogate Keys

| Benefit | Description |
| --- | --- |
| Reduced Skew | Combines columns to distribute data more evenly across compute nodes. |
| Optimized Query Performance | Avoids redistributions during joins and groupings. |
| Simplified Distribution Design | Provides a single key that aligns with query patterns, simplifying data management and optimization. |

Use “Fake” Values for High NULL Skew

Columns with a high percentage of NULL values can lead to significant at-rest data skew when used as distribution keys. This imbalance occurs because NULL values are treated as a single value, causing a large portion of rows to be stored on a single compute node.

Example Query

```sql
SELECT *
FROM line_items li
JOIN location l
  ON li.loc_id = l.loc_id;
```

Possible Issues

| Issue | Description |
| --- | --- |
| High NULL Skew | Columns with many NULL values cause uneven data distribution: every NULL hashes to the same bucket. |
| At-Rest Skew | Rows with NULL values are disproportionately stored on a single compute node, leading to storage imbalance. |

Solution

Replace NULL values with unique "fake" values so the affected rows hash to different compute nodes.

Steps

| Step | Description | Example |
| --- | --- | --- |
| 1 | Choose "fake" values that cannot match any key in the related table, and make them distinct per row; a single constant (such as -999999 everywhere) only moves the skew from NULL to that constant. | Example values: negated surrogate keys, negative timestamps. |
| 2 | Replace the NULL values with the per-row fake values. | `UPDATE line_items SET loc_id = -line_item_id WHERE loc_id IS NULL;` (line_item_id is a hypothetical unique column of the table) |

Benefits of Using “Fake” Values

| Benefit | Description |
| --- | --- |
| Eliminates At-Rest Skew | Distributes the formerly NULL rows evenly across compute nodes. |
| Maintains Query Integrity | Queries that exclude NULL values remain accurate without additional adjustments, because the fake values never match real keys. |
| Avoids Runtime Skew | Prevents performance issues during execution caused by unbalanced data distribution. |
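Because the placeholders are negative and real keys are not, inner joins behave exactly as they did with NULLs, which never satisfy equality predicates. A sketch, with loc_name as a hypothetical dimension column:

```sql
-- Placeholder (negative) keys match nothing in location, so this
-- inner join returns the same rows as when loc_id was NULL.
SELECT li.*, l.loc_name
FROM line_items li
JOIN location l ON li.loc_id = l.loc_id;

-- The formerly-NULL rows remain easy to find explicitly.
SELECT COUNT(*) FROM line_items WHERE loc_id < 0;
```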

Summary

Effective table distribution design is a cornerstone of performance optimization in Yellowbrick’s MPP system. By adhering to these best practices, users can address challenges such as data skew, balance storage and workload distribution, and ensure efficient query execution.

Key Takeaways

| Takeaway | Description |
| --- | --- |
| Analyze Workloads | Define distribution strategies based on common query patterns, join operations, and data characteristics. |
| Use High-Cardinality Keys | Minimize skew by selecting high-cardinality distribution keys or surrogate keys for multi-column joins. |
| Manage NULL Skew | Replace NULL values with unique placeholders to avoid skew where appropriate. |
| Apply DISTRIBUTE REPLICATE | Use for small dimension tables to eliminate redistributions. |
| Avoid DISTRIBUTE RANDOM | Use only when no suitable distribution key exists, as it can lead to inefficiencies in query execution. |

By designing table distributions that align with workload requirements and revisiting these strategies over time, users can achieve balanced workloads, reduced data movement, and sustained system performance.