Data Distribution Best Practices
Introduction
This document provides a guide to designing effective table distributions in Yellowbrick. By following these examples, you can address common challenges such as data skew and workload imbalances, optimizing query execution and resource utilization. The focus is on aligning table distribution with workload patterns to reduce runtime data redistribution and enhance performance.
Read Data Distribution Concepts to understand why data distribution matters in Yellowbrick.
Before You Begin
Effective planning is essential when designing table distributions in Yellowbrick. Before selecting a distribution key, it is important to address common misconceptions and evaluate workload characteristics.
INFO
If a table is already loaded in Yellowbrick and requires design optimization, use tools like YbEasyCli to query table and column metadata. These tools provide insights into key attributes, including distribution patterns and column characteristics, to help refine table distribution strategies.
Avoid Common Misconceptions
Proper understanding of table distribution in Yellowbrick is critical to avoid common misconceptions that can result in suboptimal performance. The following sections address two key misunderstandings:
Misconception 1: Even Distribution of Data Is the Most Important Criterion
SQL workload characteristics determine the effectiveness of a distribution strategy, not just the uniformity of data distribution.
| Characteristics | Description |
|---|---|
| Common Join and Grouping Columns | Distribution should align with frequently joined or grouped columns to minimize redistribution. |
| Cardinality of Key Columns | High cardinality columns are preferred to ensure balanced data distribution. |
| Filtering Predicates | Predicates in queries should align with the distribution key to improve data locality. |
Uniformly distributed data alone does not guarantee optimal performance. Redistribution during query execution may negate the benefits of uniform storage if the distribution scheme does not align with workload patterns.
Misconception 2: Distribution Only Matters for Large Tables
Small tables are often overlooked during table design, but in reality, distribution on small tables can have a significant performance impact on join order and redistribution in execution plans.
Distribution should be explicitly set for both large and small tables.
Evaluate Your Workload
When determining a table’s distribution, the goal is to achieve balanced data and workload distribution across compute nodes while minimizing data redistribution during query execution. A well-planned distribution enhances both performance and resource utilization.
To optimize table distribution, focus on the following principles:
| Principle | Description |
|---|---|
| Avoid Execution Skew | Avoid situations causing “execution skew,” where some compute nodes process disproportionately more data than others. A query step is only as fast as the slowest compute node executing it. |
| Avoid At-Rest Data Skew | Prevent “at-rest data skew,” where one or more compute nodes store significantly more data than others. In an MPP system, storage is limited by the compute node with the least available space. |
| Minimize Runtime Redistribution | Reduce movement of data between compute nodes during statement execution to minimize I/O and performance costs. At-rest distribution is the most common source of runtime redistribution. |
Gather the following details about your tables to design an effective distribution strategy:
| Consideration | Guidance |
|---|---|
| Table Type | Determine if the table is a fact table or a dimension table. |
| Size | Assess the number of rows and the uncompressed size of the table to gauge its impact on storage and performance. |
| For Fact Tables | 1. Identify frequent join columns, focusing on those commonly used in joins with other fact tables or large dimension tables. 2. Focus on primary and foreign key relationships in the data. 3. Evaluate the cardinality and nullability of key columns to avoid skew caused by low cardinality or NULL values. For a detailed discussion, refer to the sections on distributing fact tables below. |
Distributing Small and Frequently Used Dimension Tables with DISTRIBUTE REPLICATE
In the use cases below, the DISTRIBUTE REPLICATE method ensures that all compute nodes maintain a complete copy of the table, eliminating the need for redistribution during query execution.
Use Cases
| Use Case | Description |
|---|---|
| Small Dimension Tables | Suitable for tables under 1 GB (uncompressed) that are used in joins or GROUP BY operations. |
| Larger Dimension Tables | Applicable for tables larger than 1 GB (uncompressed) but frequently joined on different columns. This method should be used cautiously for larger tables, as it increases storage requirements and may impact scalability. |
Advantages
- Minimizes Data Redistribution: DISTRIBUTE REPLICATE eliminates the need for data movement across compute nodes, reducing query overhead by ensuring all nodes have a complete copy of the table.
- Efficient for Filtering and Non-Filtering Joins: By avoiding runtime redistributions, replicated tables enable low-cost execution of both filtering joins and non-filtering joins, as all required data is already present on each node.
Disadvantages
- Increased Storage Requirements: Each compute node must store a full copy of the table, potentially impacting system scalability, especially for larger datasets.
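The sketch below shows a minimal replicated dimension table definition; the table and column names are illustrative and not taken from this guide.

```sql
-- Minimal sketch: replicate a small, frequently joined dimension table
-- so every compute node keeps a full copy (illustrative names).
CREATE TABLE dim_region (
    region_id   INTEGER NOT NULL,
    region_name VARCHAR(100)
)
DISTRIBUTE REPLICATE;
```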
Distributing Fact Tables with Common Join Columns
For fact tables that are frequently joined to other fact or large dimension tables, selecting the appropriate distribution column is critical for achieving optimal performance.
Below are guidelines to help determine the best approach for your scenario:
| Guideline | Description |
|---|---|
| Focus on Common Joins | Select the column most frequently used in joins with other fact or dimension tables. Tip: Identify primary key and foreign key relationships in your source data. |
| Choose Based on the Filtered Data Set | Select the distribution column considering the size of the data set after filtering, rather than the total size of the table. Example: If a large dimension table is heavily filtered before a join, focus on the rows that remain after filtering. |
| Use High Cardinality Columns | Choose a distribution column with high cardinality in the filtered result set. Example: A sales table distributed on a date column may seem balanced, but frequent filtering for narrow date ranges can cause skew. |
| Consider Surrogate Keys for Multi-Column Joins | For multi-column joins, consider a surrogate (synthetic) distribution key that captures join patterns effectively. Example: If low-cardinality columns like customer_id and product_id are part of the join, a synthetic key (e.g., combining these columns) can improve distribution balance. |
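As a sketch of these guidelines, the example below distributes a hypothetical sales fact table on its most common join column, assuming customer_id has high cardinality and appears in most joins.

```sql
-- Illustrative sketch: distribute a fact table on its most common,
-- high-cardinality join column (table and columns are hypothetical).
CREATE TABLE sales (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTRIBUTE ON (customer_id);
```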
Distributing Fact Tables with Common Grouping or Partition Columns
In scenarios where a fact table is not frequently joined with other fact or large dimension tables but is heavily used in GROUP BY queries or partitioned window functions, distributing the table on common grouping or partition columns can enhance performance significantly.
The table below demonstrates two possible approaches for this distribution type:
| Consideration | Guidance |
|---|---|
| Choose the Most Common Grouping or Partition Column | Select the column most frequently used in GROUP BY or window function partitioning. Ensure the chosen column has high cardinality to avoid data skew. |
| Use a Surrogate (Synthetic) Key for Multi-Column Grouping | Create a synthetic key for tables requiring grouping or partitioning on multiple columns. This approach helps mitigate issues of low cardinality and high skew in individual columns by combining them into a balanced distribution. Example: Combine customer_id and region_id into a synthetic key: ALTER TABLE sales ADD COLUMN customer_region_key BIGINT; UPDATE sales SET customer_region_key = customer_id * 1000 + region_id; |
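Note that ALTER TABLE adds the synthetic key column but does not change the table's distribution. To distribute on the new key, rebuild the table using the rename-and-recreate pattern shown later in this guide. The sketch below reuses the sales example above and assumes that CREATE TABLE ... AS accepts a DISTRIBUTE ON clause; verify the exact clause placement against the CREATE TABLE reference.

```sql
-- Hedged sketch: rebuild the table so it is distributed on the synthetic
-- grouping key (clause placement assumed; verify for your release).
ALTER TABLE sales RENAME TO sales_old;

CREATE TABLE sales
DISTRIBUTE ON (customer_region_key)
AS
SELECT *, customer_id * 1000 + region_id AS customer_region_key
FROM sales_old;
```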
Using DISTRIBUTE RANDOM as a Last Option
DISTRIBUTE RANDOM is a fallback option for tables that are neither commonly joined nor aggregated on a specific column. This method evenly distributes rows across all compute nodes without considering logical groupings.
When to Use
| DISTRIBUTE RANDOM | Description |
|---|---|
| Uncommon Use Cases | Applicable for tables that lack a suitable column for DISTRIBUTE ON and are not part of frequent joins or aggregations. |
| Last Resort | Use only if distributing on any column would lead to high skew. |
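When DISTRIBUTE RANDOM is genuinely the only option, the definition itself is simple; the staging table and columns below are hypothetical.

```sql
-- Illustrative sketch: round-robin distribution for a table with no
-- reasonable distribution key (hypothetical table and columns).
CREATE TABLE raw_events (
    event_payload VARCHAR(4000),
    loaded_at     TIMESTAMP
)
DISTRIBUTE RANDOM;
```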
Alternative: Using Surrogate (Synthetic) Keys
Before resorting to DISTRIBUTE RANDOM, consider creating a surrogate (synthetic) key for distribution. This approach can better align with query patterns while avoiding skew by combining multiple columns or transforming data to achieve a balanced distribution.
Example:
```sql
ALTER TABLE example_table ADD COLUMN synthetic_key BIGINT;
UPDATE example_table SET synthetic_key = col1 * 1000 + col2;
```

Key Takeaways
- Use DISTRIBUTE REPLICATE strategically for dimension tables smaller than 1 GB (uncompressed).
- Distribute tables based on common join, grouping, or partitioning patterns to minimize data movement and optimize query execution. For multi-column groupings, synthetic keys are often the most effective choice for achieving even distribution.
- Avoid DISTRIBUTE RANDOM. Instead, surrogate keys may align better with query patterns and help avoid skew.
Solutions to Common DISTRIBUTE ON Problems
Here are practical solutions to common challenges faced when using DISTRIBUTE ON.
Materialize Joins and Grouping Expressions on the Distribution Column
Joins and groupings on expressions always result in a redistribution, even when the expression is applied to the distribution column and the rows would land on the same node. Execution becomes costly because of both the data redistribution itself and the query planner's inability to accurately estimate the cardinality of the expression.
Example Query
```sql
-- ! WARNING !
-- Avoid this query structure as it forces a
-- full redistribution of both tables during execution.
SELECT *
FROM t1
JOIN t2
  ON UPPER(t1.c1) = UPPER(t2.c1);
```

Possible Issues
| Issues | Description |
|---|---|
| Data Redistribution | Rows are unnecessarily moved between compute nodes. |
| Query Planner Limitations | The query planner cannot estimate the cardinality of the UPPER() expression, resulting in suboptimal execution plans. |
Solution
If a join or grouping on an expression is common, the preferred approach is to materialize the expression into a new column and distribute the table using that column.
Steps For Small Tables
| Step | Description | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data. | ALTER TABLE t1 RENAME TO t1_old; |
| 2 | Create a new table with the materialized expression. | CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old; |
| 3 | Use the materialized column in joins and groupings. | SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc; |
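The CTAS in step 2 is shown without an explicit distribution clause. To actually distribute the new table on the materialized column, a sketch along these lines applies, assuming CREATE TABLE ... AS accepts a DISTRIBUTE ON clause (verify the exact placement against the CREATE TABLE reference):

```sql
-- Hedged sketch: materialize the expression and distribute on it.
ALTER TABLE t1 RENAME TO t1_old;

CREATE TABLE t1
DISTRIBUTE ON (c1_uc)
AS
SELECT *, UPPER(c1) AS c1_uc
FROM t1_old;

-- Joins on the materialized column no longer force a redistribution,
-- provided t2 is also distributed on its c1_uc column.
SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc;
```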
Steps For Large Tables
| Step | Description | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data. | ALTER TABLE t1 RENAME TO t1_old; |
| 2 | Create a new table without data. | CREATE TABLE t1 AS SELECT *, UPPER(c1) AS c1_uc FROM t1_old WITH NO DATA; |
| 3 | Insert data into the new table in chunks. | INSERT INTO t1 SELECT *, UPPER(c1) AS c1_uc FROM t1_old; |
| 4 | Use the materialized column in joins and groupings. | SELECT * FROM t1 JOIN t2 ON t1.c1_uc = t2.c1_uc; |
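Step 3 describes loading in chunks but shows a single INSERT. A hedged sketch of chunking by a date column (sale_date is a hypothetical column on t1_old) could look like this:

```sql
-- Hedged sketch: copy rows range by range to keep each load small
-- (the chunking column is hypothetical).
INSERT INTO t1
SELECT *, UPPER(c1) AS c1_uc
FROM t1_old
WHERE sale_date >= DATE '2024-01-01'
  AND sale_date <  DATE '2024-04-01';

-- Repeat with the next date range until all rows are copied.
```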
Use a Surrogate Distribution Key for Multi-Column Distribution
For tables with frequent multi-column joins, groupings, or aggregations, using a surrogate distribution key can improve performance by reducing skew and optimizing data placement. This approach is particularly useful when individual columns have low cardinality or high skew but their combination results in a better distribution.
Example Queries
| Query Type | Example |
|---|---|
| Aggregation | SELECT prod_id, loc_id, SUM(price) AS price FROM line_items GROUP BY 1, 2 ORDER BY 1, 2 LIMIT 10; |
| Join | SELECT prod_id, loc_id, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_id, loc_id) GROUP BY 1, 2; |
Possible Issues
| Issue | Description |
|---|---|
| Low Cardinality or High Skew | Individual columns like prod_id and loc_id may result in uneven data distribution. |
| Complex Distribution for Multi-Column Joins | Low cardinality or high skew in individual columns (such as prod_id and loc_id from the examples above) can result in uneven data distribution across compute nodes. This imbalance affects the efficiency of joins and aggregations. |
Solution
Create a surrogate distribution key by combining columns into a composite key.
Steps
| Step | Action | SQL |
|---|---|---|
| 1 | Rename the original table to retain the data | ALTER TABLE line_items RENAME TO line_items_old; |
| 2 | Create a new table with the surrogate key | CREATE TABLE line_items AS SELECT *, (prod_id * 2^31 + loc_id) AS prod_loc FROM line_items_old; |
| 3 | Update queries to use the surrogate key for aggregation | SELECT prod_loc, SUM(price) AS price FROM line_items GROUP BY prod_loc ORDER BY prod_loc LIMIT 10; |
| 4 | Update queries to use the surrogate key for joins | SELECT prod_loc, SUM(units) AS units FROM line_items JOIN top_prods USING (prod_loc) GROUP BY prod_loc; |
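For the USING (prod_loc) join in step 4 to work, top_prods must carry the same surrogate key. A hedged sketch, reusing the rename-and-rebuild pattern and assuming CREATE TABLE ... AS accepts a DISTRIBUTE ON clause, is:

```sql
-- Hedged sketch: add the same surrogate key to the joined table and
-- distribute both tables on it so the join stays co-located.
ALTER TABLE top_prods RENAME TO top_prods_old;

CREATE TABLE top_prods
DISTRIBUTE ON (prod_loc)
AS
SELECT *, (prod_id * 2^31 + loc_id) AS prod_loc
FROM top_prods_old;
```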
Benefits of Surrogate Keys
| Benefit | Description |
|---|---|
| Reduced Skew | Combines columns to distribute data more evenly across compute nodes. |
| Optimized Query Performance | Avoids redistributions during joins and groupings. |
| Simplified Distribution Design | Provides a single key that aligns with query patterns, simplifying data management and optimization. |
Use “Fake” Values for High NULL Skew
Columns with a high percentage of NULL values can lead to significant at-rest data skew when used as distribution keys. This imbalance occurs because NULL values are treated as a single value, causing a large portion of rows to be stored on a single compute node.
Example Query
```sql
SELECT *
FROM line_items li
JOIN location l
  ON li.loc_id = l.loc_id;
```

Possible Issues
| Issue | Description |
|---|---|
| High NULL Skew | Columns with many NULL values cause uneven data distribution, with most rows stored on one compute node. |
| At-Rest Skew | Rows with NULL values are disproportionately assigned to a single compute node, leading to imbalance. |
Solution
Replace NULL values with a unique “fake” value to distribute data evenly.
Steps
| Step | Description | Example |
|---|---|---|
| 1 | Identify a unique “fake” value that does not match any existing data in the related table. | Example values: -999999, negative timestamps, or constants. |
| 2 | Replace NULL values in the column with the fake value. | UPDATE line_items SET loc_id = -999999 WHERE loc_id IS NULL; |
Benefits of Using “Fake” Values
| Benefit | Description |
|---|---|
| Eliminates At-Rest Skew | Distributes rows with NULL values evenly across compute nodes. |
| Maintains Query Integrity | Ensures queries that exclude NULL values remain accurate without additional adjustments. |
| Avoids Runtime Skew | Prevents performance issues during execution caused by unbalanced data distribution. |
Summary
Effective table distribution design is a cornerstone of performance optimization in Yellowbrick’s MPP system. By adhering to these best practices, users can address challenges such as data skew, balance storage and workload distribution, and ensure efficient query execution.
Key Takeaways
| Takeaway | Description |
|---|---|
| Analyze Workloads | Define distribution strategies based on common query patterns, join operations, and data characteristics. |
| Use High-Cardinality Keys | Minimize skew by selecting high-cardinality distribution keys or surrogate keys for multi-column joins. |
| Manage NULL Skew | Replace NULL values with unique placeholders to avoid skew where appropriate. |
| Apply DISTRIBUTE REPLICATE | Use for small dimension tables to eliminate redistributions. |
| Avoid DISTRIBUTE RANDOM | Use only when no suitable distribution key exists, as it can lead to inefficiencies in query execution. |
By designing table distributions that align with workload requirements and revisiting these strategies over time, users can achieve balanced workloads, reduced data movement, and sustained system performance.