Distributing Data

When you create a table, define its data distribution scheme, which determines how the rows are distributed across the worker nodes. Good data distribution is critical for optimizing parallel processing of queries and other database operations.

Within a CREATE TABLE statement, you can define a single-column distribution key and distribute the data by hashing on the values in that column. Alternatively, you can set the distribution to RANDOM or REPLICATE.

DISTRIBUTE ON (column)

Hash distribution across the worker nodes based on the values in the specified column. This option is recommended for most tables. For example:

create table team
(teamid smallint, htid smallint, atid smallint, name varchar(30), nickname varchar(20), city varchar(20), stadium varchar(50), capacity int) 
distribute on (teamid);

DISTRIBUTE REPLICATE

Replication of the entire table to every worker node (blade). This option is intended for smaller tables. For example:

create table season
(seasonid smallint, season_name character(9), numteams smallint, winners varchar(30)) 
distribute replicate;

DISTRIBUTE RANDOM

Random distribution of table rows.
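
For symmetry with the other two options, a minimal sketch of a randomly distributed table follows. The table name matchlog and its columns are hypothetical and only illustrate the clause placement.

```sql
-- Hypothetical table: rows are spread across the worker nodes
-- without regard to the values in any column.
create table matchlog
(logid bigint, teamid smallint, note varchar(100))
distribute random;
```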

If you do not specify the DISTRIBUTE clause, the table is hash-distributed on the first named column in the table.
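
As an illustration of this default, the following sketch omits the DISTRIBUTE clause. The table name season_default is hypothetical; by the rule stated above, it would be hash-distributed on seasonid, its first named column.

```sql
-- No DISTRIBUTE clause: expected to behave like "distribute on (seasonid)",
-- because seasonid is the first named column.
create table season_default
(seasonid smallint, season_name character(9), numteams smallint, winners varchar(30));
```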

Data Distribution for CREATE TABLE AS Results

If you do not specify a distribution type for a CTAS table, the resulting data distribution depends on the nature of the query that creates the table.

A table created from columns in one or more replicated tables is also replicated.
A table created from a single hash-distributed table is typically hash-distributed on the same column when the distribution column is included in the select list. If the distribution column is not included in the select list, random distribution is used.
Tables resulting from equality joins over distribution columns typically preserve the hash distribution.
In general, hash distribution is preserved where possible, but tables created from complex joins or that contain complex select list expressions may produce randomly distributed result sets.
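
The following sketch contrasts an explicit distribution with the inferred one. It assumes that the DISTRIBUTE clause is written after the query in a CTAS statement (confirm the exact placement in the CREATE TABLE AS reference); the target table names are hypothetical, and team is the table defined earlier on this page.

```sql
-- Explicit distribution: the result is hash-distributed on teamid,
-- whatever the query looks like.
create table team_by_id as
select teamid, name, city from team
distribute on (teamid);

-- No DISTRIBUTE clause, and the source distribution column (teamid)
-- is not in the select list: by the rules above, expect random distribution.
create table team_cities as
select name, city from team;
```

Stating the distribution explicitly removes the dependence on how the query happens to be written.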

Data Skew with NULLs and NaNs in Distribution Columns

When the distribution column of a table contains a significant number of NULL values or NaN values, the data may not be distributed evenly, and extreme data skew on the worker nodes may occur. In general, it is best to avoid distributing a table on a nullable column or on a floating-point column that is likely to contain a large number of NaN values.

For more details about NaN, see DOUBLE PRECISION.
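
Before choosing a distribution column, it can help to check how its values are spread. The queries below are a generic sketch in standard SQL rather than a product-specific skew report; they use the team table from the earlier example and treat htid as a hypothetical candidate column.

```sql
-- How many rows hold NULL in the candidate column?
-- count(htid) skips NULLs, so the difference is the NULL count.
select count(*) as total_rows,
       count(*) - count(htid) as null_rows
from team;

-- Which values account for the most rows?
-- A few dominant values (or a large NULL group) suggest likely skew.
select htid, count(*) as row_count
from team
group by htid
order by row_count desc
limit 10;
```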
