# Ordering Data Efficiently
The purpose of this guide is to provide best-practices for ordering data in Yellowbrick tables. Efficient data ordering leads to higher query throughput, lower response times and better data compression.
See Data Ordering Concepts to understand the foundational ideas illustrated here.
## An Example Shard
Below is the DDL for a purchases fact table, along with a logical view of a portion of one of its shards. This example is used throughout most of the query samples in this document.
```sql
CREATE TABLE purchases (
prch_dt DATE
, geo_id INTEGER
, acct INTEGER
, subacct INTEGER
, prch_amt DECIMAL(9,2)
)
DISTRIBUTE ON ( acct )
SORT ON ( prch_dt )
;
```

This table is ordered within each shard on a single column (prch_dt) due to the SORT ON attribute. Since fact tables are often loaded in date order, the rows will likely also be naturally ordered by date across shards.
### Shard Data and Metadata Illustration
Below is a logical representation of a shard, including its blocks. The table shows shard-level information, such as the range of values (Min and Max) for each column, and block-level details, such as the number of rows and block-specific column data.
| | | prch_dt | geo_id | acct | subacct | prch_amt |
|---|---|---|---|---|---|---|
| **Shard 1** | | | | | | |
| Shard MD | Min:<br>Max: | 2020-01-01<br>2020-01-03 | 1<br>4 | 10000<br>99000 | 10010<br>99890 | 100.00<br>11234.56 |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-01<br>2020-01-01<br>1 | 2<br>4<br>2 | 11000<br>98000<br>3 | 11010<br>98190<br>4 | 121.45<br>4145.09<br>5 |
| Block Data | Rows: 1-32k | 2020-01-01, ..., 2020-01-01 | 2, 2, 2, 2, 3, ... | 12000, 68000, 98000, 34000, 11000, ... | 12200, 68400, 98190, 34220, 11010, ... | 340.00, 265.90, 1023.89, ... |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-01<br>2020-01-02<br>6 | 1<br>4<br>7 | 19000<br>97000<br>8 | 19030<br>97770<br>9 | 187.60<br>11001.23<br>10 |
| Block Data | Rows: 32k-64k | 2020-01-01, ..., 2020-01-02 | 4, 4, 1, 1, 2, ... | 20000, 97000, 56000, 22000, 97000, ... | 22340, 97600, 56010, 22000, 97070, ... | 231.45, 668.88, 11001.23, ... |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-02<br>2020-01-02<br>11 | 2<br>4<br>12 | 19000<br>97000<br>13 | 13030<br>97070<br>14 | 187.60<br>11001.23<br>15 |
| Block Data | Rows: 64k-96k | 2020-01-02, ..., 2020-01-02 | 2, 2, 2, 3, 3, ... | 20000, 97000, 56000, 22000, 97000, ... | 22340, 97600, 56010, 22000, 97000, ... | 231.45, 668.88, 11001.23, ... |
For brevity, consider the row range 1-32k to include rows 1 through 32,000, the range 32k-64k to include rows 32,001 through 64,000, and so on.
For example, consider the query:
```sql
SELECT prch_amt
FROM purchases
WHERE
prch_dt = '2020-01-01'
AND geo_id = 1;
```

Using the shard and block details from the example table above:
- **Shard Skipping**
  The shard cannot be skipped because the shard metadata indicates that `prch_dt = '2020-01-01'` falls within the shard's date range (2020-01-01 to 2020-01-03).
- **Block Skipping**
  Most blocks can be skipped. Block metadata reveals the following:
  - Only rows 1-64k have matching `prch_dt` values.
  - Within that range, only rows above 32k have matching `geo_id` values.

The resulting set includes only the blocks for the queried columns (prch_dt, geo_id, and prch_amt) for rows 32k-64k: specifically, blocks 6, 7, and 10.
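
For contrast, a predicate on a date outside the shard's range would allow the entire shard to be skipped from its metadata alone, without reading any of its blocks. A minimal sketch against the same purchases table (the date is chosen only because it lies outside this shard's range):

```sql
-- 2020-02-15 falls outside this shard's prch_dt range
-- (2020-01-01 to 2020-01-03), so the whole shard is skipped.
SELECT prch_amt
FROM purchases
WHERE prch_dt = '2020-02-15';
```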
## Correlated Columns
It is important to note that the data is well ordered within the shard based on common filtering predicates, starting with prch_dt followed by geo_id, even though the SORT ON clause only specifies prch_dt. In Yellowbrick, correlated columns refer to cases where the ordering of one column reflects or aligns with the ordering of another. Unlike indexes in traditional relational databases, which might involve creating an index on multiple columns, Yellowbrick utilizes these relationships by allowing data to be ordered on a single column. This approach also enhances scan performance and compression efficiency.
### Using a Calendar Table as an Example
A calendar table demonstrates how the same date can be represented in multiple formats, such as a native DATE, integers, and character strings. Below is a representation of such a table:
| col | dt | yyyymmdd | yyyymm | yyyy | m | dt_vc | dt_str |
|---|---|---|---|---|---|---|---|
| type | DATE | INT4 | INT4 | INT2 | INT2 | VARCHAR | VARCHAR |
| example | 2024-01-23 | 20240123 | 202401 | 2024 | 1 | '2024-01-23' | 'Jan 23, 2024' |
There are many kinds of relationships between columns that we can leverage.
| Relationship | Usage |
|---|---|
| 1:1, Same Ordering | For example, a calendar table where a date exists as a DATE, an INTEGER, and a text string. In this case, ordering by the smallest representation is recommended. A date stored as a DATE or INTEGER uses 4 bytes, fitting into a CPU register and allowing efficient numeric sorting. Sorting is significantly faster than using a 10-character string format (e.g., 2024-01-23). See the sketch after this table. |
| 1:1, Different Ordering | For example, a calendar table with a column containing the spelled-out name, such as January 1, 2024. Although there is a 1:1 relationship with the date, their sort orders differ. Sorting by the date provides better scan efficiency and improves compression. Dictionary-based algorithms (e.g., ZIP) effectively compress repeated strings like month names. |
| Parent-Child, Same Ordering | Ordering by the child column (e.g., yyyymmdd) ensures its parent columns (e.g., yyyymm and yyyy) are ordered naturally. |
| Parent-Child, Different Ordering | For example, columns like geo_name and country_name. When queries frequently filter on both columns, prioritize ordering by the lower-cardinality key (e.g., geography) first, followed by the higher-cardinality key (e.g., country). |
| Commonly Found Together | For example, product category and product subcategory. Ordering these columns together does not greatly assist with shard or block skipping but can significantly improve compression in large datasets. Ordering by the column with lower cardinality first (e.g., product category) and then by the column with higher cardinality (e.g., product subcategory) ensures that similar values are grouped together. This grouping allows compression algorithms to achieve better results by reducing redundancy. |
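
To illustrate the 1:1, Same Ordering case, a calendar dimension can be sorted on its smallest representation so that the wider, correlated columns are ordered as a side effect. This is a minimal sketch; the table name, column sizes, and distribution choice are illustrative assumptions, not part of the original example:

```sql
-- Sorting on the 4-byte dt column also orders the correlated
-- yyyymmdd, yyyymm, yyyy, and string representations of the date.
CREATE TABLE calendar (
  dt        DATE
, yyyymmdd  INTEGER
, yyyymm    INTEGER
, yyyy      SMALLINT
, m         SMALLINT
, dt_vc     VARCHAR(10)
, dt_str    VARCHAR(32)
)
DISTRIBUTE ON ( dt )   -- distribution choice is illustrative only
SORT ON ( dt )
;
```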
## Data Compression
Efficient data sorting improves compression. Most compression (or encoding) algorithms rely on or are significantly impacted by the order of the data, particularly when values are repeated in adjacent rows. Two of the most common compression methods are:
| Algorithm | Characteristics |
|---|---|
| Run-Length Encoding (RLE) | Optimized for consecutive repeated values. |
| Dictionary | Optimized for repeating values across a block of rows. |
Ordering data on multiple columns can provide benefits for both compression and scan performance.
Compression becomes increasingly effective as column size grows and cardinality decreases. For example, consider a block of 16k rows containing a 10-character string repeated 100 times in adjacent rows, occupying roughly 1,200 bytes uncompressed (about 12 bytes per value, including per-value overhead). With run-length encoding, those 100 adjacent values collapse to a single stored value plus a run count, shrinking the storage required for these rows in the compressed block to roughly 14 bytes.
## Table SORT ON and CLUSTER ON
Yellowbrick provides two order-related attributes that can be set during table creation: SORT ON and CLUSTER ON. These attributes are immutable, meaning any changes require dropping and recreating the table.
### SORT ON
The SORT ON attribute in Yellowbrick defines the order of data within shards, but not across shards, when written to disk. This applies during operations such as bulk loads, INSERT SELECT, CREATE TABLE AS SELECT (CTAS), and when shards are merged by Yellowbrick's storage garbage collector (GC).
INFO
Garbage Collector – In Yellowbrick, GC refers to automated shard maintenance processes such as rewriting shards with high numbers of deleted rows and merging small shards (less than 250MB). It is not related to memory garbage collection in programming languages like Java or C#.
#### Key Characteristics
| SORT ON | Scope |
|---|---|
| Shards | Works within shards, not across shards. Operations like cluster expansion, GC shard merging, and INSERT SELECT without an ORDER BY clause may not preserve ordering across shards. For example, an INSERT SELECT operation involving 10 shards per compute node over 50 days without ORDER BY date could result in up to 50 different dates in each shard. However, rows within each shard would still be ordered by date. |
| Correlated Columns | The benefits of SORT ON extend beyond the specified column. For example, sorting an invoices table by invoice_id often implies de facto ordering by invoice_date, due to column correlation (see the sketch after this table). |
| Performance | Can improve the performance of queries with filtering predicates and, in some cases, joins. Does not improve the performance of GROUP BY or ORDER BY operations. |
| Small Tables | Typically not worth applying to tables with fewer than ~50k rows per compute node. |
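
The Correlated Columns point above can be made concrete with a hedged sketch. The invoices table and its columns are assumptions introduced here purely for illustration:

```sql
-- invoice_id values are assigned in roughly chronological order,
-- so sorting on invoice_id also orders invoice_date within shards.
CREATE TABLE invoices (
  invoice_id   BIGINT
, invoice_date DATE
, acct         INTEGER
, total_amt    DECIMAL(12,2)
)
DISTRIBUTE ON ( acct )
SORT ON ( invoice_id )
;

-- A date-range filter still benefits from block skipping because
-- invoice_date is correlated with the sorted invoice_id column.
SELECT SUM(total_amt)
FROM invoices
WHERE invoice_date BETWEEN '2024-01-01' AND '2024-01-31';
```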
#### Limitations
| SORT ON | Limitations |
|---|---|
| Single Column | SORT ON can only be applied to one column. For multi-column ordering, consider creating a composite (synthetic) sort key, as sketched after this table. See the Data Distribution How-To Guide for more information. |
| Encryption | Cannot be applied to encrypted columns. |
| Immutability | Cannot be added to or changed on an existing table. You must drop and recreate the table to apply a new SORT ON. |
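
As a concrete illustration of the composite (synthetic) sort key mentioned under the Single Column limitation, one option is a derived integer that encodes both ordering columns. This is a hedged sketch only; the table, column, and expression are assumptions (it presumes geo_id is always below 100), and the Data Distribution How-To Guide remains the authoritative reference:

```sql
-- sort_key encodes yyyymmdd * 100 + geo_id, so sorting on it orders
-- rows by date first and then by geography within each date.
CREATE TABLE purchases_sorted (
  prch_dt   DATE
, geo_id    INTEGER
, acct      INTEGER
, subacct   INTEGER
, prch_amt  DECIMAL(9,2)
, sort_key  BIGINT
)
DISTRIBUTE ON ( acct )
SORT ON ( sort_key )
;

INSERT INTO purchases_sorted
SELECT prch_dt
     , geo_id
     , acct
     , subacct
     , prch_amt
     , TO_CHAR(prch_dt, 'YYYYMMDD')::BIGINT * 100 + geo_id
FROM purchases;
```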
### CLUSTER ON
CLUSTER ON is an alternative to SORT ON. It allows you to specify multiple columns but is not equivalent to a multi-column sort or cube. In most cases, it is not recommended and should only be used in highly specific scenarios.
#### Key Characteristics
| CLUSTER ON | Scope |
|---|---|
| Not a Multi-Column Sort | CLUSTER ON does not provide ordered sorting across columns. For cases requiring multi-column order, use a composite sort key instead. |
| Low Cardinality Keys | Effective only when applied to low cardinality columns of equal weight. |
| High Cost | It has significant performance overhead during data loads and INSERT operations. |
WARNING
In practice, CLUSTER ON is rarely beneficial and carries high costs. Unless you are certain that your use case justifies it, you should avoid using CLUSTER ON and instead explore alternative methods, such as composite sort keys, to achieve your goals.
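
For completeness, the DDL pattern follows the same shape as SORT ON. The following is a purely illustrative sketch, not a recommendation; the choice of prch_dt and geo_id here only demonstrates the syntax:

```sql
-- Illustrative only; see the warning above before using CLUSTER ON.
CREATE TABLE purchases_clustered (
  prch_dt   DATE
, geo_id    INTEGER
, acct      INTEGER
, subacct   INTEGER
, prch_amt  DECIMAL(9,2)
)
DISTRIBUTE ON ( acct )
CLUSTER ON ( prch_dt, geo_id )
;
```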
## Ordering Within vs Across Shards
Data ordering is an essential aspect of table design in Yellowbrick. The method used to load data—Bulk Load, Insert-Select, or Trickle Load—determines how data is ordered within and across shards, directly impacting query performance and compression efficiency. The following table compares the effects of these three loading methods on ordering:
| Method | Within Shards | Across Shards |
|---|---|---|
| Bulk Load | Data within shards is ordered according to the SORT ON column if defined. Without SORT ON, ingestion order is preserved. | No guaranteed ordering across shards due to parallelized data ingestion. |
| Insert-Select | An ORDER BY clause ensures data is ordered within shards before being written. Without it, the ingestion sequence determines the order. | Guarantees ordering across shards if an ORDER BY clause is included in the query. |
| Trickle Load | Data retains its streaming order unless reordered by a SORT ON column. | Incremental ingestion does not ensure ordering across shards due to the nature of real-time streaming. |
## Inherent Data Ordering
Data loaded into shards retains its incoming order unless explicitly reordered by the SORT ON column. However, due to the highly parallelized nature of ybload, this order may differ from that of the original source file(s).
Fact data is often loaded in a natural date order, either incrementally throughout the day or in batches at the end of the day. Queries on fact tables also typically include specific date range filters. As a result, it is common to define the SORT ON column as the most frequently used filtering predicate. This approach works well for tables with a small number of shards per compute node and for shards that are rarely updated. For larger tables, consider using composite sort keys and carefully designed ordering. See the Data Distribution How-To Guide for instructions on creating surrogate keys.
The SORT ON and CLUSTER ON clauses determine the order of data within individual shards but do not impose an order across shards. When data is streamed to a compute node, the stream is divided into 1 GB buffers. Once a buffer is full, it is sorted (if a SORT ON or CLUSTER ON clause is defined for the table) and written out as a shard. This behavior applies to most backend data appends, including bulk loads, \COPY, INSERT VALUES operations larger than 30 MB, and INSERT SELECT without an ORDER BY clause.
For cross-shard data ordering, use INSERT SELECT ORDER BY. This method sorts all data before streaming it to the shard writer, ensuring consistent ordering across shards.
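
A minimal sketch of this pattern, assuming the new rows are staged in a hypothetical purchases_stage table with the same columns:

```sql
-- Sorting the full result set before it reaches the shard writers
-- produces date ordering both within and across the resulting shards.
INSERT INTO purchases
SELECT prch_dt, geo_id, acct, subacct, prch_amt
FROM purchases_stage
ORDER BY prch_dt;
```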
Once data is written to shards, its order remains consistent except in the following scenarios:
| Case | Description |
|---|---|
| UPDATE | Updates are processed as a DELETE followed by an append. Rows are not modified in place; instead, updated rows are written to new shards. If the update size is small (less than 250 MB), these smaller shards will be merged with others during garbage collector operations. |
| DELETE | When shards have less than 250 MB of undeleted rows, the garbage collector merges them. During this process, data within the new shards is sorted. This typically occurs after large-scale DELETE or UPDATE operations. |
| Cluster Expansion | Changes to the cluster topology, such as replication to a new configuration, alter the distribution of rows across compute nodes. This inevitably changes the order of rows as well. |
| Replication / Restore | Backup and restore operations are highly parallelized, meaning rows may not be restored in the same order they originally existed. Additionally, restoring or replicating to a different topology (e.g., with a different number of compute nodes or chassis) guarantees changes in both the row distribution and their order. |