# Ordering Data Efficiently
The purpose of this guide is to provide best-practices for ordering data in Yellowbrick tables. Efficient data ordering leads to higher query throughput, lower response times and better data compression.
See Data Ordering Concepts to understand the foundational ideas illustrated here.
## An Example Shard
Below is the DDL for a purchases fact table, along with a logical view of a portion of one of its shards. This example is used throughout most of the query samples in this document.
```sql
CREATE TABLE purchases (
prch_dt DATE
, geo_id INTEGER
, acct INTEGER
, subacct INTEGER
, prch_amt DECIMAL(9,2)
)
DISTRIBUTE ON ( acct )
SORT ON ( prch_dt )
;
```

This table is ordered within each shard on a single column (prch_dt) due to the SORT ON attribute. Since fact tables are often loaded in date order, the rows will likely also be naturally ordered by date across shards.
### Shard Data and Metadata Illustration
Below is a logical representation of a shard, including its blocks. The table shows shard-level information, such as the range of values (Min and Max) for each column, and block-level details, such as the number of rows and block-specific column data.
| | | prch_dt | geo_id | acct | subacct | prch_amt |
|---|---|---|---|---|---|---|
| **Shard 1** | | | | | | |
| Shard MD | Min:<br>Max: | 2020-01-01<br>2020-01-03 | 1<br>4 | 10000<br>99000 | 10010<br>99890 | 100.00<br>11234.56 |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-01<br>2020-01-01<br>1 | 2<br>4<br>2 | 11000<br>98000<br>3 | 11010<br>98190<br>4 | 121.45<br>4145.09<br>5 |
| Block Data | Rows: 1-32k | 2020-01-01, ..., 2020-01-01 | 2, 2, 2, 2, 3, ... | 12000, 68000, 98000, 34000, 11000, ... | 12200, 68400, 98190, 34220, 11010, ... | 340.00, 265.90, 1023.89, ... |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-01<br>2020-01-02<br>6 | 1<br>4<br>7 | 19000<br>97000<br>8 | 19030<br>97770<br>9 | 187.60<br>11001.23<br>10 |
| Block Data | Rows: 32k-64k | 2020-01-01, ..., 2020-01-02 | 4, 4, 1, 1, 2, ... | 20000, 97000, 56000, 22000, 97000, ... | 22340, 97600, 56010, 22000, 97070, ... | 231.45, 668.88, 11001.23, ... |
| Block MD | Min:<br>Max:<br>Block: | 2020-01-02<br>2020-01-02<br>11 | 2<br>4<br>12 | 19000<br>97000<br>13 | 13030<br>97070<br>14 | 187.60<br>11001.23<br>15 |
| Block Data | Rows: 64k-96k | 2020-01-02, ..., 2020-01-02 | 2, 2, 2, 3, 3, ... | 20000, 97000, 56000, 22000, 97000, ... | 22340, 97600, 56010, 22000, 97000, ... | 231.45, 668.88, 11001.23, ... |
For brevity, consider the row range 1-32k to include rows 1 through 32,000, the range 32k-64k to include rows 32,001 through 64,000, and so on.
For example, consider the query:
```sql
SELECT prch_amt
FROM purchases
WHERE
prch_dt = '2020-01-01'
AND geo_id = 1;
```

Using the shard and block details from the example table above:
- **Shard Skipping**
  The shard cannot be skipped because the shard metadata indicates that `prch_dt = '2020-01-01'` falls within the shard's date range (2020-01-01 to 2020-01-03).
- **Block Skipping**
  Most blocks can be skipped. Block metadata reveals the following:
  - Only rows 1-64k have matching `prch_dt` values.
  - Within that range, only rows above 32k have matching `geo_id` values.

The resulting set includes only the blocks for the queried columns (prch_dt, geo_id, and prch_amt) for rows 32k-64k: specifically, blocks 6, 7, and 10.
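
For contrast, a predicate on a date outside the shard's range would allow the entire shard to be skipped from its metadata alone, without reading any of its blocks. A minimal sketch against the same purchases table (the date is chosen only because it lies outside this shard's range):

```sql
-- 2020-02-15 falls outside this shard's prch_dt range
-- (2020-01-01 to 2020-01-03), so the whole shard is skipped.
SELECT prch_amt
FROM purchases
WHERE prch_dt = '2020-02-15';
```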
## Correlated Columns
It is important to note that the data is well ordered within the shard based on common filtering predicates, starting with prch_dt followed by geo_id, even though the SORT ON clause only specifies prch_dt. In Yellowbrick, correlated columns refer to cases where the ordering of one column reflects or aligns with the ordering of another. Unlike indexes in traditional relational databases, which might involve creating an index on multiple columns, Yellowbrick utilizes these relationships by allowing data to be ordered on a single column. This approach also enhances scan performance and compression efficiency.
### Using a Calendar Table as an Example
A calendar table demonstrates how the same date can be represented in multiple formats, such as a native DATE, integers, and character strings. Below is a representation of such a table:
| col | dt | yyyymmdd | yyyymm | yyyy | m | dt_vc | dt_str |
|---|---|---|---|---|---|---|---|
| type | DATE | INT4 | INT4 | INT2 | INT2 | VARCHAR | VARCHAR |
| example | 2024-01-23 | 20240123 | 202401 | 2024 | 1 | '2024-01-23' | 'Jan 23, 2024' |
There are many kinds of relationships between columns that we can leverage.
| Relationship | Usage |
|---|---|
| 1:1, Same Ordering | For example, a calendar table where a date exists as a DATE, an INTEGER, and a text string. In this case, ordering by the smallest representation is recommended. A date stored as a DATE or INTEGER uses 4 bytes, fitting into a CPU register and allowing efficient numeric sorting. Sorting is significantly faster than using a 10-character string format (e.g., 2024-01-23). See the sketch after this table. |
| 1:1, Different Ordering | For example, a calendar table with a column containing the spelled-out name, such as January 1, 2024. Although there is a 1:1 relationship with the date, their sort orders differ. Sorting by the date provides better scan efficiency and improves compression. Dictionary-based algorithms (e.g., ZIP) effectively compress repeated strings like month names. |
| Parent-Child, Same Ordering | Ordering by the child column (e.g., yyyymmdd) ensures its parent columns (e.g., yyyymm and yyyy) are ordered naturally. |
| Parent-Child, Different Ordering | For example, columns like geo_name and country_name. When queries frequently filter on both columns, prioritize ordering by the lower-cardinality key (e.g., geography) first, followed by the higher-cardinality key (e.g., country). |
| Commonly Found Together | For example, product category and product subcategory. Ordering these columns together does not greatly assist with shard or block skipping but can significantly improve compression in large datasets. Ordering by the column with lower cardinality first (e.g., product category) and then by the column with higher cardinality (e.g., product subcategory) ensures that similar values are grouped together. This grouping allows compression algorithms to achieve better results by reducing redundancy. |
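
To illustrate the 1:1, Same Ordering case, a calendar dimension can be sorted on its smallest representation so that the wider, correlated columns are ordered as a side effect. This is a minimal sketch; the table name, column sizes, and distribution choice are illustrative assumptions, not part of the original example:

```sql
-- Sorting on the 4-byte dt column also orders the correlated
-- yyyymmdd, yyyymm, yyyy, and string representations of the date.
CREATE TABLE calendar (
  dt        DATE
, yyyymmdd  INTEGER
, yyyymm    INTEGER
, yyyy      SMALLINT
, m         SMALLINT
, dt_vc     VARCHAR(10)
, dt_str    VARCHAR(32)
)
DISTRIBUTE ON ( dt )   -- distribution choice is illustrative only
SORT ON ( dt )
;
```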
## Data Compression
Efficient data sorting improves compression. Most compression (or encoding) algorithms rely on or are significantly impacted by the order of the data, particularly when values are repeated in adjacent rows. Two of the most common compression methods are:
| Algorithm | Characteristics |
|---|---|
| Run-Length Encoding (RLE) | Optimized for consecutive repeated values. |
| Dictionary | Optimized for repeating values across a block of rows. |
Ordering data on multiple columns can provide benefits for both compression and scan performance.
Compression becomes increasingly effective as column size grows and cardinality decreases. For example, consider a block of 16k rows containing a 10-character string repeated 100 times in adjacent rows, occupying roughly 1,200 bytes uncompressed (about 12 bytes per value, including per-value overhead). With run-length encoding, those 100 adjacent values collapse to a single stored value plus a run count, shrinking the storage required for these rows in the compressed block to roughly 14 bytes.
## Table SORT ON and CLUSTER ON
Yellowbrick provides two order-related attributes that can be set during table creation: SORT ON and CLUSTER ON. These attributes are immutable, meaning any changes require dropping and recreating the table.
### SORT ON
The SORT ON attribute in Yellowbrick defines the order of data within shards, but not across shards, when written to disk. This applies during operations such as bulk loads, INSERT SELECT, CREATE TABLE AS SELECT (CTAS), and when shards are merged by Yellowbrick's storage garbage collector (GC).
INFO
Garbage Collector – In Yellowbrick, GC refers to automated shard maintenance processes such as rewriting shards with high numbers of deleted rows and merging small shards (less than 250MB). It is not related to memory garbage collection in programming languages like Java or C#.
#### Key Characteristics
| SORT ON | Scope |
|---|---|
| Shards | Works within shards, not across shards. Operations like cluster expansion, GC shard merging, and INSERT SELECT without an ORDER BY clause may not preserve ordering across shards. For example, an INSERT SELECT operation involving 10 shards per compute node over 50 days without ORDER BY date could result in up to 50 different dates in each shard. However, rows within each shard would still be ordered by date. |
| Correlated Columns | The benefits of SORT ON extend beyond the specified column. For example, sorting an invoices table by invoice_id often implies de facto ordering by invoice_date, due to column correlation (see the sketch after this table). |
| Performance | Can improve the performance of queries with filtering predicates and, in some cases, joins. Does not improve the performance of GROUP BY or ORDER BY operations. |
| Small Tables | Typically not worth applying to tables with fewer than ~50k rows per compute node. |
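
The Correlated Columns point above can be made concrete with a hedged sketch. The invoices table and its columns are assumptions introduced here purely for illustration:

```sql
-- invoice_id values are assigned in roughly chronological order,
-- so sorting on invoice_id also orders invoice_date within shards.
CREATE TABLE invoices (
  invoice_id   BIGINT
, invoice_date DATE
, acct         INTEGER
, total_amt    DECIMAL(12,2)
)
DISTRIBUTE ON ( acct )
SORT ON ( invoice_id )
;

-- A date-range filter still benefits from block skipping because
-- invoice_date is correlated with the sorted invoice_id column.
SELECT SUM(total_amt)
FROM invoices
WHERE invoice_date BETWEEN '2024-01-01' AND '2024-01-31';
```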
#### Limitations
| SORT ON | Limitations |
|---|---|
| Single Column | SORT ON can only be applied to one column. For multi-column ordering, consider creating a composite (synthetic) sort key, as sketched after this table. See the Data Distribution How-To Guide for more information. |
| Encryption | Cannot be applied to encrypted columns. |
| Immutability | Cannot be added to or changed on an existing table. You must drop and recreate the table to apply a new SORT ON. |
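
As a concrete illustration of the composite (synthetic) sort key mentioned under the Single Column limitation, one option is a derived integer that encodes both ordering columns. This is a hedged sketch only; the table, column, and expression are assumptions (it presumes geo_id is always below 100), and the Data Distribution How-To Guide remains the authoritative reference:

```sql
-- sort_key encodes yyyymmdd * 100 + geo_id, so sorting on it orders
-- rows by date first and then by geography within each date.
CREATE TABLE purchases_sorted (
  prch_dt   DATE
, geo_id    INTEGER
, acct      INTEGER
, subacct   INTEGER
, prch_amt  DECIMAL(9,2)
, sort_key  BIGINT
)
DISTRIBUTE ON ( acct )
SORT ON ( sort_key )
;

INSERT INTO purchases_sorted
SELECT prch_dt
     , geo_id
     , acct
     , subacct
     , prch_amt
     , TO_CHAR(prch_dt, 'YYYYMMDD')::BIGINT * 100 + geo_id
FROM purchases;
```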
### CLUSTER ON
CLUSTER ON is an alternative to SORT ON. It allows you to specify multiple columns but is not equivalent to a multi-column sort or cube. In most cases, it is not recommended and should only be used in highly specific scenarios.
#### Key Characteristics
| CLUSTER ON | Scope |
|---|---|
| Not a Multi-Column Sort | CLUSTER ON does not provide ordered sorting across columns. For cases requiring multi-column order, use a composite sort key instead. |
| Low Cardinality Keys | Effective only when applied to low cardinality columns of equal weight. |
| High Cost | It has significant performance overhead during data loads and INSERT operations. |
WARNING
In practice, CLUSTER ON is rarely beneficial and carries high costs. Unless you are certain that your use case justifies it, you should avoid using CLUSTER ON and instead explore alternative methods, such as composite sort keys, to achieve your goals.
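
For completeness, the DDL pattern follows the same shape as SORT ON. The following is a purely illustrative sketch, not a recommendation; the choice of prch_dt and geo_id here only demonstrates the syntax:

```sql
-- Illustrative only; see the warning above before using CLUSTER ON.
CREATE TABLE purchases_clustered (
  prch_dt   DATE
, geo_id    INTEGER
, acct      INTEGER
, subacct   INTEGER
, prch_amt  DECIMAL(9,2)
)
DISTRIBUTE ON ( acct )
CLUSTER ON ( prch_dt, geo_id )
;
```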
## Ordering Within vs Across Shards
Data ordering is an essential aspect of table design in Yellowbrick. The method used to load data—Bulk Load, Insert-Select, or Trickle Load—determines how data is ordered within and across shards, directly impacting query performance and compression efficiency. The following table compares the effects of these three loading methods on ordering:
| Method | Within Shards | Across Shards |
|---|---|---|
| Bulk Load | Data within shards is ordered according to the SORT ON column if defined. Without SORT ON, ingestion order is preserved. | No guaranteed ordering across shards due to parallelized data ingestion. |
| Insert-Select | An ORDER BY clause ensures data is ordered within shards before being written. Without it, the ingestion sequence determines the order. | Guarantees ordering across shards if an ORDER BY clause is included in the query. |
| Trickle Load | Data retains its streaming order unless reordered by a SORT ON column. | Incremental ingestion does not ensure ordering across shards due to the nature of real-time streaming. |
## Inherent Data Ordering
Data loaded into shards retains its incoming order unless explicitly reordered by the SORT ON column. However, due to the highly parallelized nature of ybload, this order may differ from that of the original source file(s).
Fact data is often loaded in a natural date order, either incrementally throughout the day or in batches at the end of the day. Queries on fact tables also typically include specific date range filters. As a result, it is common to define the SORT ON column as the most frequently used filtering predicate. This approach works well for tables with a small number of shards per compute node and for shards that are rarely updated. For larger tables, consider using composite sort keys and carefully designed ordering. See the Data Distribution How-To Guide for instructions on creating surrogate keys.
The SORT ON and CLUSTER ON clauses determine the order of data within individual shards but do not impose an order across shards. When data is streamed to a compute node, the stream is divided into 1 GB buffers. Once a buffer is full, it is sorted (if a SORT ON or CLUSTER ON clause is defined for the table) and written out as a shard. This behavior applies to most backend data appends, including bulk loads, \COPY, INSERT VALUES operations larger than 30 MB, and INSERT SELECT without an ORDER BY clause.
For cross-shard data ordering, use INSERT SELECT ORDER BY. This method sorts all data before streaming it to the shard writer, ensuring consistent ordering across shards.
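
A minimal sketch of this pattern, assuming the new rows are staged in a hypothetical purchases_stage table with the same columns:

```sql
-- Sorting the full result set before it reaches the shard writers
-- produces date ordering both within and across the resulting shards.
INSERT INTO purchases
SELECT prch_dt, geo_id, acct, subacct, prch_amt
FROM purchases_stage
ORDER BY prch_dt;
```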
Once data is written to shards, its order remains consistent except in the following scenarios:
| Case | Description |
|---|---|
| UPDATE | Updates are processed as a DELETE followed by an append. Rows are not modified in place; instead, updated rows are written to new shards. If the update size is small (less than 250 MB), these smaller shards will be merged with others during garbage collector operations. |
| DELETE | When shards have less than 250 MB of undeleted rows, the garbage collector merges them. During this process, data within the new shards is sorted. This typically occurs after large-scale DELETE or UPDATE operations. |
| Cluster Expansion | Changes to the cluster topology, such as replication to a new configuration, alter the distribution of rows across compute nodes. This inevitably changes the order of rows as well. |
| Replication / Restore | Backup and restore operations are highly parallelized, meaning rows may not be restored in the same order they originally existed. Additionally, restoring or replicating to a different topology (e.g., with a different number of compute nodes or chassis) guarantees changes in both the row distribution and their order. |