Data Ordering

Introduction

In analytic MPP databases such as Yellowbrick, data ordering is a critical table design consideration, second only to data distribution. Proper data ordering enhances SQL statement performance by improving scan efficiency, and reduces storage requirements by enabling better compression.

  • Tables with effective ordering allow the database to skip entire shards or blocks during scans when a filtering predicate involves an order column or a column with high correlation to the order column.

  • Compression efficiency is directly influenced by data order, as most compression algorithms benefit from sequential or repeated values and contiguous byte sequences.

The order of data within a table is determined by several factors, including table definition attributes, the method of data loading, data modifications such as inserts, updates, and deletes, and built-in database storage maintenance and management.
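As a minimal sketch of how ordering interacts with predicates and compression, the example below defines a table with an explicit sort column and then filters on that column. The table, columns, and values are illustrative only, and the DISTRIBUTE ON and SORT ON clauses should be verified against the CREATE TABLE reference for your Yellowbrick version.

-- Illustrative sketch only: object names are made up, and the DISTRIBUTE ON /
-- SORT ON clauses should be checked against the CREATE TABLE documentation.
CREATE TABLE sales_fact (
    sale_date   DATE          NOT NULL,
    store_id    INTEGER       NOT NULL,
    customer_id BIGINT        NOT NULL,
    amount      DECIMAL(12,2)
)
DISTRIBUTE ON (customer_id)
SORT ON (sale_date);

-- With rows ordered on sale_date, a range predicate on that column lets the
-- scanner skip shards and blocks whose min/max metadata fall outside the
-- range, and the ordered column itself compresses well.
SELECT store_id, SUM(amount) AS total_amount
FROM sales_fact
WHERE sale_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY store_id;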

Data Storage in Yellowbrick

In Yellowbrick, user data is stored on compute nodes in a columnar block format, with sets of rows organized column by column. Understanding the mechanics of data storage is key to understanding the impact of data ordering on scanning efficiency and compression.

The fundamental unit of storage in Yellowbrick is a shard, which:

  • Holds up to 1 GB of data (before compression).

  • Contains data and metadata for rows from a single table.

  • Stores data in per-column blocks, with up to 32,000 rows per block.

  • Is immutable and is never updated in place.

INFO

It is important not to confuse a Yellowbrick data shard with the "shard" concept from sharded databases (such as MongoDB), where a shard refers to a distribution partition.
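One way to see the combined effect of ordering and compression on these shards is to compare a table's compressed and uncompressed footprint. The sketch below assumes a sys.table_storage system view with rows_columnstore, uncompressed_bytes, and compressed_bytes columns; those names are assumptions and should be confirmed in the Yellowbrick system views reference.

-- Assumed view and column names (sys.table_storage, rows_columnstore,
-- uncompressed_bytes, compressed_bytes); confirm in the system views docs.
SELECT table_id,
       rows_columnstore,
       uncompressed_bytes,
       compressed_bytes,
       round(uncompressed_bytes::numeric / nullif(compressed_bytes, 0), 2)
           AS compression_ratio
FROM sys.table_storage
ORDER BY compressed_bytes DESC;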

Shard Components in Yellowbrick

A shard in Yellowbrick is composed of five key parts:

  • Shard metadata: Includes per-column metadata such as the number of rows, number of nulls, and minimum and maximum values. This metadata acts as an alternative to traditional indexes and is necessary for enabling shard skipping, which significantly enhances scan performance.

  • Block data: Contains compressed data for a column, with up to 32,000 rows per block.

  • Block metadata: Similar to shard metadata, it includes per-column information such as the number of rows, number of nulls, and minimum and maximum values. This metadata enables block skipping within shards, further improving scan performance.

  • Parity blocks: Yellowbrick uses a RAID 6-like parity mechanism striped across compute nodes rather than across disks within a compute node. Parity data is also stored within shards.

  • Boundary alignment: All shards are aligned to end on a 2 MB boundary.

Shard and Block Illustration

The following provides a logical illustration of a shard layout. Although block data is only one of the five components of a shard, it occupies most of the space in a full shard (often more than 99%).

Metadata Type / Data                              Row Scope / Data
Shard metadata                                    Rows 1 - ...
    col1 (min, max, nulls, ...)   col2 (min, max, nulls, ...)   ...
Block metadata                                    Rows 1 to 32k
    col1:block1 (min, max, ...)   col2:block1 (min, max, ...)   ...
Block data                                        Rows 1 to 32k
    col1:block1                   col2:block1                   ...
Block metadata                                    Rows 32k + 1 to 64k
    col1:block2 (min, max, ...)   col2:block2 (min, max, ...)   ...
Block data                                        Rows 32k + 1 to 64k
    col1:block2                   col2:block2                   ...
...                                               ...
Data parity blocks                                (All)
    node1:shard1:parityBlk1       node1:shard1:parityBlk2       ...
...                                               ...
Boundary alignment                                (up to 2 MB)
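To put rough numbers on that proportion: assuming, say, 100 bytes per row, a full 1 GB shard holds roughly 10 million rows, so each column in that shard is stored in about 10,000,000 / 32,000 ≈ 300 blocks, while the shard and block metadata amount to only a handful of values per column per block.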

Shard and Block Scan and Skipping

To optimize data scans, the goal is to read as few shards as possible and, within those shards, as few blocks as possible. Shard and block skipping (also known as range-restricted scans) uses shard and block metadata to exclude unnecessary shards and blocks from the scan — the Yellowbrick columnar datastore does not use indexes.
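As a sketch (reusing the illustrative sales_fact table from the introduction), the queries below contrast a predicate that a range-restricted scan can exploit with one it typically cannot; how much is actually skipped depends on how well the data is ordered on the filtered column.

-- Predicate on the ordered column: shards and blocks whose min/max
-- metadata fall outside the range are excluded from the scan.
SELECT COUNT(*)
FROM sales_fact
WHERE sale_date >= DATE '2024-06-01';

-- Predicate on a column uncorrelated with the data order: most shard and
-- block min/max ranges overlap the filter, so little or nothing is skipped.
SELECT COUNT(*)
FROM sales_fact
WHERE customer_id = 123456;

-- EXPLAIN shows the chosen plan; the level of scan detail reported depends
-- on the Yellowbrick version and tooling.
EXPLAIN
SELECT COUNT(*)
FROM sales_fact
WHERE sale_date >= DATE '2024-06-01';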

For a practical guide to optimizing data ordering, see the Data Ordering How-To Guide.