Data Ordering

Introduction

In analytic MPP databases such as Yellowbrick, data ordering is a critical table design consideration, second only to data distribution. Proper data ordering enhances SQL statement performance by improving scan efficiency, and reduces storage requirements by enabling better compression.

  • Tables with effective ordering allow the database to skip entire shards or blocks during scans when a filtering predicate involves an order column or a column with high correlation to the order column.

  • Compression efficiency is directly influenced by data order, as most compression algorithms benefit from sequential or repeated values and contiguous byte sequences.

The order of data within a table is determined by several factors, including table definition attributes, the method of data loading, data modifications such as inserts, updates, and deletes, and built-in database storage maintenance and management.
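As a minimal sketch of how ordering interacts with predicates and compression, the example below defines a table with an explicit sort column and then filters on that column. The table, columns, and values are illustrative only, and the DISTRIBUTE ON and SORT ON clauses should be verified against the CREATE TABLE reference for your Yellowbrick version.

-- Illustrative sketch only: object names are made up, and the DISTRIBUTE ON /
-- SORT ON clauses should be checked against the CREATE TABLE documentation.
CREATE TABLE sales_fact (
    sale_date   DATE          NOT NULL,
    store_id    INTEGER       NOT NULL,
    customer_id BIGINT        NOT NULL,
    amount      DECIMAL(12,2)
)
DISTRIBUTE ON (customer_id)
SORT ON (sale_date);

-- With rows ordered on sale_date, a range predicate on that column lets the
-- scanner skip shards and blocks whose min/max metadata fall outside the
-- range, and the ordered column itself compresses well.
SELECT store_id, SUM(amount) AS total_amount
FROM sales_fact
WHERE sale_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY store_id;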

Data Storage in Yellowbrick

In Yellowbrick, user data is stored on compute nodes in a columnar block format, with sets of rows organized column by column. Understanding the mechanics of data storage is key to understanding the impact of data ordering on scanning efficiency and compression.

The fundamental unit of storage in Yellowbrick is a shard, which:

  • Holds up to 1 GB of data (before compression).

  • Contains data and metadata for rows from a single table.

  • Stores data in per-column blocks, with up to 32,000 rows per block.

  • Is immutable and is never updated in place.

INFO

It is important not to confuse a Yellowbrick data shard with the "shard" concept from sharded databases (such as MongoDB), where a shard refers to a distribution partition.
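One way to see the combined effect of ordering and compression on these shards is to compare a table's compressed and uncompressed footprint. The sketch below assumes a sys.table_storage system view with rows_columnstore, uncompressed_bytes, and compressed_bytes columns; those names are assumptions and should be confirmed in the Yellowbrick system views reference.

-- Assumed view and column names (sys.table_storage, rows_columnstore,
-- uncompressed_bytes, compressed_bytes); confirm in the system views docs.
SELECT table_id,
       rows_columnstore,
       uncompressed_bytes,
       compressed_bytes,
       round(uncompressed_bytes::numeric / nullif(compressed_bytes, 0), 2)
           AS compression_ratio
FROM sys.table_storage
ORDER BY compressed_bytes DESC;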

Shard Components in Yellowbrick

A shard in Yellowbrick is composed of five key parts:

  • Shard metadata: Includes per-column metadata such as the number of rows, number of nulls, and minimum and maximum values. This metadata acts as an alternative to traditional indexes and is necessary for enabling shard skipping, which significantly enhances scan performance.

  • Block data: Contains compressed data for a column, with up to 32,000 rows per block.

  • Block metadata: Similar to shard metadata, it includes per-column information such as the number of rows, number of nulls, and minimum and maximum values. This metadata enables block skipping within shards, further improving scan performance.

  • Parity blocks: Yellowbrick uses a RAID 6-like parity mechanism striped across compute nodes rather than across disks within a compute node. Parity data is also stored within shards.

  • Boundary alignment: All shards are aligned to end on a 2 MB boundary.

Shard and Block Illustration

The following provides a logical illustration of a shard layout. Although block data is only one of the five components of a shard, it occupies most of the space in a full shard (often more than 99%).

Metadata Type / Data                              Row Scope / Data
Shard metadata                                    Rows 1 - ...
    col1 (min, max, nulls, ...)   col2 (min, max, nulls, ...)   ...
Block metadata                                    Rows 1 to 32k
    col1:block1 (min, max, ...)   col2:block1 (min, max, ...)   ...
Block data                                        Rows 1 to 32k
    col1:block1                   col2:block1                   ...
Block metadata                                    Rows 32k + 1 to 64k
    col1:block2 (min, max, ...)   col2:block2 (min, max, ...)   ...
Block data                                        Rows 32k + 1 to 64k
    col1:block2                   col2:block2                   ...
...                                               ...
Data parity blocks                                (All)
    node1:shard1:parityBlk1       node1:shard1:parityBlk2       ...
...                                               ...
Boundary alignment                                (up to 2 MB)
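To put rough numbers on that proportion: assuming, say, 100 bytes per row, a full 1 GB shard holds roughly 10 million rows, so each column in that shard is stored in about 10,000,000 / 32,000 ≈ 300 blocks, while the shard and block metadata amount to only a handful of values per column per block.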

Shard and Block Scan and Skipping

To optimize data scans, the goal is to read as few shards as possible and, within those shards, as few blocks as possible. Shard and block skipping (also known as range-restricted scans) uses shard and block metadata to exclude unnecessary shards and blocks from the scan — the Yellowbrick columnar datastore does not use indexes.
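As a sketch (reusing the illustrative sales_fact table from the introduction), the queries below contrast a predicate that a range-restricted scan can exploit with one it typically cannot; how much is actually skipped depends on how well the data is ordered on the filtered column.

-- Predicate on the ordered column: shards and blocks whose min/max
-- metadata fall outside the range are excluded from the scan.
SELECT COUNT(*)
FROM sales_fact
WHERE sale_date >= DATE '2024-06-01';

-- Predicate on a column uncorrelated with the data order: most shard and
-- block min/max ranges overlap the filter, so little or nothing is skipped.
SELECT COUNT(*)
FROM sales_fact
WHERE customer_id = 123456;

-- EXPLAIN shows the chosen plan; the level of scan detail reported depends
-- on the Yellowbrick version and tooling.
EXPLAIN
SELECT COUNT(*)
FROM sales_fact
WHERE sale_date >= DATE '2024-06-01';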

For a practical guide to optimizing data ordering, see the Data Ordering How-To Guide.