Skip to content

Sorted and Clustered Tables

A Yellowbrick table may be sorted on one column or clustered on multiple columns (up to four). You cannot specify a sort column and cluster columns on the same table. After specifying one of these options, you cannot change the specification without dropping and re-creating the table.

A table that is sorted on one column facilitates the skipping of blocks when tables are scanned and the sorted column is restricted in the query.

A table that is clustered on multiple columns facilitates the skipping of blocks when tables are scanned and two or more cluster columns are restricted in the query.

You cannot sort or cluster on encrypted columns.

Note: Capacity expansion (adding blades to the cluster) results in table rows being rewritten and redistributed on the storage system. If any manual ordering was applied to the data before the expansion, there is no guarantee that the data will remain sorted in that order. In turn any query performance advantage associated with that order will be lost (such as skipping blocks during scans).

Sort Column

You can define a sort column for a Yellowbrick table. The SORT ON (column) clause names a single column to sort on before data is stored on the media. For example:

create table team
(teamid smallint, htid smallint, atid smallint, name varchar(30), nickname varchar(20), city varchar(20), stadium varchar(50), capacity int) 
distribute on (teamid)
sort on (city);

Clustered Tables

You can define up to four cluster columns for a Yellowbrick table. TheCLUSTER ON (column, column, ...) clause names multiple columns to cluster on before data is stored on the media. For example:

create table team
(teamid smallint, htid smallint, atid smallint, name varchar(30), nickname varchar(20), city varchar(20), stadium varchar(50), capacity int) 
distribute on (teamid)
cluster on (name, nickname, city);

A clustered table maintains an internal map that organizes data as a set of points across the dimensions that are defined as cluster columns. This map optimizes query performance because the optimizer can skip the blocks of data that do not qualify for multiple restrictions. One pass over the data is sufficient to find the intersection of rows that meet the WHERE clause criteria, and a large amount of data in the base table can be eliminated very early in the query plan.

Parent topic:CREATE TABLE