Parquet Processing Options
This section covers ybunload options that are specific to unloads in parquet format. These options do not apply to unloads in csv and text formats.
- --disable-column-aliases, --no-disable-column-aliases
Do not retain column aliases that were specified in a SQL statement for a ybunload operation. Column names are automatically generated and used in place of any aliases to form the schema in the unloaded parquet files. This option only applies to unloads in parquet format. The default behavior is to retain column aliases.
For example, assume that the ybunload command specifies this query:
    -s "select seasonid sid, matchday mday, ftscore ft, htscore ht from match order by 1"
If the --disable-column-aliases option is also specified, the aliases (sid, mday, ft, ht) in the query are not retained in the output files:
    % parquet-tools schema DisableSelectAliases_2_0_.parquet
    message schema {
      optional int32 seasonid (INTEGER(16,true));
      optional int64 matchday (TIMESTAMP(MICROS,true));
      optional binary ftscore (STRING);
      optional binary htscore (STRING);
    }
If --disable-column-aliases is not specified (or --no-disable-column-aliases is specified), the aliases are retained:
    % parquet-tools schema NoDisableSelectAliases_1_0_.parquet
    message schema {
      optional int32 sid (INTEGER(16,true));
      optional int64 mday (TIMESTAMP(MICROS,true));
      optional binary ft (STRING);
      optional binary ht (STRING);
    }
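The paired `--X` / `--no-X` form used by this option can be produced by a small helper when a script assembles the command line. The following is a minimal sketch, not part of ybunload: `paired_flag` and `alias_args` are hypothetical names, and only the flags shown in this section are emitted. A real command also needs connection, format, and output options, which are omitted here.

```python
from __future__ import annotations


def paired_flag(name: str, enabled: bool) -> str:
    """Return --NAME when enabled, otherwise the --no-NAME form."""
    return f"--{name}" if enabled else f"--no-{name}"


def alias_args(query: str, keep_aliases: bool = True) -> list[str]:
    """Build the query and alias arguments for an unload (hypothetical helper).

    Only flags that appear in this section are used; every other
    argument a real ybunload invocation requires is left out.
    """
    return ["-s", query, paired_flag("disable-column-aliases", not keep_aliases)]


if __name__ == "__main__":
    query = "select seasonid sid, matchday mday from match order by 1"
    print(alias_args(query, keep_aliases=False))
```

Passing `keep_aliases=True` yields the explicit `--no-disable-column-aliases` form, which matches the default behavior described above.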
- --disable-column-names, --no-disable-column-names
Do not retain column names that were specified in the DDL for a table that is being unloaded. Column names are automatically generated and used in place of the DDL column names to form the schema in the unloaded parquet files. This option only applies to unloads in parquet format. The default behavior is to retain column names.
For example, an unload of the match table with --disable-column-names specified produces the following parquet schema:
    % parquet-tools schema match_table_disablecols_1_0_.parquet
    message schema {
      optional int32 _COL_1 (INTEGER(16,true));
      optional int64 _COL_2 (TIMESTAMP(MICROS,true));
      optional int32 _COL_3 (INTEGER(16,true));
      optional int32 _COL_4 (INTEGER(16,true));
      optional binary _COL_5 (STRING);
      optional binary _COL_6 (STRING);
    }
The _COL_* column names replace the DDL column names:
    premdb=# \d match
                Table "public.match"
      Column  |            Type             | Modifiers
    ----------+-----------------------------+-----------
     seasonid | smallint                    |
     matchday | timestamp without time zone |
     htid     | smallint                    |
     atid     | smallint                    |
     ftscore  | character(3)                |
     htscore  | character(3)                |
If the --parquet-column-prefix option is also specified, the column names are replaced accordingly. For example, --parquet-column-prefix 'MatchCol' produces:
    % parquet-tools schema match_table_prefix_disablecols_1_0_.parquet
    message schema {
      optional int32 MatchCol1 (INTEGER(16,true));
      optional int64 MatchCol2 (TIMESTAMP(MICROS,true));
      optional int32 MatchCol3 (INTEGER(16,true));
      optional int32 MatchCol4 (INTEGER(16,true));
      optional binary MatchCol5 (STRING);
      optional binary MatchCol6 (STRING);
    }
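The naming pattern in the two schemas above (a prefix followed by the 1-based column position) can be reproduced with a few lines of Python, which may help when writing downstream code that must predict the generated names. This is a sketch inferred from the example output only; `generated_column_names` is a hypothetical helper, not a ybunload API.

```python
def generated_column_names(count, prefix="_COL_"):
    """Return names of the form <prefix><position>, with positions starting at 1.

    Assumption: the pattern is inferred from the example schemas in this
    section; ybunload's actual implementation is not shown in the source.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return [f"{prefix}{position}" for position in range(1, count + 1)]


if __name__ == "__main__":
    # The match table in the example has six columns.
    print(generated_column_names(6))
    print(generated_column_names(6, prefix="MatchCol"))
```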
- --opt-tmp-buffer-size NUMBER
Set the buffer size (in bytes) to be used for string operations such as transcoding. The default size is 3145728, which is large enough, in most cases, to hold the maximum width of an unloaded row, given that the size of the row is multiplied for transcoding purposes. In some cases, the default buffer size may not be big enough. If ybunload returns a prompt to adjust this buffer, you can increase it; otherwise, use the default value. This option is similar to the --pre-parse-buf-size option in ybload.
- --parquet-column-prefix STRING
Supply a column prefix that overrides the default prefix (_COL_) used to generate and disambiguate column names in the schema of unloaded parquet files. Column names in a parquet schema must be unique. For example, if a join query used in a ybunload command produces multiple columns with the same name, a prefix must be used to make them unique. When duplicate column names are found, the second column name and any additional duplicates are renamed with the prefix and a number (for example, _COL_2, _COL_3).
For example, in this case a simple cross-join of tables t1, t2, and t3 produces three output columns named c1:
    premdb=# select * from t1, t2, t3;
     c1 | c1 | c1
    ----+----+-----
      1 | 10 | 100
    (1 row)
An unload of this row in parquet format with the default column prefix produces the following results:
    % parquet-tools schema unload_2_0_.parquet
    message schema {
      optional int32 c1 (INTEGER(32,true));
      optional int32 _COL_2 (INTEGER(32,true));
      optional int32 _COL_3 (INTEGER(32,true));
    }
    % parquet-tools head unload_2_0_4.parquet
    c1 = 1
    _COL_2 = 10
    _COL_3 = 100
If you set --parquet-column-prefix 'duplicate' in the ybunload command, the columns will be named as follows:
    % parquet-tools schema unload_2_0_.parquet
    message schema {
      optional int32 c1 (INTEGER(32,true));
      optional int32 duplicate2 (INTEGER(32,true));
      optional int32 duplicate3 (INTEGER(32,true));
    }
    % parquet-tools head unload_2_0_.parquet
    c1 = 1
    duplicate2 = 10
    duplicate3 = 100
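The renaming rule described for this option (keep the first occurrence of a name, rename later duplicates with the prefix and a number) can be sketched in Python. This is an illustration under stated assumptions, not ybunload's implementation: `disambiguate` is a hypothetical name, and the number is taken to be the 1-based column position, which is consistent with the example but not confirmed by it.

```python
def disambiguate(names, prefix="_COL_"):
    """Rename repeated column names so every name in the result is unique.

    Assumptions (not confirmed by the source):
    - the first occurrence of a name is kept unchanged;
    - each later duplicate becomes <prefix><1-based column position>.
    Collisions between a generated name and an existing column name
    are not handled in this sketch.
    """
    seen = set()
    result = []
    for position, name in enumerate(names, start=1):
        if name in seen:
            result.append(f"{prefix}{position}")
        else:
            seen.add(name)
            result.append(name)
    return result


if __name__ == "__main__":
    columns = ["c1", "c1", "c1"]  # output columns of the t1, t2, t3 cross-join
    print(disambiguate(columns))
    print(disambiguate(columns, prefix="duplicate"))
```

A helper like this can be useful on the consuming side, for example to map the parquet column names back to the source tables of a join.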
- --parquet-disable-dictionary-compression, --no-parquet-disable-dictionary-compression
Disable dictionary compression. Dictionary compression is enabled by default for parquet unloads.
- --parquet-rowgroup-size NUMBER
Set the size, in bytes, of a parquet row group:
- Default: 52428800
- Maximum size: 1073741824
- Minimum size: 1048576
A parquet row group is a segment of a parquet file (similar to a shard of data stored for a Yellowbrick table). A row group contains a set of rows and associated statistics, such as minimum and maximum values. You may need to limit the row group size for the following reasons:
- The row group size directly affects unload performance. Memory usage on the ybunload client will be proportionally higher with an increase in row group size.
- Skipping during reloads of unloaded parquet files will be limited if the range in maximum and minimum values for a row group is too large. For example, if all the rows for an unloaded table are written to the same row group in a parquet file, a subsequent load from that file cannot benefit from skipping. Other query planning optimizations, such as predicate pushdown, may be inhibited as well.
- ybunload may buffer up to one row group per blade. Increasing the row group size has a direct impact on ybunload memory usage and, for large systems, may cause ybunload to run out of memory.
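Because ybunload may buffer up to one row group per blade, a rough upper bound on row-group buffer memory is the row group size multiplied by the number of blades. The sketch below checks a candidate size against the documented limits and computes that bound. The function names and the blade count are hypothetical, and the estimate ignores every other source of memory use in the client.

```python
DEFAULT_ROWGROUP_SIZE = 52_428_800    # documented default, in bytes
MIN_ROWGROUP_SIZE = 1_048_576         # documented minimum, in bytes
MAX_ROWGROUP_SIZE = 1_073_741_824     # documented maximum, in bytes


def check_rowgroup_size(size):
    """Return size unchanged if it lies within the documented limits."""
    if not MIN_ROWGROUP_SIZE <= size <= MAX_ROWGROUP_SIZE:
        raise ValueError(
            f"row group size {size} is outside "
            f"[{MIN_ROWGROUP_SIZE}, {MAX_ROWGROUP_SIZE}]"
        )
    return size


def worst_case_buffer_bytes(rowgroup_size, blades):
    """Rough upper bound: one full row group buffered per blade.

    Assumption: only row-group buffers are counted; other client
    memory use is not modeled.
    """
    if blades < 1:
        raise ValueError("blades must be at least 1")
    return check_rowgroup_size(rowgroup_size) * blades


if __name__ == "__main__":
    blades = 15  # hypothetical system size
    for size in (DEFAULT_ROWGROUP_SIZE, MAX_ROWGROUP_SIZE):
        print(size, worst_case_buffer_bytes(size, blades))
```

Comparing the result with the memory available on the client gives a quick sanity check before raising --parquet-rowgroup-size on a large system.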
- --parquet-write-buffer-size NUMBER
Set the parquet write buffer size, in bytes. Default: 8192