Parquet Processing Options

This section covers ybunload options that are specific to unloads in parquet format. These options do not apply to unloads in csv and text formats.

--disable-column-aliases, --no-disable-column-aliases

Do not retain column aliases that were specified in a SQL statement for a ybunload operation. Column names are automatically generated and used in place of any aliases to form the schema in the unloaded parquet files. This option only applies to unloads in parquet format. The default behavior is to retain column aliases.

For example, assume that the ybunload command specifies this query:

-s "select seasonid sid, matchday mday, ftscore ft, htscore ht from match order by 1"

If the --disable-column-aliases option is also specified, the aliases (sid, mday, ft, ht) in the query are not retained in the output files:

% parquet-tools schema DisableSelectAliases_2_0_.parquet
message schema {
  optional int32 seasonid (INTEGER(16,true));
  optional int64 matchday (TIMESTAMP(MICROS,true));
  optional binary ftscore (STRING);
  optional binary htscore (STRING);
}

If --disable-column-aliases is not specified (or --no-disable-column-aliases is specified), the aliases are retained:

% parquet-tools schema NoDisableSelectAliases_1_0_.parquet
message schema {
  optional int32 sid (INTEGER(16,true));
  optional int64 mday (TIMESTAMP(MICROS,true));
  optional binary ft (STRING);
  optional binary ht (STRING);
}
--disable-column-names, --no-disable-column-names

Do not retain column names that were specified in the DDL for a table that is being unloaded. Column names are automatically generated and used in place of the DDL column names to form the schema in the unloaded parquet files. This option only applies to unloads in parquet format. The default behavior is to retain column names.

For example, an unload of the match table with --disable-column-names specified will produce the following parquet schema:

% parquet-tools schema match_table_disablecols_1_0_.parquet     
message schema {
  optional int32 _COL_1 (INTEGER(16,true));
  optional int64 _COL_2 (TIMESTAMP(MICROS,true));
  optional int32 _COL_3 (INTEGER(16,true));
  optional int32 _COL_4 (INTEGER(16,true));
  optional binary _COL_5 (STRING);
  optional binary _COL_6 (STRING);
}

The _COL_* column names replace the DDL column names:

premdb=# \d match
                Table "public.match"
  Column  |            Type             | Modifiers 
----------+-----------------------------+-----------
 seasonid | smallint                    | 
 matchday | timestamp without time zone | 
 htid     | smallint                    | 
 atid     | smallint                    | 
 ftscore  | character(3)                | 
 htscore  | character(3)                |

If the --parquet-column-prefix option is also specified, the column names are replaced accordingly. For example, --parquet-column-prefix 'MatchCol' produces:

% parquet-tools schema match_table_prefix_disablecols_1_0_.parquet
message schema {
  optional int32 MatchCol1 (INTEGER(16,true));
  optional int64 MatchCol2 (TIMESTAMP(MICROS,true));
  optional int32 MatchCol3 (INTEGER(16,true));
  optional int32 MatchCol4 (INTEGER(16,true));
  optional binary MatchCol5 (STRING);
  optional binary MatchCol6 (STRING);
}
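The generated-name pattern shown in these schemas (prefix plus the column's 1-based position) can be sketched as follows. This is a minimal illustration of the documented naming pattern, not ybunload's actual code:

```python
def generated_names(ncols, prefix="_COL_"):
    # Sketch of the pattern in the schemas above: prefix + 1-based column position.
    return [f"{prefix}{i}" for i in range(1, ncols + 1)]

print(generated_names(3))              # ['_COL_1', '_COL_2', '_COL_3']
print(generated_names(3, "MatchCol"))  # ['MatchCol1', 'MatchCol2', 'MatchCol3']
```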
--opt-tmp-buffer-size NUMBER

Set the buffer size (in bytes) to be used for string operations such as transcoding. The default size is 3145728, which is large enough, in most cases, to hold the maximum width of an unloaded row, given that the size of the row is multiplied for transcoding purposes. In some cases, the default buffer size may not be big enough. If ybunload returns an error prompting you to adjust this buffer, increase it; otherwise, use the default value. This option is similar to the --pre-parse-buf-size option in ybload.

--parquet-column-prefix STRING

Supply a column prefix that overrides the default prefix (_COL_) used to generate and disambiguate column names in the schema of unloaded parquet files. Column names in a parquet schema must be unique. For example, if a join query used in a ybunload command produces multiple columns with the same name, a prefix must be used to make them unique. When duplicate column names are found, the second column name and any additional duplicates are renamed with the prefix and a number (for example, _COL_2, _COL_3).

For example, in this case a simple cross-join of tables t1, t2, and t3 produces three output columns named c1:

premdb=# select * from t1, t2, t3;
 c1 | c1 | c1  
----+----+-----
  1 | 10 | 100
(1 row)

An unload of this row in parquet format with the default column prefix produces the following results:

% parquet-tools schema unload_2_0_.parquet
message schema {
  optional int32 c1 (INTEGER(32,true));
  optional int32 _COL_2 (INTEGER(32,true));
  optional int32 _COL_3 (INTEGER(32,true));
}
% parquet-tools head unload_2_0_.parquet
c1 = 1
_COL_2 = 10
_COL_3 = 100

If you set --parquet-column-prefix 'duplicate' in the ybunload command, the columns will be named as follows:

% parquet-tools schema unload_2_0_.parquet 
message schema {
  optional int32 c1 (INTEGER(32,true));
  optional int32 duplicate2 (INTEGER(32,true));
  optional int32 duplicate3 (INTEGER(32,true));
}
% parquet-tools head unload_2_0_.parquet
c1 = 1
duplicate2 = 10
duplicate3 = 100
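
The renaming rule shown above (the first occurrence of a name is kept; later duplicates are replaced by the prefix plus the column's 1-based position) can be sketched as follows. This is a minimal sketch of the documented behavior, not ybunload's implementation:

```python
def disambiguate(names, prefix="_COL_"):
    # Keep the first occurrence of each column name; rename later
    # duplicates to prefix + the column's 1-based position.
    seen = set()
    result = []
    for pos, name in enumerate(names, start=1):
        if name in seen:
            result.append(f"{prefix}{pos}")
        else:
            seen.add(name)
            result.append(name)
    return result

print(disambiguate(["c1", "c1", "c1"]))               # ['c1', '_COL_2', '_COL_3']
print(disambiguate(["c1", "c1", "c1"], "duplicate"))  # ['c1', 'duplicate2', 'duplicate3']
```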
--parquet-disable-dictionary-compression, --no-parquet-disable-dictionary-compression

Disable dictionary compression. Dictionary compression is enabled by default for parquet unloads.

--parquet-rowgroup-size NUMBER

Set the size, in bytes, of a parquet row group:

  • Default: 52428800
  • Maximum size: 1073741824
  • Minimum size: 1048576

A parquet row group is a segment of a parquet file (similar to a shard of data stored for a Yellowbrick table). A row group contains a set of rows and associated statistics, such as minimum and maximum values. You may need to limit the row group size for the following reasons:

  • The row group size directly affects unload performance. Memory usage on the ybunload client is proportionally higher with a larger row group size.

  • Skipping during reloads of unloaded parquet files is limited if the range between the minimum and maximum values for a row group is too large. For example, if all the rows for an unloaded table are written to the same row group in a parquet file, a subsequent load from that file cannot benefit from skipping. Other query planning optimizations, such as predicate pushdown, may be inhibited as well.

  • ybunload may buffer up to one row group per blade. Increasing the row group size has a direct impact on ybunload memory usage and, for large systems, may cause ybunload to run out of memory.
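
As a rough illustration of the memory point above: since ybunload may buffer up to one row group per blade, worst-case client-side buffering grows with row group size times blade count. The blade count below is a hypothetical value, and actual buffering depends on the system:

```python
MIN_ROWGROUP = 1048576        # documented minimum (1 MiB)
MAX_ROWGROUP = 1073741824     # documented maximum (1 GiB)
DEFAULT_ROWGROUP = 52428800   # documented default (50 MiB)

def worst_case_buffer_bytes(rowgroup_size, blade_count):
    # Worst case: one fully buffered row group per blade.
    if not MIN_ROWGROUP <= rowgroup_size <= MAX_ROWGROUP:
        raise ValueError("--parquet-rowgroup-size must be between 1048576 and 1073741824")
    return rowgroup_size * blade_count

# Default 50 MiB row groups on a hypothetical 15-blade system:
print(worst_case_buffer_bytes(DEFAULT_ROWGROUP, 15))  # 786432000 bytes (~750 MiB)
```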

--parquet-write-buffer-size NUMBER

Set the parquet write buffer size (in bytes). The default size is 8192.

Parent topic: Unloading Data to Parquet Files