Parquet Processing Options

This section covers ybunload options that are specific to unloads in parquet format. These options do not apply to unloads in csv and text format.

--opt-tmp-buffer-size NUMBER

Set the buffer size, in bytes, used for string operations such as transcoding. The default size is 3145728 (3 MiB), which in most cases is large enough to hold the maximum width of an unloaded row, allowing for the fact that the row size is multiplied for transcoding purposes. If ybunload returns a message prompting you to adjust this buffer, increase it; otherwise, use the default value. This option is similar to the --pre-parse-buf-size option in ybload.

--parquet-disable-dictionary-compression, --no-parquet-disable-dictionary-compression

Disable dictionary compression. Dictionary compression is enabled by default for parquet unloads.

--parquet-rowgroup-size NUMBER

Set the size, in bytes, of a parquet row group:

  • Default: 52428800 (50 MiB)
  • Maximum size: 1073741824 (1 GiB)
  • Minimum size: 1048576 (1 MiB)

A parquet row group is a segment of a parquet file (similar to a shard of data stored for a Yellowbrick table). A row group contains a set of rows and associated statistics, such as minimum and maximum values. You may need to limit the row group size for the following reasons:

  • The row group size directly affects unload performance. Memory usage on the ybunload client increases in proportion to the row group size.

  • Skipping during reloads of unloaded parquet files is limited if the range between the minimum and maximum values in a row group is too large. For example, if all the rows for an unloaded table are written to the same row group in a parquet file, a subsequent load from that file cannot benefit from skipping. Other query planning optimizations, such as predicate pushdown, may be inhibited as well.

  • ybunload may buffer up to one row group per blade. Increasing the row group size therefore directly increases ybunload memory usage and, on large systems, may cause ybunload to run out of memory.
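The memory consideration above can be turned into a rough estimate. The following Python sketch is illustrative only: it assumes the worst case of one full row group buffered per blade, ignores all other ybunload memory use, and uses a hypothetical helper name (rowgroup_buffer_upper_bound) and hypothetical blade counts. Only the size limits come from this section.

```python
# Limits quoted in this section for --parquet-rowgroup-size, in bytes.
ROWGROUP_MIN = 1_048_576
ROWGROUP_MAX = 1_073_741_824
ROWGROUP_DEFAULT = 52_428_800


def rowgroup_buffer_upper_bound(blades: int, rowgroup_size: int = ROWGROUP_DEFAULT) -> int:
    """Return a rough upper bound, in bytes, on memory used to buffer row groups.

    Assumption: at most one full row group is buffered per blade.
    This is not a measurement of actual ybunload memory usage.
    """
    if blades < 1:
        raise ValueError("blades must be at least 1")
    if not ROWGROUP_MIN <= rowgroup_size <= ROWGROUP_MAX:
        raise ValueError(
            f"rowgroup_size must be between {ROWGROUP_MIN} and {ROWGROUP_MAX} bytes"
        )
    return blades * rowgroup_size


if __name__ == "__main__":
    # Hypothetical blade counts, chosen only to show how the bound scales.
    for blades in (3, 15, 30):
        for size in (ROWGROUP_DEFAULT, ROWGROUP_MAX):
            gib = rowgroup_buffer_upper_bound(blades, size) / 2**30
            print(f"{blades:>2} blades, row group {size:>10} bytes: up to {gib:.2f} GiB")
```

Under these assumptions, 15 blades at the default row group size give a bound of 786432000 bytes (about 0.73 GiB), whereas the maximum row group size gives 15 GiB on the same system.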

--parquet-write-buffer-size NUMBER

Set the size, in bytes, of the parquet write buffer. Default: 8192
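As a minimal sketch of how the options in this section might be combined in a script, the Python helper below assembles only the parquet-specific arguments. The function name parquet_option_args is hypothetical and is not part of ybunload; the remaining ybunload arguments (connection, source, and destination) are outside the scope of this section and are deliberately omitted.

```python
from typing import List, Optional

# Row group size limits quoted in this section, in bytes.
ROWGROUP_MIN = 1_048_576
ROWGROUP_MAX = 1_073_741_824


def parquet_option_args(
    rowgroup_size: Optional[int] = None,
    write_buffer_size: Optional[int] = None,
    tmp_buffer_size: Optional[int] = None,
    disable_dictionary: bool = False,
) -> List[str]:
    """Build the parquet-specific ybunload arguments for the given settings.

    Settings left as None are omitted so that ybunload applies its defaults.
    """
    args: List[str] = []
    if tmp_buffer_size is not None:
        args += ["--opt-tmp-buffer-size", str(tmp_buffer_size)]
    if disable_dictionary:
        args.append("--parquet-disable-dictionary-compression")
    if rowgroup_size is not None:
        # Reject values outside the documented range before calling ybunload.
        if not ROWGROUP_MIN <= rowgroup_size <= ROWGROUP_MAX:
            raise ValueError(
                f"rowgroup_size must be between {ROWGROUP_MIN} and {ROWGROUP_MAX} bytes"
            )
        args += ["--parquet-rowgroup-size", str(rowgroup_size)]
    if write_buffer_size is not None:
        args += ["--parquet-write-buffer-size", str(write_buffer_size)]
    return args


if __name__ == "__main__":
    # Example: a 16 MiB row group with dictionary compression disabled.
    print(" ".join(parquet_option_args(rowgroup_size=16 * 1024 * 1024,
                                       disable_dictionary=True)))
```

The resulting list can be appended to the rest of a ybunload command line. Validating the row group size in the script is a design choice that surfaces an out-of-range value before the unload starts.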