ybunload Output Files
This section describes the output files that ybunload
exports to the client.
Naming of Output Files
Use the --prefix option to give unique names to ybunload output files. If you do not use this option, files are named with the default unload prefix. When an unload operation produces multiple files, they are numbered consecutively. For example:
unload_1_1_.csv
unload_1_2_.csv
unload_1_3_.csv
...
The convention for naming files is as follows:
<prefix_><streamID_><partnumber_>.<extension>
- prefix_: As defined, or unload by default.
- streamID_: A number assigned to each data stream from the workers. The streams are not in any particular order relative to a specific worker, and a single worker may provide multiple streams.
- partnumber_: An incrementing number starting from 1 for each stream.
- .extension: File type, such as .csv or .gz.
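The naming convention above can be checked mechanically when post-processing unload output. The following is a minimal sketch, not part of ybunload: the function name parse_unload_name and the regular expression are assumptions that illustrate the documented pattern, and they assume the stream ID and part number are plain decimal integers.

```python
import re

# Assumed pattern for <prefix_><streamID_><partnumber_>.<extension>.
# The prefix is matched greedily so that prefixes containing underscores
# (or digits) still leave the last two numeric fields for stream and part.
_UNLOAD_NAME = re.compile(
    r"(?P<prefix>.+)_(?P<stream_id>\d+)_(?P<part_number>\d+)_\.(?P<extension>[^.]+)"
)


def parse_unload_name(filename):
    """Split an unload file name into its parts, or return None if it does not match."""
    match = _UNLOAD_NAME.fullmatch(filename)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix"),
        "stream_id": int(match.group("stream_id")),
        "part_number": int(match.group("part_number")),
        "extension": match.group("extension"),
    }
```

For example, parse_unload_name("unload_1_2_.csv") yields the prefix unload, stream ID 1, part number 2, and extension csv. Grouping files by stream_id and sorting by part_number restores the order in which each stream was written.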
Note: By default, unloaded GZIP compressed files do not have a file extension (such as .csv or .txt) when you unzip them. If necessary, you can use the gunzip command with the -c option to unzip and rename each file. For example:
$ gunzip -c unload_1_1_.gz > unload_1_1_.txt
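When an unload produces many .gz files, the same unzip-and-rename step can be scripted. The sketch below uses only the Python standard library and mirrors the behavior of gunzip -c in that the original .gz files are left in place. The function name gunzip_with_extension and the choice of .txt as the default extension are illustrative assumptions.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_with_extension(directory, extension=".txt"):
    """Decompress every *.gz file in `directory` next to the original,
    replacing the .gz suffix with `extension`. Returns the files written."""
    written = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        # unload_1_1_.gz becomes unload_1_1_.txt (for extension=".txt")
        target = gz_path.with_suffix(extension)
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # copies in chunks, so large files are fine
        written.append(target)
    return written
```

Calling gunzip_with_extension("/path/to/unload/dir", ".csv") would write one .csv file per .gz file in that directory; the directory path here is a placeholder.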
Number and Size of Output Files
The number and size of files generated to complete the unload depend on the following factors:
- The max_file_size setting for the ybunload command. If a file reaches this limit (by default, 50GB for regular file systems and 60GB for S3), ybunload closes that file and starts writing to a new file. This occurs as many times as necessary to complete the unload. The max_file_size value is not the maximum size of the unload; it is the maximum size of a single file generated by an unload stream from a single worker. Each worker unloads a separate data stream.
- The plan that is generated for the unload query and how many worker nodes have data at the top of the plan. A ybunload plan is the same as a SELECT query plan except that the top of the plan tree has a "data output" node.
- The compression (--compress) option that is chosen for the ybunload command.
Compressed and Uncompressed Files
Unload files are compressed in either "block mode" or "stream mode."
The GZIP_*
compression options operate in block mode, which consolidates output files as much as possible, such that the number of files exported back to the client is no greater than it would be without the use of compression. The size of compressed versus uncompressed files will differ, but the number is the same. Block mode compression (or no compression) results in one file per worker node that has data at the top of the plan tree. If a worker node is not used in the plan at all or produces no data at the top of the plan, it does not contribute a file. Yellowbrick recommends the use of block mode GZIP options in most cases.
The GZIP_STREAM_* options operate in stream mode. Stream mode compression results in potentially more files being exported: typically one file per worker node per core. The GZIP_STREAM_* options are intended to be used only if your downstream workflow tools cannot handle gzip files containing multiple compression blocks. Additionally, the GZIP_STREAM_* options consume significantly more network connections than the GZIP_* options, and many network routers cannot handle the increased number of connections reliably.
Where possible, all workers and CPU cores are used for unload queries. Regardless of the compression option that you use, the following exceptions apply:
- Single-worker queries: Only one worker executes the plan, so only as many files as there are cores on that worker are exported.
- Queries that select only part of a data set: If the unload query constrains a narrow segment of data (such as a date range or a set of low-cardinality values), only a subset of the workers may contain any data to be unloaded, resulting in fewer output files.
- Queries involving sorts that do not specify the --parallel option in the ybunload command. In this case, a final sort on a single worker and a single core will yield only one output file. (To guarantee a single-file unload, make sure the unload query has an ORDER BY clause and do not specify the --parallel option in the command.)
- Queries involving aggregates. The final aggregation phase of the plan is done on one worker, resulting in a reduced number of files, similar to the sort query case.
To summarize, in most cases the number of output files produced by an unload query will be one file per worker that has data at the top of the plan. If you change the max_file_size setting, the unload may generate more files or fewer files. For example, if you are doing a 6TB unload (uncompressed) from a 15-node cluster with perfect data distribution, each blade unloads ~410GB. With the default max_file_size of 50GB, each blade fills 8 files and starts a 9th, so the unload would produce 9 files per blade for a total of 135 files. To reduce the number of output files for the unload to 15, you could specify a larger max_file_size, such as 500GB.
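The arithmetic in this example can be reproduced with a short calculation. This is a rough sketch under stated assumptions: data is perfectly distributed, each worker writes a single stream, and 1TB is treated as 1024GB (which gives the ~410GB per blade quoted above). The function name estimate_file_count is hypothetical and is not part of ybunload.

```python
import math


def estimate_file_count(total_gb, workers, max_file_size_gb=50):
    """Estimate (files per worker, total files) for an uncompressed unload.

    Assumes perfect data distribution and one output stream per worker.
    """
    per_worker_gb = total_gb / workers
    # A new file is started each time the size limit is reached,
    # so a partially filled last file still counts as one file.
    files_per_worker = math.ceil(per_worker_gb / max_file_size_gb)
    return files_per_worker, files_per_worker * workers


# 6TB from a 15-node cluster with the default 50GB limit
print(estimate_file_count(6 * 1024, 15))        # (9, 135)
# Same unload with max_file_size raised to 500GB
print(estimate_file_count(6 * 1024, 15, 500))   # (1, 15)
```

Real unloads may differ from this estimate when data is skewed across workers or when the plan places data on fewer workers, as described in the exceptions above.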
Specific Compression Options
Block mode:
- GZIP and GZIP_FAST are synonyms (for the fastest compression option).
- GZIP_MORE provides better compression, but slower unload performance.
- GZIP_BEST provides the best compression but much slower performance.
Stream mode:
- GZIP_STREAM and GZIP_STREAM_FAST are synonyms (for the fastest compression option).
- GZIP_STREAM_MORE provides better compression, but slower unload performance.
- GZIP_STREAM_BEST provides the best compression but much slower performance.