ybload Advanced Processing Options

The options described in this section are advanced data processing options that you can use to optimize bulk load performance. You can run a specific load operation with default settings and capture statistics for comparison. You may need to try several different configurations before finding the optimal combination of option settings. Before using these options, you need a basic understanding of the phases of a bulk load operation.

Phases of a Bulk Load Operation

A ybload operation consists of five main data-processing phases that all occur on the client machine:
  1. Read (or "slurp"): read raw bytes, separate into rows and fields
  2. Parse: interpret field strings and convert them into a machine-readable binary representation. The parse phase is more resource-intensive and time-consuming than the read phase, so it benefits most from added parallelism.
  3. Encode: combine multiple parsed rows into a binary packet that can be sent to the database
  4. Compress (optional): compress binary packets before sending
  5. Send: send packets over the network

The main performance goal is to maximize the write rate in the final send stage (as reported at the end of every successful ybload operation).

Factors that can limit the write rate include:
  • Speed of network card or other network hardware:
    • A 1Gb network card will never be able to write more than ~120MB/sec
    • Sharing the network with other network-intensive tasks reduces the ybload capacity
  • Complexity of data types in the source files: for example, a file with many TIMESTAMP fields will parse more slowly than a file with mostly integer fields.
  • Complexity of CSV options required to properly load the source file: for example, --allowcontrolascii requires extra processing.
  • Speed of disk where the source files are stored: network disk (such as NFS) is slower than HDD, and HDD is slower than SSD.

To compensate for these factors, ybload operations can be tuned to exercise more cores during the read and parse phases. All of the available cores should be used to their saturation point. The following section explains how to set some options for parallel reading and parsing.

Parallel Reading of Source Files

Reading files in parallel maximizes throughput during a load. A ybload operation can assign parallel reader processes to a set of source files in two different ways:
  • Readers can be spread within source files. This method is known as "striding." Multiple readers start reading the same file from different points in the file, then "leap-frog" each other through the file until the whole file is read. Source files are processed sequentially.
  • Readers can be spread across files, such that multiple readers read individual files separately, from beginning to end (without striding). In this case, the source files are read concurrently, with different readers starting work on different files at the same time.

In some cases, especially when running on very large computers, ybload will use both of these methods to obtain even faster load speeds. The --read-sources-concurrently option can be used to achieve this behavior.

The --read-sources-concurrently option defines a policy for parallel reading of source files, which is a way to control the amount of load concurrency (resulting in potentially faster load speed). However, faster loads may have to be balanced against the need to preserve data locality (resulting in potentially faster query speed).

For example, to maximize query performance, very large tables often require tuning (in the form of effective distribution and/or partitioning choices). You might partition a table by event_date so that queries that filter on that column run faster. If you are doing this kind of tuning, you would also want to load data in sorted order (for example, one file per day with records inside each file sorted by time). If ybload can preserve the sorted order of the data while bulk-inserting rows, the data will cluster well in the new shards on the storage system, in turn preserving fast query performance. However, if ybload were to insert a handful of rows from one date, followed by rows from a different date, and then go back to the first date, the data in the new shards would be mixed and query performance would suffer.

Default ALLOW Setting

When you set the --read-sources-concurrently option, you need to consider whether data locality is an issue; it may be relevant to some workloads or some tables, but not all. Given that ybload cannot know whether data locality or load speed is more important for a given table, the default setting (ALLOW) supports a hybrid approach to concurrency that is based on the following assumptions:
  • Separate files contain data from separate periods; therefore, loading them one at a time will best preserve data locality.
  • Preserving data locality (by reading one source file at a time) takes precedence as long as load performance does not suffer too much. How much performance suffers is a function of the ratio of "slow" files to "fast" files: if more than 50% of the sources are "slow", they will be read concurrently. In this context, "slow" files are:
    • Compressed files: slower to read than uncompressed files because of the CPU cost of decompression
    • Remote files loaded via NFS or from object storage (nfs://, s3://, azure://): often slower to read than local files because of file server load and network overhead.
    • Streamed files (STDIN or named pipes): often slower to read because the data is being generated on the fly by another program.
ALLOW is a good choice when you are not sure about the data you are loading and you want ybload to make a decision for you. The ybload operation will attempt to keep the client system busy by using a combination of concurrent reading across files and parallel reading within files. It will favor sequential file reading in order to preserve data locality, but will allow concurrent file reading in order to keep the client system busy.

Note that the size and number of source files are not factors that you should consider when choosing between concurrent and sequential reading.

Alternative Settings for --read-sources-concurrently

The default ALLOW setting is appropriate for most bulk loads, but the following choices are also available:
  • ALWAYS: always read a set of files concurrently, without striding within files. This is required when ybload is reading from multiple named pipes. In rare cases it can also be used to increase ybload throughput when ALLOW is not aggressive enough in the decisions it makes. This option maximizes load speed (at the cost of data locality and future query speed). ybload will automatically choose the actual number of sources to read concurrently (in an attempt to maximize load speed). For example, even if you have 1,000 source files, ybload may choose to concurrently read from only 3 of them at a time.
  • <NUMBER>: similar behavior to ALWAYS, but the user specifies how many of the source files to read concurrently.
    Note: This option is an advanced feature that should only be used to fine-tune load speed. The proper number to specify should be determined from test runs with different numbers. More is not always better, even when the goal is to maximize load speed.

    If ybload is sharing a computer with other CPU-intensive or disk-intensive programs, you might even use this mechanism to reduce the read concurrency used by ybload and leave some disk/CPU resources for other programs to consume.

  • NEVER: never read a set of files concurrently; use striding to finish reading each file in sequence. This option is a good choice when temporal locality of the data is important. If you have source files that are pre-sorted by a time-based column and the table you are loading has a SORT constraint defined on that column, reading files sequentially one at a time (while allocating parallel readers to work within each file) will maintain the order of the data and provide better load performance. Data locality and future read speed are maximized at the cost of load speed.
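The three settings might be applied as follows. These sketches are illustrative only: the file names are hypothetical, and the table and connection options (shown here as "...") vary by environment and are omitted.

```shell
# Pre-sorted daily files loading into a table sorted on the time column:
# read files one at a time to preserve data locality
ybload ... --read-sources-concurrently NEVER day1.csv day2.csv day3.csv

# Multiple named pipes: concurrent reading is required
ybload ... --read-sources-concurrently ALWAYS pipe1 pipe2 pipe3

# Fine-tuning after test runs: read exactly 4 source files at a time
ybload ... --read-sources-concurrently 4 part*.csv
```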

Number of Readers and Parsers per Reader

In the context of a ybload operation, the number of readers (--num-readers) defines the degree of parallel reading on the client system. In the striding case, this means the number of parallel readers per file; in the non-striding case, it means the number of readers available to process a set of files concurrently.

You can increase parallelism during the parse phase by setting the number of parsers per reader (--num-parsers-per-reader). The number of readers and number of parsers per reader both affect the shape of the pipeline: how the data is consumed from the source and processed before being sent over the network:

  • If both options are set to 1 (--num-readers 1 --num-parsers-per-reader 1), the shape looks like this:
    READ -> PARSE -> ENCODE -> COMPRESS -> SEND
    Parallel reading is disabled in this case.
  • If you increase the number of parsers (--num-readers 1 --num-parsers-per-reader 2), the shape changes as follows:
          /-> PARSE -> ENCODE -> COMPRESS -\
    READ -                                  -> SEND
          \-> PARSE -> ENCODE -> COMPRESS -/
    The parse phase runs in parallel.
  • If you also increase the number of readers (--num-readers 2 --num-parsers-per-reader 2), the shape changes as follows:
          /-> PARSE -> ENCODE -> COMPRESS -\
    READ -                                  --\
          \-> PARSE -> ENCODE -> COMPRESS -/   \
                                                --> SEND
          /-> PARSE -> ENCODE -> COMPRESS -\   /
    READ -                                  --/
          \-> PARSE -> ENCODE -> COMPRESS -/
    The read and parsing phases both run in parallel.
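For example, the last pipeline shape above (two parallel readers, each feeding two parsers) would be requested as follows. The file name is hypothetical, and the table and connection options (shown here as "...") are omitted.

```shell
# Two readers striding through the file, two parsers per reader
ybload ... --num-readers 2 --num-parsers-per-reader 2 source.csv
```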

Best Practices for Optimizing Parallel Reading

By default, --num-readers is set to the number of cores divided by 2. For example, on a client system with 8 cores, the default number of readers is 4, and on a system with 16 cores, the default is 8. (To disable parallel reading, set --num-readers to 1.) By default, --num-parsers-per-reader is set to 2.
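The default calculation can be sketched as follows. This is an illustration of the documented rule (cores divided by 2), not ybload's actual implementation; the minimum of 1 is an assumption for clients with a single core.

```python
def default_num_readers(cores: int) -> int:
    """Documented default for --num-readers: client cores / 2.

    The floor of 1 is an assumption, since --num-readers 1 is the
    documented way to disable parallel reading.
    """
    return max(1, cores // 2)

print(default_num_readers(8))   # 8-core client
print(default_num_readers(16))  # 16-core client
```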

You need to determine the best settings for these options per table that you load, given the variation in size (of the table and its source files) and the data types of the columns. If multiple clients are being used, you also need to consider any variations in the client systems themselves. Additionally, you need to revisit the load configuration whenever a client system changes: for example, after upgrading a client machine from 8 cores with a 1Gb network card to 18 cores with a 10Gb network.

Follow these steps to find the optimal strategy for setting the parallel reading options:
  1. Run a load of a moderately sized file with the following options:
    --num-readers 1 --num-parsers-per-reader 2
    Run the load through to completion, for at least 3 to 5 minutes, to make sure you capture reasonably accurate timings. Watch (and record) the write rate as reported by ybload.
  2. Re-run the same load with the same file, incrementing --num-parsers-per-reader. Check the write rate again.
  3. Repeat the same load with larger (or smaller) --num-parsers-per-reader settings until you identify the value that returns the highest write rate.
  4. Repeat the same load with an incremented value for --num-readers. Calculate this incremented value as follows:
    Number of cores / chosen number of parsers per reader
    Round down the result. For example:
    16 cores / 4 parsers per reader = 4
    18 cores / 5 parsers per reader = 3
    Record the write rate.
  5. If you are already at maximum network capacity, reduce the number of readers to its lowest number while maintaining that maximum speed. If you have not reached maximum network capacity, increase the number of readers until you reach the maximum or do not see any improvement. Choose the largest --num-readers value that showed improvement over the next smaller one.
  6. Optionally, run the load two more times, using:
    • Number of readers from the previous step and parsers per reader +1
    • Number of readers from the previous step and parsers per reader -1
    Pick the combination that returns the best write rate. Sometimes the best parsers per reader number will be slightly different when readers=1 versus when readers=the value chosen in this procedure.
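The step 4 arithmetic can be sketched as follows: the incremented --num-readers value is the client core count divided by the chosen parsers-per-reader value, rounded down.

```python
def readers_for(cores: int, parsers_per_reader: int) -> int:
    """Step 4: number of cores / chosen parsers per reader, rounded down."""
    return cores // parsers_per_reader  # floor division rounds down

print(readers_for(16, 4))  # 16 cores / 4 parsers per reader
print(readers_for(18, 5))  # 18 cores / 5 parsers per reader
```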

Setting a Compression Policy

The --compression-policy option defines how the data buffers are formatted on the client before being sent over the network to the worker nodes. This option has three settings:
  • AUTO: the default; if sockets are busy on the target system, compress the data. This is the optimal setting for most bulk loads.
  • ALWAYS: always compress the data before sending it. This setting is sometimes a good choice when other applications require network capacity (as well as ybload), and potentially slowing down bulk loads is not an issue.
    Note: Do not use the ALWAYS setting for data protection. Some packets may still be sent without compression (for example, in the rare case that the compressed packet is larger than the uncompressed packet). Also, LZ4 compression, which is the algorithm used, is very easy to reverse and cannot be used to protect the data. To protect the data, run ybload with the --secured option.
  • NEVER: never compress the data before sending. This option is rarely used but may be applicable if compression does not reduce network bandwidth when loading a particular table.
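For example (illustrative only; the file name is hypothetical, and the table and connection options, shown here as "...", are omitted):

```shell
# Default behavior: compress only when sockets on the target system are busy
ybload ... --compression-policy AUTO source.csv

# Network shared with other applications: always compress before sending
ybload ... --compression-policy ALWAYS source.csv
```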