Running a Bulk Load

This section summarizes the steps required to run a bulk load into a Yellowbrick table. For more details, see ybload Command and ybload Options.

To run a bulk load:

  1. Make sure the target Yellowbrick database is up and running and the ybload client is installed (as part of the ybtools package).

  2. Create the table in the database before running the load operation. Use the CREATE TABLE statement (or CREATE TABLE AS or SELECT INTO).

  3. Determine the exact location and format of the source file (or set of source files) for the load. The following source types are supported:

  • Files local to the client system where you are running ybload
  • stdin
  • Named pipes
  • Amazon Simple Storage Service (Amazon S3) files. See Loading from Amazon S3.
  • Azure Blob object stores. See Loading from Azure Blob Storage.

    Note: Try to export the data from the source system in the character set that you used when you created the Yellowbrick database. For example, load exported UTF8 data into a UTF8 database. Any required character set translation is likely to have an impact on load performance.

  4. From a client system where the ybtools clients are installed, run the ybload command as a user account that is not a superuser. Superuser accounts may not have sufficient memory to run bulk load operations.

You must define a few essential pieces of information. On an as-needed basis, you can also specify a range of other processing, logging, and parsing options.

ybload [options] -t [SCHEMA.]TABLENAME SOURCE [SOURCE]...
  • [options]: You must set database connection values (such as the database name and user) if they are not already set with environment variables. Other options depend on the nature of the data, the type of load operation you want to run, and related formatting requirements. Consider using the --dryrun option for a new load operation; it helps you determine the set of options you need to load the table successfully (see the sample invocation after the note below). If you want to delete, update, or upsert rows, specify the --write-op option. For more details, see ybload Options or the online help text.

    Important: Setting the --format option is strongly recommended for all load operations.
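    For example, a trial run against a new CSV source might look like the following sketch; the database, user, table, and file names are placeholders, and CSV input is assumed:

    ybload -d premdb --username bobr -t match --format csv --dryrun /tmp/match.csv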

  • TABLENAME: the target table, including its schema name (optional but recommended). If you do not specify the schema name, the table is assumed to be in the public schema. The schema of the target table is not based on the user's search_path, regardless of how it is set.

    Important: The user who runs the load must have BULK LOAD permissions on the database and INSERT permissions on the table (but not ownership of the table).
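    For example, to load into a table outside the public schema, qualify the table name on the command line (the schema, table, and file names here are placeholders):

    ybload -d premdb --username bobr -t stats.match --format csv /tmp/match.csv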

  • SOURCE: one or more source files, with their paths (local or on a supported object storage site). Each source file should be a flat file with a consistent field delimiter (comma-separated or tab-delimited, for example) or a parquet file. Use spaces to separate multiple file references. You can also use wildcard characters (such as *) to refer to multiple files or a whole directory of files. Single quotes around source file names are optional.

    For flat files, compressed gzip, bzip2, xz, pack200, and lz4 formats may be detected based on their file extensions:

    • .gz = gzip
    • .bzip2 or .bz2 = bzip2
    • .xz = xz
    • .pack200 = pack200
    • .lz4 = lz4

    ybload can also detect compression headers within regular compressed files, but the file extension mapping overrides this header detection. The --source-compression option, if specified, overrides both of these detection methods.

    Note: lz4 compression is not as compact as gzip but takes less time to both compress and decompress.

    For parquet files, compression is detected automatically based on metadata in the files.

    At a minimum, each source file should contain fields for every non-nullable column in the table. Source files do not have to match the target table exactly: fields do not need to map one-to-one to columns or appear in the same order, and options are available to work around these mismatches. If the fields in your source file do not align with the columns in the target table, use either --field-names or --parse-header-line to map fields to columns, as shown in the example below.

    Note: To load columns with DEFAULT values, use the --field-names option and omit those columns from the list. See ybload Options.

    Empty source files do not prevent bulk loads from proceeding. If ybload encounters an empty source file, the load operation skips to the next file, if any.
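    For example, the following sketch loads every CSV file in a directory and maps three source fields to columns by name; the column names and paths are placeholders, and a comma-separated list for --field-names is assumed:

    ybload -d premdb --username bobr -t match --format csv --field-names seasonid,htid,atid /tmp/loads/match*.csv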

  • To load from stdin (standard input, an unnamed pipe), enter the single-dash character (-) at the end of the ybload command. This must be the last argument on the command line, and it must be preceded by -- to indicate that option parsing is complete. For example: ./ybload -d premdb --username bobr -t match -- -
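    For instance, the following sketch decompresses a file and streams it to ybload through an unnamed pipe (the file and table names are placeholders, and CSV input is assumed):

    gunzip -c /tmp/match.csv.gz | ybload -d premdb --username bobr -t match --format csv -- -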

  • If your workflow involves multiple processes communicating through named pipes, ybload can consume directly from your producer processes. Specify the named pipes on the ybload command line in the same way that you would name regular files. Mixing pipe and non-pipe data sources in a single load is not recommended; it could cause a deadlock between ybload and the programs writing to the pipes.
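    As a sketch, a producer process could write to a named pipe while ybload reads from it; the pipe path, producer command (extract_matches.sh), and table name are hypothetical:

    mkfifo /tmp/match_pipe
    extract_matches.sh > /tmp/match_pipe &
    ybload -d premdb --username bobr -t match --format csv /tmp/match_pipe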