Appearance
Running a Bulk Load
This section summarizes the steps required to run a bulk load into a Yellowbrick table. For more details, see ybload Command and ybload Options.
To run a bulk load:
Make sure the target Yellowbrick database is up and running and the
ybload
client is installed (as part of theybtools
package).Create the table in the database before running the load operation. Use the CREATE TABLE statement (or CREATE TABLE AS or SELECT INTO).
Determine the exact location and format of the source file (or set of source files) for the load. The following source types are supported:
- Files local to the client system where you are running
ybload
stdin
- Named pipes
- Amazon Simple Storage Service (Amazon S3) files. See Loading from Amazon S3.
- Azure Blob object stores. See Loading from Azure Blob Storage. Note: Try to export the data from the source system in the character set that you used when you created the Yellowbrick database. For example, load exported UTF8 data into a UTF8 database. Any required character set translation is likely to have an impact on import performance during the load.
- From a client system where the
ybtools
clients are installed, run theybload
command, as a user account that is not a superuser. Superuser accounts may not have sufficient memory to run bulk load operations.
You must define a few essential pieces of information. On an as-needed basis, you can also specify a range of other processing, logging, and parsing options.
ybload [options] -t [SCHEMA.]TABLENAME SOURCE [SOURCE]...
[options]
: You will need to set some database connection values if they are not already set with environment variables. Other options depend on the nature of the data, the type of load operation you want to run, and related formatting requirements. Consider using the--dryrun
option for a new load operation; this will help you determine the set of options you need in order to load the table successfully. If you want to delete, update, or upsert rows, specify the--write-op
option. For more details, see ybload Options or the online help text.Important: Setting the
--format
option is strongly recommended for all load operations.TABLENAME
: the target table, including its schema name (optional but recommended). If you do not specify the schema name, the table is assumed to be in thepublic
schema. The schema of the target table is not based on the user'ssearch_path
, regardless of how it is set.Important: The user who runs the load must have
BULK LOAD
permissions on the database and INSERT permissions on the table (but not ownership of the table).SOURCEFILE
: one or more source files and their path (locally or on a supported object storage site). Each source file should be a flat file with a consistent field delimiter (comma-separated or tab-delimited, for example) or aparquet
file. Use spaces to separate multiple file references. You can also use wildcard characters (such as*
) to refer to multiple files or a whole directory of files. Single quotes around source file names are optional.For flat files, compressed
gzip
,bzip2
,xz
,pack200
, andlz4
formats may be detected based on their file extensions:.gz
=gzip
.bzip2
or.bz2
=bzip
.xz
=xz
.pack200
=pack200
.lz4
=lz4
ybload
can also detect compression headers within regular compressed files, but the file extension mapping overrides this header detection. The--source-compression
option, if specified, overrides both of these detection methods.
Note:
lz4
compression is not as compact asgzip
but takes less time to both compress and decompress.For
parquet
files, compression is detected automatically based on metadata in the files.At a minimum, each source file should contain fields for every non-nullable column in the table. Files do not have to match the target table exactly, one to one, in terms of fields mapping to columns and the order of the fields. Options are available to work around these mismatches. If the fields in your source file do not align with the columns in the target table, use either
--field-names
or--parse-header-line
to map fields to columns.Note: To load columns with
DEFAULT
values, use the--field-names
option and omit those columns from the list. See ybload Options.Empty source files do not prevent bulk loads from proceeding. If
ybload
encounters an empty source file, the load operation skips to the next file, if any.To load from
stdin
(standard input, an unnamed pipe), enter the single-dash character (-) at the end of theybload
command. This must be the last character on the command line, and it must be prefixed with -- to indicate that option parsing is complete. For example:./ybload -d premdb --username bobr -t match``--
-If your workflow involves multiple processes communicating through named pipes,
ybload
can consume directly from your producer processes. Simply specify the named pipes on theybload
command line in the same way that you would name regular files. Mixing pipe and non-pipe data sources for a load is not recommended; this could cause a deadlock with the program that the pipes communicate with.