Appearance
Running a Bulk Load
This section summarizes the steps required to run a bulk load into a Yellowbrick table. For more details, see ybload Command and ybload Options.
To run a bulk load:
Make sure the target Yellowbrick database is up and running and the
ybload
client is installed (as part of theybtools
package).Create the table in the database before running the load operation. Use the CREATE TABLE statement (or CREATE TABLE AS or SELECT INTO).
Determine the exact location and format of the source file (or set of source files) for the load. The following source types are supported:
- Files local to the client system where you are running
ybload
stdin
- Named pipes
- Amazon Simple Storage Service (Amazon S3) files (See Loading from Amazon S3.) Note: Try to export the data from the source system in the character set that you used when you created the Yellowbrick database. For example, load exported UTF8 data into a UTF8 database. Any required character set translation is likely to have an impact on import performance during the load.
- From a client system where the
ybtools
clients are installed, run theybload
command, as a user account that is not a superuser. For example, the defaultyellowbrick
user is a superuser. Superuser accounts may not have sufficient memory to run bulk load operations.
You must define a few essential pieces of information. On an as-needed basis, you can also specify a range of other processing, logging, and parsing options.
ybload [options] -t [SCHEMA.]TABLENAME SOURCE [SOURCE]...
[options]
: You will need to set some database connection values if they are not already set with environment variables. Other options depend on the nature of the data and related formatting requirements for the load. Consider using the--dryrun
option for a new load operation; this will help you determine the options you need in order to load the table successfully. For details, see ybload Options or the online help text.Important: Setting the
--format
option is strongly recommended for all load operations.TABLENAME
: the target table, including its schema name (optional but recommended). If you do not specify the schema name, the table is assumed to be in thepublic
schema. The schema of the target table is not based on the user'ssearch_path
, regardless of how it is set.Important: The user who runs the load must have INSERT permissions on the table (but does not have to own the table).
SOURCEFILE
: one or more source files and their path (locally or on a supported object storage site). Each source file should be a flat file with a consistent field delimiter (comma-separated or tab-delimited, for example). Use spaces to separate multiple file references. You can also use wildcard characters (such as*
) to refer to multiple files or a whole directory of files. Single quotes around source file names are optional.Compressed
gzip
,bzip2
,xz
,pack200
, andlz4
formats may be detected based on their file extensions:.gz
=gzip
.bzip2
or.bz2
=bzip
.xz
=xz
.pack200
=pack200
.lz4
=lz4
ybload
can also detect compression headers within regular compressed files, but the file extension mapping overrides this header detection. The--source-compression
option, if specified, overrides both of these detection methods.
Note:
lz4
compression is not as compact asgzip
but takes less time to both compress and decompress.At a minimum, each source file should contain fields for every non-nullable column in the table. Files do not have to match the target table exactly, one to one, in terms of fields mapping to columns and the order of the fields. Options are available to work around these mismatches. If the fields in your source file do not align with the columns in the target table, use either
--field-names
or--parse-header-line
to map fields to columns.Note: To load columns with
DEFAULT
values, use the--field-names
option and omit those columns from the list. See ybload Options.Empty source files do not prevent bulk loads from proceeding. If
ybload
encounters an empty source file, the load operation skips to the next file, if any.To load from
stdin
(standard input, an unnamed pipe), enter the single-dash character (-) at the end of theybload
command. This must be the last character on the command line, and it must be prefixed with -- to indicate that option parsing is complete. For example:./ybload -d premdb --username bobr -t match``--
-If your workflow involves multiple processes communicating through named pipes,
ybload
can consume directly from your producer processes. Simply specify the named pipes on theybload
command line in the same way that you would name regular files. Mixing pipe and non-pipe data sources for a load is not recommended; this could cause a deadlock with the program that the pipes communicate with.
In This Section
Parent topic:Bulk Loading Tables