Appearance
Setting --format for Flat Files
The --format
option is a mechanism for specifying the expected format of ybload
source files, given some assumptions about the standard behavior of the export program that produced those files. Specifying the --format
option is always recommended for flat files, but the setting you choose must be based on knowledge of the data and the export program that produced the data.
Note: This section does not apply to structured Parquet files; to load these files, always specify --format parquet
in the ybload
command. Auto-detection of Parquet files is not supported.
In general, ybload
tries to import raw data intelligently, given some assumptions about which characters need to be read in exactly as they are and which characters need to be interpreted as having a special function. These special characters often have to be removed from the source rather than loaded, in a sense undoing what the exporter had done to prepare the data.
In particular, the --format
option defines the expected layout of flat files with respect to line separators and field delimiters. Line separators mark the ends of rows, and field delimiters mark the separators between columns. If ybload
does not handle these separators and delimiters correctly, the load may fail or reject rows. One of the main problems that choosing the correct format can solve is the handling of embedded delimiter characters, where the delimiter character appears inside actual data values and must be loaded, not removed. A typical example is a comma that appears, on purpose, within character strings in a CSV file.
The --format
option has three flat-file settings:
CSV
: comma-separated values file. This format assumes that embedded delimiters are protected by the use of quotes to wrap entire fields. When aCSV
file is loaded, quotes that surround field values will be removed, not loaded.
For example, the string "Citizens, or Blues"
can be loaded as Citizens, or Blues
into the nickname
column of the team
table, despite the fact that the field delimiter is a comma.
TEXT
: flat text file. This format assumes that delimiters in field values were protected by being preceded with a backslash (\
). When aTEXT
file is loaded, backslash characters that precede delimiter characters will be removed, not loaded.
For example, if the field delimiter is a comma, the string Citizens\, or Blues
can be loaded into the nickname
column of the team
table as Citizens, or Blues
.
BCP
: as exported from the Microsoft SQL Serverbcp
utility. This format assumes that delimiters in fields were not protected; therefore, the incoming data must not have embedded delimiters. The incoming data is read exactly as it is.
Defaults for Other Options When --format Is Set
When the --format
option is set, other ybload
options have specific default values, as shown in the following table. Regardless of the --format
setting, and even if --format
is not specified at all, ybload
tries to automatically detect the --delimiter
and --linesep
characters. If the attempt at auto-detection fails, the load continues but uses fallback characters, which may or may not suit the incoming data.
Option | CSV | TEXT | BCP |
---|---|---|---|
--delimiter | , | \t | , |
--linesep | LF | LF | CRLF |
--quote-char | " | N/A | N/A |
--escape-char | " | \ | N/A |
--nullmarker | empty/none | \N (two characters) | empty/none |
--emptymarker | empty/none | empty/none | \0 (1 character) |
--convert-c-escape | false | true | false |
See also ybload Options.
Notes on --format TEXT Escape Sequences
When you use the TEXT
option, ybload
does not recognize all escape sequences. For example, the string abc\wdef
is loaded literally as abc\wdef
, not respecting \w
as an escape sequence. (Note that the equivalent ybsql \copy from
command with the format text
option does recognize this escape sequence, and others like it.)
The --format TEXT
option recognizes only the following escape sequences:
escape_char escape_char = escape_char
For example: \\
escape_char delimiter = delimiter
For example: \|
escape_char linesep = linesep
For example: \\n
The ybload
escape character defaults to the backslash character (\
), but you can change it by setting the --escape-char
option.
When you use the TEXT
option, ybload
defaults to using the --convert-c-escape
option. However, this option recognizes and converts only the following escape sequences:
\\
: backslash\b
: backspace\f
: formfeed\n
: newline\r
: carriage return\t
: tab\v
: vertical tab
All other escape sequences are not converted.
Note: If you change the escape character (--escape-char
) and use --format TEXT
, the --convert-c-escape
option still only looks for backslash \
sequences.