Setting --format for Flat Files
--format
option is a mechanism for
specifying the expected format of ybload
source files, given some assumptions
about the standard behavior of the export program that produced those files. Specifying the
--format
option is always recommended for flat files, but the setting you
choose must be based on knowledge of the data and the export program that produced the data.
--format parquet
in the ybload
command.
Auto-detection of Parquet files is not supported. ybload
tries to import raw data
intelligently, given some assumptions about which characters need to be read in exactly as
they are and which characters need to be interpreted as having a special function. These
special characters often have to be removed from the source rather than loaded, in a sense
undoing what the exporter had done to prepare the data. --format
option defines the expected layout of flat
files with respect to line separators and field delimiters. Line separators mark the ends of
rows, and field delimiters mark the separators between columns. If ybload
does not handle these separators and delimiters correctly, the load may fail or reject rows.
One of the main problems that choosing the correct format can solve is the handling of
embedded delimiter characters, where the delimiter character appears inside actual data values
and must be loaded, not removed. A typical example is a comma that appears, on purpose, within
character strings in a CSV file. The --format
option has three flat-file
settings:
CSV
: comma-separated values file. This format assumes that embedded delimiters are protected by the use of quotes to wrap entire fields. When aCSV
file is loaded, quotes that surround field values will be removed, not loaded.For example, the string
"Citizens, or Blues"
can be loaded asCitizens, or Blues
into thenickname
column of theteam
table, despite the fact that the field delimiter is a comma.TEXT
: flat text file. This format assumes that delimiters in field values were protected by being preceded with a backslash (\
). When aTEXT
file is loaded, backslash characters that precede delimiter characters will be removed, not loaded.For example, if the field delimiter is a comma, the string
Citizens\, or Blues
can be loaded into thenickname
column of theteam
table asCitizens, or Blues
.BCP
: as exported from the Microsoft SQL Serverbcp
utility. This format assumes that delimiters in fields were not protected; therefore, the incoming data must not have embedded delimiters. The incoming data is read exactly as it is.
Defaults for Other Options When --format Is Set
When the --format
option is set, other ybload
options
have specific default values, as shown in the following table. Regardless of the
--format
setting, and even if --format
is not specified
at all, ybload
tries to automatically detect the
--delimiter
and --linesep
characters. If the attempt at
auto-detection fails, the load continues but uses fallback characters, which may or may not
suit the incoming data.
Option | CSV | TEXT | BCP |
---|---|---|---|
--delimiter |
, |
\t |
, |
--linesep |
LF |
LF |
CRLF |
--quote-char |
" |
N/A | N/A |
--escape-char |
" |
\ |
N/A |
--nullmarker |
empty/none | \N (two characters) |
empty/none |
--emptymarker |
empty/none | empty/none | \0 (1 character) |
--convert-c-escape |
false |
true |
false |
See also ybload Options.
Notes on --format TEXT Escape Sequences
When you use the TEXT
option, ybload
does not recognize
all escape sequences. For example, the string abc\wdef
is loaded literally
as abc\wdef
, not respecting \w
as an escape sequence.
(Note that the equivalent ybsql \copy from
command with the format
text
option does recognize this escape sequence, and others like it.)
The --format TEXT
option recognizes only the following escape sequences:
escape_char escape_char = escape_char
For example:
\\
escape_char delimiter = delimiter
For example:
\|
escape_char linesep = linesep
For example:
\\n
ybload
escape character defaults to the backslash character
(\
), but you can change it by setting the --escape-char
option. TEXT
option, ybload
defaults to using
the --convert-c-escape
option. However, this option recognizes and converts
only the following escape sequences: \\
: backslash\b
: backspace\f
: formfeed\n
: newline\r
: carriage return\t
: tab\v
: vertical tab
--escape-char
) and use --format TEXT
, the
--convert-c-escape
option still only looks for backslash
\
sequences.