Setting --format for Flat Files

The --format option is a mechanism for specifying the expected format of ybload source files, given some assumptions about the standard behavior of the export program that produced those files. Specifying the --format option is always recommended for flat files, but the setting you choose must be based on knowledge of the data and the export program that produced the data.
Note: This section does not apply to structured Parquet files; to load these files, always specify --format parquet in the ybload command. Auto-detection of Parquet files is not supported.
In general, ybload tries to import raw data intelligently, given some assumptions about which characters need to be read in exactly as they are and which characters need to be interpreted as having a special function. These special characters often have to be removed from the source rather than loaded, in a sense undoing what the exporter had done to prepare the data.
In particular, the --format option defines the expected layout of flat files with respect to line separators and field delimiters. Line separators mark the ends of rows, and field delimiters mark the separators between columns. If ybload does not handle these separators and delimiters correctly, the load may fail or reject rows. One of the main problems that choosing the correct format can solve is the handling of embedded delimiter characters, where the delimiter character appears inside actual data values and must be loaded, not removed. A typical example is a comma that appears, on purpose, within character strings in a CSV file.

The --format option has three flat-file settings:

  • CSV: comma-separated values file. This format assumes that embedded delimiters are protected by the use of quotes to wrap entire fields. When a CSV file is loaded, quotes that surround field values will be removed, not loaded.

    For example, the string "Citizens, or Blues" can be loaded as Citizens, or Blues into the nickname column of the team table, despite the fact that the field delimiter is a comma.

  • TEXT: flat text file. This format assumes that delimiters in field values were protected by being preceded with a backslash (\). When a TEXT file is loaded, backslash characters that precede delimiter characters will be removed, not loaded.

    For example, if the field delimiter is a comma, the string Citizens\, or Blues can be loaded into the nickname column of the team table as Citizens, or Blues.

  • BCP: as exported from the Microsoft SQL Server bcp utility. This format assumes that delimiters in fields were not protected; therefore, the incoming data must not have embedded delimiters. The incoming data is read exactly as it is.

Defaults for Other Options When --format Is Set

When the --format option is set, other ybload options have specific default values, as shown in the following table. Regardless of the --format setting, and even if --format is not specified at all, ybload tries to automatically detect the --delimiter and --linesep characters. If the attempt at auto-detection fails, the load continues but uses fallback characters, which may or may not suit the incoming data.

--delimiter , \t ,
--linesep LF LF CRLF
--quote-char " N/A N/A
--escape-char " \ N/A
--nullmarker empty/none \N (two characters) empty/none
--emptymarker empty/none empty/none \0 (1 character)
--convert-c-escape false true false

See also ybload Options.

Notes on --format TEXT Escape Sequences

When you use the TEXT option, ybload does not recognize all escape sequences. For example, the string abc\wdef is loaded literally as abc\wdef, not respecting \w as an escape sequence. (Note that the equivalent ybsql \copy from command with the format text option does recognize this escape sequence, and others like it.)

The --format TEXT option recognizes only the following escape sequences:

  • escape_char escape_char = escape_char

    For example: \\

  • escape_char delimiter = delimiter

    For example: \|

  • escape_char linesep = linesep

    For example: \\n

The ybload escape character defaults to the backslash character (\), but you can change it by setting the --escape-char option.
When you use the TEXT option, ybload defaults to using the --convert-c-escape option. However, this option recognizes and converts only the following escape sequences:
  • \\: backslash
  • \b: backspace
  • \f: formfeed
  • \n: newline
  • \r: carriage return
  • \t: tab
  • \v: vertical tab
All other escape sequences are not converted.
Note: If you change the escape character (--escape-char) and use --format TEXT, the --convert-c-escape option still only looks for backslash \ sequences.