Spark Application Options

The options listed in this section are Spark application settings that you can use when you submit a Spark job that exports data to a Yellowbrick database. Some of these settings are required.

Required Settings

Every Spark job that exports data to Yellowbrick must define the format and location of the incoming data and connectivity settings for both the Yellowbrick database and the ybrelay service:

--export-directory STRING

The data to export: a single directory, a single file reference, a comma-delimited list of directories and files, or, if the argument begins with list:, a reference to a file listing. For example:

--export-directory file:///opt/s3/data/data/People.csv

--export-directory list:/home/ybtests/ybrelay/spark/conf/export_file

--format

The format of the incoming data. The default is parquet. --format is pluggable: formats such as avro and xml can be loaded by providing an external package. For example: --format com.databricks.spark.avro

The supported formats are:

Other Options

--bad-column-names

A subset of columns for recording in the .badrows file that the ybload operation produces. For example:

--bad-column-names event_id,YBLOAD_ERROR_COLUMN,YBLOAD_ERROR_REASON

The pseudo-columns YBLOAD_ERROR_COLUMN and YBLOAD_ERROR_REASON can be supplied to record the column containing an error and the reason, respectively.

--buffer-size

Buffer size for network packets to relay, in bytes (default: 2097152).

--cacert STRING

Customize trust with secured communication; use this option in combination with the --secured option. Enter the file name of a custom PEM-encoded certificate or the file name and password for a Java KeyStore (JKS).

For PEM format, the file must be named with a .pem, .cert, .cer, .crt, or .key extension. For example:

--cacert cacert.pem

For JKS format, files are always password-protected. Use the following format:

--cacert yellowbrick.jks:changeit

where the : character separates the file name and the password.
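
The two accepted forms can be told apart mechanically. The following is a minimal sketch of that distinction, not the tool's actual parsing logic. The helper name parse_cacert is hypothetical, and the sketch assumes that the first : separates the JKS file name from its password and that any value without one of the PEM extensions listed above is a JKS reference.

```python
from typing import NamedTuple, Optional

# Extensions listed above for PEM-encoded certificates.
PEM_EXTENSIONS = (".pem", ".cert", ".cer", ".crt", ".key")


class CaCert(NamedTuple):
    kind: str                 # "pem" or "jks"
    path: str
    password: Optional[str]   # only set for JKS


def parse_cacert(value: str) -> CaCert:
    """Classify a --cacert argument (hypothetical helper, for illustration)."""
    if value.lower().endswith(PEM_EXTENSIONS):
        return CaCert("pem", value, None)
    # Assumption: the first ':' separates the JKS file name from its password.
    path, sep, password = value.partition(":")
    if not sep or not password:
        raise ValueError("JKS form must be <file>:<password>")
    return CaCert("jks", path, password)


print(parse_cacert("cacert.pem"))
print(parse_cacert("yellowbrick.jks:changeit"))
```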

--column-names

Optionally, a list of the column names in the destination table. (If not specified, they are discovered when the Spark job connects to the target database.)

--computed-columns FILE | STRING

Specify one or more columns in expression form or use a properties file that contains expressions. The expression language is MVEL. This option is useful for including job-specific or task-specific context for the target table.

The following variables are supplied to each invocation of a computed column value:

For example:

two_columns=row.get(row.fieldIndex('one')) + '|' + row.get(row.fieldIndex('two'))
task_partition_id=taskContext.partitionId()
task_stage_id=taskContext.stageId()

--connection-timeout-seconds SECONDS

Communications timeout for connections, specified in seconds. Default: 120.

--dbname, -d

Name of the destination database (equivalent to the YBDATABASE environment variable). Default: yellowbrick

--disable-trust, -k

Disable SSL/TLS trust when using secured communications. Trust is enabled by default. See also Verifying SSL/TLS Encryption.

Important: This option is not supported for use on production systems and is only recommended for testing purposes. It may be useful to disable trust during testing, then enable it when a formal signed certificate is installed on the system.

--filter EXPRESSION

A valid Spark SQL filter expression to constrain the data. A filter expression is like a SQL WHERE clause but does not use the WHERE keyword.

For example, filter a source ORC file on _col1:

--filter " _col1 between 18 and 20 "

--help

Show usage/help. Default: false.

--host, -h

Destination server host name. Default: localhost

--jdbc-url STRING

JDBC URL to connect to.

--jdbc-properties STRING

JDBC properties file, containing configuration parameters for the external database.

--jdbc-table STRING

JDBC table to load.

--load-option

An option that can be passed to the ybload process. See ybload Options for possible options and values. You can specify --load-option multiple times in the command, once per load option. The following ybload options cannot be specified: --csv-delimiter, --csv-quote, --csv-quote-escape, --linesep, --nullmarker.

For example:

--load-option "--on-missing-field SUPPLYNULL"

--log-level STRING

Specify the export log level. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--logs STRING

Specify a directory where ybrelay logs are written, including the summary produced at the end of the ybload job.

--map FILE

Specify a column mapping file that contains simple column mappings (destination column=source column). For example, specify --map column-mappings.txt, where column-mappings.txt is a Java properties-style file with:

destination_column_one=source_column_one

In this case, a column called destination_column_one would be given input data from source_column_one.
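
To check a mapping file before submitting a job, it can help to read it the same way the destination=source convention describes. The sketch below is an illustration only: parse_column_map is a hypothetical helper, and it handles a simplified subset of the Java properties format (one key=value pair per line, with # or ! comment lines), without line continuations or escape sequences.

```python
from typing import Dict


def parse_column_map(text: str) -> Dict[str, str]:
    """Return {destination_column: source_column} from properties-style text."""
    mapping: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and properties-style comments.
        if not line or line.startswith(("#", "!")):
            continue
        destination, sep, source = line.partition("=")
        if not sep:
            raise ValueError(f"not a destination=source mapping: {line!r}")
        mapping[destination.strip()] = source.strip()
    return mapping


example = """
# column-mappings.txt
destination_column_one=source_column_one
"""
print(parse_column_map(example))
```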

--password, -W

Interactive prompt for the database user's password (equivalent to the YBPASSWORD environment variable). No default.

--port

Destination server port number (equivalent to the YBPORT environment variable). Default: 5432

--pre-sql, --post-sql

Specify a SQL statement that the Spark driver runs before or after the load job. Alternatively, you can reference a file by prefixing the file name with file:. For example, --pre-sql file:prepare-job.sql assumes that the file prepare-job.sql contains a SQL template to be run before the job starts.

Template values can be inserted using the MVEL template syntax. The following template variables are provided:

  • transactionId: the identifier of this load job
  • sparkSession: see Spark session.
  • sparkContext: see Spark context.
  • config: The supplied and parsed configuration for the export. Included values are config.ybTableName, config.ybHost, and so on.

For example:

--pre-sql "insert into job_log(start_time, finish_time, transaction_id) values(current_timestamp, null, '@{transactionId}')" 
--post-sql "update job_log set finish_time = current_timestamp where transaction_id = '@{transactionId}'"

--queue-depth NUMBER

Queue depth for outbound network packets to relay. Default: 16.

--read-options STRING

Specify read options (such as basePath=) in an external properties file and provide the path to the file.

Some export formats require such options to optimize the read process or change its behavior. For instance, with any format you can use basePath to manage the export of a single partition directory structure:

--export-directory /user/hive/warehouse/my_table/part1=foo/part2=bar 
... 
--read-options read-options.properties 

read-options.properties: 
basePath=/user/hive/warehouse/my_table
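
The relationship between the export directory and basePath in this example can be made concrete. The sketch below only mimics the idea behind Spark's partition discovery (directory segments of the form key=value below the base path become partition columns); it is not Spark's implementation, and partition_keys is a hypothetical helper.

```python
import posixpath
from typing import Dict


def partition_keys(export_dir: str, base_path: str) -> Dict[str, str]:
    """Return the key=value path segments that lie below base_path."""
    relative = posixpath.relpath(export_dir, base_path)
    keys: Dict[str, str] = {}
    for segment in relative.split("/"):
        name, sep, value = segment.partition("=")
        if sep:  # only key=value segments describe partitions
            keys[name] = value
    return keys


print(partition_keys(
    "/user/hive/warehouse/my_table/part1=foo/part2=bar",
    "/user/hive/warehouse/my_table",
))
```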

If the basePath property is not specified, Spark exports the data in the given export directory but does not export the partition keys.

For read options that require a Unicode character or escape code (for example, when specifying an exotic delimiter setting), use Unicode escape syntax as follows:

read-options.properties:
separator=\u0001

--read-schema FILE

Specify a schema, in a Spark-specific format, that feeds the read process for certain files that do not contain schema information, such as header-less CSV and ORC files. Specify the name of a file that contains the schema in the correct format. For example:

--read-schema schema_for_table.json

See Spark Format for Reading Schemas.
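
This page does not show the schema file itself. As a rough sketch, and assuming the expected format is the JSON representation that Spark's StructType produces (an assumption to verify against Spark Format for Reading Schemas), a schema for a hypothetical two-column table could be generated as follows. The column names id and name are made up for illustration.

```python
import json

# Hypothetical two-column table: an integer id and a string name.
# Assumption: the layout follows the JSON form of a Spark StructType.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Saving the printed text as schema_for_table.json would give a file of the kind passed to --read-schema in the example above, provided the assumption about the format holds.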

--read-timeout-seconds SECONDS

Communications timeout for reads, specified in seconds. Default: 0.

--secured

Use SSL/TLS to secure all communications. The default is not secured. See also Verifying SSL/TLS Encryption.

--spark-log-level STRING

Specify the Apache Spark log level setting. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--table, -t

Name of the target table to load.

--task-failures NUMBER

Specify the maximum number of tolerated errors for any Spark task. Default: 1. For details, see Error Tolerance. (See also the spark.task.maxFailures property.)

--truncate-prefix STRING, --truncate-suffix STRING

Specify a string that will be prepended or appended to data that exceeds the declared length of target columns.

--username, -U

Database login username (equivalent to the YBUSER environment variable). No default.

--write-timeout-seconds SECONDS

Communications timeout for writes, specified in seconds. Default: 0.

--yb-relay-host STRING

Host name of the system running the ybrelay service. Default: localhost.

--yb-relay-port NUMBER

ybrelay port number. Default: 21212.
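
The options above are typically combined in a single argument list. The sketch below assembles one and rejects the ybload options that the --load-option description says cannot be passed through. It is an illustration under assumptions: build_args is a hypothetical helper, the host names, user, table, and path are placeholders, and the way these arguments follow the application in a spark-submit command is not shown here.

```python
import shlex
from typing import List, Sequence, Tuple

# ybload options that cannot be specified through --load-option.
DISALLOWED_LOAD_OPTIONS = (
    "--csv-delimiter", "--csv-quote", "--csv-quote-escape",
    "--linesep", "--nullmarker",
)


def build_args(options: Sequence[Tuple[str, str]],
               load_options: Sequence[str] = ()) -> List[str]:
    """Flatten option pairs into an argument list (hypothetical helper)."""
    args: List[str] = []
    for flag, value in options:
        args += [flag, value]
    for load_option in load_options:
        if not load_option.strip():
            raise ValueError("empty --load-option value")
        # The option name is the first token, before any space or '='.
        name = load_option.split()[0].split("=")[0]
        if name in DISALLOWED_LOAD_OPTIONS:
            raise ValueError(f"{name} cannot be passed with --load-option")
        args += ["--load-option", load_option]
    return args


args = build_args(
    [
        ("--export-directory", "/user/hive/warehouse/my_table"),  # placeholder
        ("--format", "parquet"),
        ("--host", "yb.example.com"),              # placeholder host
        ("--port", "5432"),
        ("--dbname", "yellowbrick"),
        ("--username", "loader"),                  # placeholder user
        ("--table", "my_table"),                   # placeholder table
        ("--yb-relay-host", "relay.example.com"),  # placeholder host
        ("--yb-relay-port", "21212"),
    ],
    load_options=["--on-missing-field SUPPLYNULL"],
)
print(shlex.join(args))
```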

Legacy Options

These options are deprecated but available for backward compatibility.

--yb-database STRING

Yellowbrick database name. The --dbname option is preferred.

--yb-host STRING

Yellowbrick database host name. The --host option is preferred.

--yb-password STRING

Yellowbrick user's password. The --password option is preferred.

--yb-port NUMBER

Yellowbrick database port number. Default: 5432. The --port option is preferred.

--yb-table STRING

Name of the Yellowbrick table being loaded. The --table option is preferred.

--yb-user STRING

Yellowbrick user name. The --username option is preferred.