Spark Application Options

The options listed in this section are Spark application settings that you can use when you submit a Spark job that exports data to a Yellowbrick database. Some of these settings are required.

Required Settings

Every Spark job that exports data to Yellowbrick must define the format and location of the incoming data and connectivity settings for both the Yellowbrick database and the ybrelay service:

--export-directory STRING

A single directory, a single file reference, a comma-delimited list of directories/files, or a file listing reference if the argument begins with list:. For example:

--export-directory file:///opt/s3/data/data/People.csv

--export-directory list:/home/ybtests/ybrelay/spark/conf/export_file

--format

The format of the incoming data. The default is parquet. The --format option is pluggable; for example, avro and xml can be loaded by providing an external package. For example: --format com.databricks.spark.avro

The supported formats are:

json
csv
text
orc
xml (via plugin; see https://github.com/databricks/spark-xml)
avro (via plugin; see https://github.com/databricks/spark-avro)
jdbc (--export-directory not required). Specifying this format triggers the requirement to set the --jdbc-* options.
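
The forms that --export-directory accepts, and the jdbc exception, can be sketched in Python. This is a minimal illustration and not part of the Yellowbrick tooling: the function names export_directory_value and required_args are hypothetical, and the resulting list still has to be appended to whatever spark-submit command you use.

```python
def export_directory_value(paths=None, listing_file=None):
    """Build the value for --export-directory.

    paths: one or more directories/files, joined with commas.
    listing_file: a file listing reference, emitted with the list: prefix.
    """
    if listing_file is not None:
        return "list:" + listing_file
    if not paths:
        raise ValueError("at least one directory or file is required")
    return ",".join(paths)


def required_args(fmt="parquet", paths=None, listing_file=None):
    """Return the format/location arguments as an argv-style list."""
    args = ["--format", fmt]
    # The jdbc format reads from a database, so no export directory is added;
    # the --jdbc-* options must be supplied separately in that case.
    if fmt != "jdbc":
        args += ["--export-directory", export_directory_value(paths, listing_file)]
    return args


csv_args = required_args("csv", ["file:///opt/s3/data/data/People.csv"])
list_value = export_directory_value(
    listing_file="/home/ybtests/ybrelay/spark/conf/export_file"
)
```

Keeping the arguments in a list (rather than one string) avoids quoting problems when the list is later passed to a process launcher.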

Other Options

--bad-column-names

A subset of columns for recording in the .badrows file that the ybload operation produces. For example:

--bad-column-names event_id,YBLOAD_ERROR_COLUMN,YBLOAD_ERROR_REASON

The pseudo-columns YBLOAD_ERROR_COLUMN and YBLOAD_ERROR_REASON can be supplied to record the column containing an error and the reason, respectively.

--buffer-size

Buffer size for network packets to relay, in bytes (default: 2097152).

--cacert STRING

Customize trust with secured communication; use this option in combination with the --secured option. Enter the file name of a custom PEM-encoded certificate or the file name and password for a Java KeyStore (JKS).

For PEM format, the file must be named with a .pem, .cert, .cer, .crt, or .key extension. For example:

--cacert cacert.pem

For JKS format, files are always password-protected. Use the following format:

--cacert yellowbrick.jks:changeit

where the : character separates the file name and the password.
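
The two --cacert forms can be told apart by the file extension. The following Python sketch shows one way to classify a value before passing it on. The function name describe_cacert is hypothetical, and splitting at the first : character is an assumption; the text above only says that : separates the file name from the password.

```python
# Extensions that the option accepts for PEM-encoded certificates.
PEM_EXTENSIONS = (".pem", ".cert", ".cer", ".crt", ".key")


def describe_cacert(value):
    """Classify a --cacert value as a PEM file or a JKS file:password pair."""
    if value.endswith(PEM_EXTENSIONS):
        return {"kind": "pem", "file": value}
    # Assumption: everything before the first ':' is the keystore file name
    # and everything after it is the password.
    file_name, sep, password = value.partition(":")
    if not sep or not file_name or not password:
        raise ValueError("expected a PEM file or a JKS value in file:password form")
    return {"kind": "jks", "file": file_name, "password": password}


pem = describe_cacert("cacert.pem")
jks = describe_cacert("yellowbrick.jks:changeit")
```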

--column-names

Optionally, a list of the column names in the destination table. (If not specified, they are discovered when the Spark job connects to the target database.)

--computed-columns FILE | STRING

Specify one or more columns in expression form or use a properties file that contains expressions. The expression language is MVEL. This option is useful for including job-specific or task-specific context for the target table.

The following variables are supplied to each invocation of a computed column value:

transactionId: the identifier of the row in the load job
row: Spark row struct. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
taskContext: Spark task context. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/TaskContext.html

For example:

two_columns=row.get(row.fieldIndex('one')) + '|' + row.get(row.fieldIndex('two'))
task_partition_id=taskContext.partitionId()
task_stage_id=taskContext.stageId()
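
To make the first expression concrete, the following Python sketch mimics what two_columns computes for one row. It is only a stand-in: the real expressions are evaluated as MVEL against a Spark Row, and the FakeRow class here is a hypothetical helper that offers just the two methods the example calls.

```python
class FakeRow:
    """Hypothetical stand-in for a Spark Row (fieldIndex and get only)."""

    def __init__(self, names, values):
        self._names = list(names)
        self._values = list(values)

    def fieldIndex(self, name):
        # Position of the named field within the row.
        return self._names.index(name)

    def get(self, index):
        # Value stored at the given position.
        return self._values[index]


def two_columns(row):
    """Python rendering of: row.get(row.fieldIndex('one')) + '|' + row.get(row.fieldIndex('two'))"""
    return str(row.get(row.fieldIndex("one"))) + "|" + str(row.get(row.fieldIndex("two")))


result = two_columns(FakeRow(["one", "two"], ["alpha", "beta"]))
```

Here result holds the two source values joined by a | character, which is the value that would be loaded into the two_columns target column.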

--connection-timeout-seconds SECONDS

Communications timeout for connections, specified in seconds. Default: 120.

--dbname, -d

Name of the destination database (equivalent to the YBDATABASE environment variable). Default: yellowbrick

--disable-trust, -k

Disable SSL/TLS trust when using secured communications. Trust is enabled by default. See also Verifying SSL/TLS Encryption.

Important: This option is not supported for use on production systems and is only recommended for testing purposes. It may be useful to disable trust during testing, then enable it when a formal signed certificate is installed on the system.

--filter EXPRESSION

A valid Spark SQL filter expression to constrain the data. A filter expression is like a SQL WHERE clause but does not use the WHERE keyword.

For example, filter a source ORC file on _col1:

--filter " _col1 between 18 and 20 "

--help

Show usage/help. Default: false.

--host, -h

Destination server host name. Default: localhost

--jdbc-url STRING

JDBC URL to connect to.

--jdbc-properties STRING

JDBC properties file, containing configuration parameters for the external database.

--jdbc-table STRING

JDBC table to load.

--load-option

An option that can be passed to the ybload process. See ybload Options for possible options and values. You can specify --load-option multiple times in the command, once per load option. The following ybload options cannot be specified: --csv-delimiter, --csv-quote, --csv-quote-escape, --linesep, --nullmarker.

For example:

--load-option "--on-missing-field SUPPLYNULL"
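
Because --load-option is repeated once per ybload option and some ybload options are rejected, a small helper can assemble and check the arguments. This is a hedged Python sketch. The name load_option_args is hypothetical, and the check looks only at the option name at the start of each string.

```python
# ybload options that cannot be passed through --load-option.
FORBIDDEN = {
    "--csv-delimiter",
    "--csv-quote",
    "--csv-quote-escape",
    "--linesep",
    "--nullmarker",
}


def load_option_args(options):
    """Emit one --load-option pair per ybload option, rejecting forbidden ones."""
    args = []
    for option in options:
        text = option.strip()
        if not text:
            raise ValueError("empty load option")
        # The option name is the first token, before any space or '='.
        name = text.split()[0].split("=")[0]
        if name in FORBIDDEN:
            raise ValueError(f"{name} cannot be specified as a load option")
        args += ["--load-option", text]
    return args


args = load_option_args(["--on-missing-field SUPPLYNULL"])
```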

--log-level STRING

Specify the export log level. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--logs STRING

Specify a directory to which ybrelay logs are exported, including a summary written at the end of the ybload job.

--map FILE

Specify a column mapping file that contains simple column mappings (destination column=source column). For example, specify --map column-mappings.txt, where column-mappings.txt is a Java properties-style file with:

destination_column_one=source_column_one

In this case, a column called destination_column_one would be given input data from source_column_one.
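
The direction of the mapping (destination on the left, source on the right) is easy to get backwards. The following Python sketch parses a mapping in the same properties style and applies it to one row. The helper names and the sample row are hypothetical, and the parser handles only simple key=value lines and comments, not the full Java properties syntax.

```python
def parse_column_map(text):
    """Parse 'destination=source' lines into a dict keyed by destination column."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and properties-style comments.
        if not line or line.startswith(("#", "!")):
            continue
        destination, _, source = line.partition("=")
        mapping[destination.strip()] = source.strip()
    return mapping


def apply_column_map(mapping, source_row):
    """Build a destination row by pulling each value from its source column."""
    return {dest: source_row[src] for dest, src in mapping.items()}


mapping = parse_column_map("destination_column_one=source_column_one\n")
loaded = apply_column_map(mapping, {"source_column_one": 42})
```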

--password, -W

Interactive prompt for the database user's password (equivalent to the YBPASSWORD environment variable). No default.

--port

Destination server port number (equivalent to the YBPORT environment variable). Default: 5432

--pre-sql, --post-sql

Specify a SQL statement that is invoked by the Spark driver before or after the load job. Alternatively, you can use a file reference by prefixing the file name with file:. For example, --pre-sql file:prepare-job.sql assumes that the file prepare-job.sql contains a SQL template to be executed before the execution of the job itself.

Template values can be inserted using the MVEL template syntax. The following template variables are provided:

transactionId: the identifier of this load job
sparkSession: see Spark session.
sparkContext: see Spark context.
config: The supplied and parsed configuration for the export. Included values are config.ybTableName, config.ybHost, and so on.

For example:

--pre-sql "insert into job_log(start_time, finish_time, transaction_id) values(current_timestamp, null, '@{transactionId}')" 
--post-sql "update job_log set finish_time = current_timestamp where transaction_id = '@{transactionId}'"
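
The substitution step can be pictured with a small Python sketch. It is a simplified stand-in and not the MVEL template engine: it replaces only plain @{name} placeholders, whereas MVEL evaluates full expressions. The function name render_template and the sample identifier tx-1 are made up for illustration.

```python
import re


def render_template(template, variables):
    """Replace each @{name} placeholder with the matching variable value."""

    def replace(match):
        return str(variables[match.group(1)])

    return re.sub(r"@\{(\w+)\}", replace, template)


post_sql = (
    "update job_log set finish_time = current_timestamp "
    "where transaction_id = '@{transactionId}'"
)
rendered = render_template(post_sql, {"transactionId": "tx-1"})
```

After rendering, the statement ends with transaction_id = 'tx-1', so the --pre-sql insert and the --post-sql update address the same job_log row.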

--queue-depth NUMBER

Queue depth for outbound network packets to relay. Default: 16.

--read-options STRING

Specify read options (such as basePath=) in an external properties file and provide the path to the file.

Some of the formats for export require such options to optimize the read process or change its behavior. For example, you can use basePath to manage the export of a single partition directory structure for all formats:

--export-directory /user/hive/warehouse/my_table/part1=foo/part2=bar 
... 
--read-options read-options.properties 

read-options.properties: 
basePath=/user/hive/warehouse/my_table

If the basePath property is not specified, Spark will export the data in the given export directory but will not export the partition keys. For read options that require a Unicode or escape code (for example, when specifying an exotic delimiter setting), use the Unicode escaping syntax as follows:

read-options.properties: 
separator=\u0001
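
Both points above can be illustrated in Python. The first function shows which partition keys become recoverable once basePath is known, by reading the name=value segments between the base path and the export directory. The second shows the character that a \u0001 escape stands for. The function names are hypothetical and the code only models the behavior; Spark performs the actual partition discovery and the properties file is read by the export job.

```python
import posixpath


def partition_keys(export_directory, base_path):
    """Return the name=value path segments that lie below base_path."""
    relative = posixpath.relpath(export_directory, base_path)
    keys = {}
    for segment in relative.split("/"):
        name, sep, value = segment.partition("=")
        if sep:
            keys[name] = value
    return keys


def decode_property_value(raw):
    """Decode backslash escapes such as \\u0001 into the character they denote."""
    return raw.encode("ascii").decode("unicode_escape")


keys = partition_keys(
    "/user/hive/warehouse/my_table/part1=foo/part2=bar",
    "/user/hive/warehouse/my_table",
)
separator = decode_property_value(r"\u0001")
```

With the paths from the example, keys contains part1 and part2, which are the partition columns that would be missing from the export if basePath were left out.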

--read-schema FILE

Specify a schema, in a Spark-specific format, that feeds the read process for certain files that do not contain schema information, such as header-less CSV and ORC files. Specify the name of a file that contains the schema in the correct format. For example:

--read-schema schema_for_table.json

See Spark Format for Reading Schemas.
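
As a rough illustration, the following Python sketch writes out a schema in the JSON layout that Spark uses when it serializes a StructType. That this is the layout --read-schema expects is an assumption here; check it against Spark Format for Reading Schemas. The column names event_id and event_name are hypothetical.

```python
import json


def struct_field(name, data_type, nullable=True):
    """One field entry in Spark's StructType JSON representation."""
    return {"name": name, "type": data_type, "nullable": nullable, "metadata": {}}


schema = {
    "type": "struct",
    "fields": [
        struct_field("event_id", "long", nullable=False),
        struct_field("event_name", "string"),
    ],
}

# Text that could be saved as, for example, schema_for_table.json.
schema_json = json.dumps(schema, indent=2)
```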

--read-timeout-seconds SECONDS

Communications timeout for reads, specified in seconds. Default: 0.

--secured

Use SSL/TLS to secure all communications. The default is not secured. See also Verifying SSL/TLS Encryption.

--spark-log-level STRING

Specify the Apache Spark log level setting. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--table, -t

Name of the target table to load.

--task-failures NUMBER

Specify the maximum number of tolerated errors for any Spark task. Default: 1. For details, see Error Tolerance. (See also the spark.task.maxFailures property.)

--truncate-prefix STRING, --truncate-suffix STRING

Specify a string that will be prepended or appended to data that exceeds the declared length of target columns.

--username, -U

Database login username (equivalent to the YBUSER environment variable). No default.
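
Several connection options have an equivalent environment variable (YBDATABASE, YBPORT, YBUSER). The Python sketch below shows one plausible way a wrapper script could resolve a value. The precedence used here (explicit option, then environment variable, then documented default) is an assumption and is not stated in this section, and the function name resolve is hypothetical.

```python
import os

# option name -> (environment variable, documented default or None)
ENV_DEFAULTS = {
    "dbname": ("YBDATABASE", "yellowbrick"),
    "port": ("YBPORT", "5432"),
    "username": ("YBUSER", None),
}


def resolve(option, explicit=None, environ=None):
    """Pick a value: explicit option first, then environment, then default."""
    environ = os.environ if environ is None else environ
    env_name, default = ENV_DEFAULTS[option]
    if explicit is not None:
        return explicit
    return environ.get(env_name, default)


dbname = resolve("dbname", environ={"YBDATABASE": "sales"})
port = resolve("port", environ={})
```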

--write-timeout-seconds SECONDS

Communications timeout for writes, specified in seconds. Default: 0.

--yb-relay-host STRING

Host name of the system running the ybrelay service. Default: localhost.

--yb-relay-port NUMBER

ybrelay port number. Default: 21212.

Legacy Options

These options are deprecated but available for backward compatibility.

--yb-database STRING: Yellowbrick database name. The --dbname option is preferred.
--yb-host STRING: Yellowbrick database host name. The --host option is preferred.
--yb-password STRING: Yellowbrick user's password. The --password option is preferred.
--yb-port NUMBER: Yellowbrick database port number. Default: 5432. The --port option is preferred.
--yb-table STRING: Name of the Yellowbrick table being loaded. The --table option is preferred.
--yb-user STRING: Yellowbrick user name. The --username option is preferred.
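
When updating older job scripts, the legacy names can be rewritten mechanically. The following Python sketch does this for an argv-style list. The function name modernize is hypothetical. It maps --yb-user to --username, the name listed under Other Options, and it deliberately leaves --yb-password untouched because --password is described as an interactive prompt rather than an option that takes a value.

```python
# Legacy option -> preferred option. --yb-password is omitted on purpose:
# its preferred counterpart prompts interactively instead of taking a string.
LEGACY_TO_PREFERRED = {
    "--yb-database": "--dbname",
    "--yb-host": "--host",
    "--yb-port": "--port",
    "--yb-table": "--table",
    "--yb-user": "--username",
}


def modernize(argv):
    """Replace legacy option names with their preferred equivalents."""
    return [LEGACY_TO_PREFERRED.get(arg, arg) for arg in argv]


updated = modernize(["--yb-host", "db1", "--yb-table", "events", "--format", "csv"])
```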
