Spark Application Options

The options listed in this section are Spark application settings that you use when you submit a Spark job that exports data to a Yellowbrick database. Some of these settings are required.

Required Settings

Every Spark job that exports data to Yellowbrick must define the format and location of the incoming data, as well as connectivity settings for both the Yellowbrick database and the ybrelay service. A sketch of a complete command follows the list of settings:

--export-directory STRING

A single directory, a single file reference, a comma-delimited list of directories/files, or a file listing reference if the argument begins with list:. For example:

--export-directory file:///opt/s3/data/data/People.csv

--export-directory list:/home/ybtests/ybrelay/spark/conf/export_file

--format

Format of the incoming data. The default is parquet. The --format option is pluggable; for example, avro and xml can be loaded by providing an external package, as in --format com.databricks.spark.avro.

The supported formats are:

--yb-host STRING

Yellowbrick database host name.

--yb-port NUMBER

Yellowbrick database port number. Default: 5432

--yb-database STRING

Yellowbrick database name.

--yb-user STRING

Yellowbrick user name.

--yb-password STRING

Yellowbrick user's password.

--yb-relay-host STRING

Host name of the system running the ybrelay service. Default: localhost

--yb-relay-port NUMBER

ybrelay port number. Default: 21212

--yb-table STRING

Name of the Yellowbrick table being loaded.
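
For example, the following sketch of a spark-submit command supplies the required settings. The jar name, host names, credentials, and database and table names are placeholders for your environment, and Spark's own submission options (such as --class and --master) are omitted:

spark-submit yb-spark-export.jar \
  --export-directory file:///opt/s3/data/data/People.csv \
  --format csv \
  --yb-host yb-manager.example.com \
  --yb-database mydb \
  --yb-user ybuser \
  --yb-password mypassword \
  --yb-relay-host relay01.example.com \
  --yb-relay-port 21212 \
  --yb-table people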

Other Options

--bad-column-names

A subset of columns to record in the .badrows file that the ybload operation produces. For example:

--bad-column-names event_id,YBLOAD_ERROR_COLUMN,YBLOAD_ERROR_REASON

The pseudo-columns YBLOAD_ERROR_COLUMN and YBLOAD_ERROR_REASON can be supplied to record the column containing an error and the reason, respectively.

--buffer-size

Buffer size, in bytes, for network packets sent to the relay. Default: 2097152 (2 MB).

--column-names

Optionally, a list of the column names in the destination table. (If not specified, they are discovered when the Spark job connects to the target database.)
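
For example, assuming the list takes the same comma-delimited form as --bad-column-names and these column names are hypothetical:

--column-names id,first_name,created_at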

--computed-columns FILE | STRING

Specify one or more columns in expression form or use a properties file that contains expressions. The expression language is MVEL. This option is useful for including job-specific or task-specific context for the target table.

Variables such as row (the current source row) and taskContext (the Spark TaskContext) are supplied to each invocation of a computed column expression:

For example:

two_columns=row.get(row.fieldIndex('one')) + '|' + row.get(row.fieldIndex('two'))
task_partition_id=taskContext.partitionId()
task_stage_id=taskContext.stageId()
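
If expressions such as these are saved in a properties file (the file name here is hypothetical), the job can reference the file instead of passing an expression inline:

--computed-columns computed-columns.properties
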
--connection-timeout-seconds SECONDS

Communications timeout for connections, specified in seconds. Default: 120.

--filter EXPRESSION

A valid Spark SQL filter expression to constrain the data. A filter expression is like a SQL WHERE clause but does not use the WHERE keyword.

For example, filter a source ORC file on _col1:

--filter " _col1 between 18 and 20 "
--filter
--help

Show usage/help. Default: false.

--jdbc-url STRING

JDBC URL to connect to.

--jdbc-properties STRING

JDBC properties file, containing configuration parameters for the external database.

--jdbc-table STRING

JDBC table to load.
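
For example, to load from an external JDBC source (the URL, properties file, and table name are hypothetical, and the URL format depends on the source database's JDBC driver):

--jdbc-url jdbc:postgresql://source-db.example.com:5432/sourcedb
--jdbc-properties jdbc.properties
--jdbc-table public.source_table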

--load-option

An option that can be passed to the ybload process. See ybload Options for possible options and values. You can specify --load-option multiple times in the command, once per load option. The following ybload options cannot be specified: --csv-delimiter, --csv-quote, --csv-quote-escape, --linesep, --nullmarker.

For example:

--load-option "--on-missing-field SUPPLYNULL"
--log-level STRING

Specify the export log level. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--logs STRING

Specify a directory to which ybrelay logs are exported, including a summary written at the end of the ybload job.

--map FILE

Specify a column mapping file that contains simple column mappings (destination column=source column). For example, specify --map column-mappings.txt, where column-mappings.txt is a Java properties-style file with:

destination_column_one=source_column_one

In this case, a column called destination_column_one would be given input data from source_column_one.
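
A mapping file with several entries follows the same pattern; the column names here are hypothetical:

column-mappings.txt:
customer_id=cust_id
signup_date=created_at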

--pre-sql, --post-sql

Specify a SQL statement that is invoked by the Spark driver before or after the load job. Alternatively, you can use a file reference by prefixing the file name with file:. For example, --pre-sql file:prepare-job.sql assumes that prepare-job.sql contains a SQL template to be executed before the job itself.

Template values can be inserted using the MVEL template syntax. The following template variables are provided:

  • transactionId: the identifier of this load job
  • sparkSession: see Spark session.
  • sparkContext: see Spark context.
  • config: The supplied and parsed configuration for the export. Included values are config.ybTableName, config.ybHost, and so on.

For example:

--pre-sql "insert into job_log(start_time, finish_time, transaction_id) values(current_timestamp, null, '@{transactionId}')" 
--post-sql "update job_log set finish_time = current_timestamp where transaction_id = '@{transactionId}'"
--queue-depth NUMBER

Queue depth for outbound network packets to relay. Default: 16.

--read-options STRING

Specify read options (such as basePath=) in an external properties file and provide the path to the file.

Some export formats require such options to optimize the read process or change its behavior. For example, for any format you can use basePath to manage the export of a single partition within a partitioned directory structure:

--export-directory /user/hive/warehouse/my_table/part1=foo/part2=bar 
... 
--read-options read-options.properties 

read-options.properties: 
basePath=/user/hive/warehouse/my_table

If the basePath property is not specified, Spark exports the data in the given export directory but does not export the partition keys. For read options that require a Unicode or escape code (for example, when specifying an unusual delimiter), use Unicode escape syntax as follows:

read-options.properties:
separator=\u0001

--read-schema FILE

Specify the name of a file that contains a schema, in a Spark-specific format, to feed the read process for files that do not include schema information, such as headerless CSV and ORC files. For example:

--read-schema schema_for_table.json

See Spark Format for Reading Schemas.
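
A minimal sketch of such a schema file, assuming it uses Spark's StructType JSON representation (see Spark Format for Reading Schemas for the authoritative format):

schema_for_table.json:
{"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":true,"metadata":{}},
  {"name":"name","type":"string","nullable":true,"metadata":{}}
]}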

--read-timeout-seconds SECONDS

Communications timeout for reads, specified in seconds. Default: 0.

--spark-log-level STRING

Specify the Apache Spark log level setting. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.

--task-failures NUMBER

Specify the maximum number of tolerated errors for any Spark task. Default: 1. For details, see Error Tolerance. (See also the spark.task.maxFailures property.)

--truncate-prefix STRING, --truncate-suffix STRING

Specify a string that will be prepended or appended to data that exceeds the declared length of target columns.
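
For example, to mark truncated values with a suffix (the marker string is arbitrary):

--truncate-suffix "[TRUNCATED]"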

--write-timeout-seconds SECONDS

Communications timeout for writes, specified in seconds. Default: 0.
