Spark Application Options
The options listed in this section are Spark application settings that you use when you submit a Spark job that exports data to a Yellowbrick database. Some of these settings are required.
Required Settings
Every Spark job that exports data to Yellowbrick must define the format and location of the incoming data and connectivity settings for both the Yellowbrick database and the ybrelay service:
- --export-directory STRING
A single directory, a single file reference, a comma-delimited list of directories/files, or a file listing reference if the argument begins with list:. For example:
--export-directory file:///opt/s3/data/data/People.csv
--export-directory list:/home/ybtests/ybrelay/spark/conf/export_file
- --format
The default is parquet. --format is pluggable; for example, avro and xml can be loaded by providing an external package. For example:
--format com.databricks.spark.avro
The supported formats are:
- json
- csv
- text
- orc
- xml (via plugin; see https://github.com/databricks/spark-xml)
- avro (via plugin; see https://github.com/databricks/spark-avro)
- jdbc (--export-directory not required). Specifying this format triggers the requirement to set the --jdbc-* options.
- --yb-host STRING
Yellowbrick database host name.
- --yb-port NUMBER
Yellowbrick database port number. Default: 5432.
- --yb-database STRING
Yellowbrick database name.
- --yb-user STRING
Yellowbrick user name.
- --yb-password STRING
Yellowbrick user's password.
- --yb-relay-host STRING
Host name of the system running the ybrelay service. Default: localhost.
- --yb-relay-port NUMBER
ybrelay port number. Default: 21212.
- --yb-table STRING
Name of the Yellowbrick table being loaded.
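Putting the required settings together, a submission might look like the following sketch. The application class and application JAR are placeholders for the ybrelay Spark application deployed at your site, and the host names, credentials, and table name are illustrative only:
spark-submit --class <ybrelay-spark-application-class> <ybrelay-spark-application.jar> \
  --format csv \
  --export-directory file:///opt/s3/data/data/People.csv \
  --yb-host yb-head.example.com \
  --yb-port 5432 \
  --yb-database premdb \
  --yb-user ybuser \
  --yb-password ybpassword \
  --yb-relay-host relay01.example.com \
  --yb-relay-port 21212 \
  --yb-table people
If --format jdbc is used instead, --export-directory can be omitted, and the --jdbc-* options described under Other Options must be set.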
Other Options
- --bad-column-names
A subset of columns for recording in the .badrows file that the ybload operation produces. For example:
--bad-column-names event_id,YBLOAD_ERROR_COLUMN,YBLOAD_ERROR_REASON
The pseudo-columns YBLOAD_ERROR_COLUMN and YBLOAD_ERROR_REASON can be supplied to record the column containing an error and the reason, respectively.
- --buffer-size
Buffer size for network packets to relay, in bytes (default: 2097152).
- --column-names
Optionally, a list of the column names in the destination table. (If not specified, they are discovered when the Spark job connects to the target database.)
- --computed-columns FILE | STRING
Specify one or more columns in expression form or use a properties file that contains expressions. The expression language is MVEL. This option is useful for including job-specific or task-specific context for the target table.
The following variables are supplied to each invocation of a computed column value:
- transactionId: the identifier of the row in the load job
- row: the Spark row struct. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
- taskContext: the Spark task context. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/TaskContext.html
For example:
two_columns=row.get(row.fieldIndex('one')) + '|' + row.get(row.fieldIndex('two'))
task_partition_id=taskContext.partitionId()
task_stage_id=taskContext.stageId()
- --connection-timeout-seconds SECONDS
Communications timeout for connections, specified in seconds. Default: 120.
- --filter EXPRESSION
A valid Spark SQL filter expression to constrain the data. A filter expression is like a SQL WHERE clause but does not use the WHERE keyword. For example, filter a source ORC file on _col1:
--filter " _col1 between 18 and 20 "
- --help
Show usage/help. Default: false.
- --jdbc-url STRING
JDBC URL to connect to.
- --jdbc-properties STRING
JDBC properties file, containing configuration parameters for the external database.
- --jdbc-table STRING
JDBC table to load.
- --load-option
An option that can be passed to the ybload process. See ybload Options for possible options and values. You can specify --load-option multiple times in the command, once per load option. The following ybload options cannot be specified: --csv-delimiter, --csv-quote, --csv-quote-escape, --linesep, --nullmarker. For example:
--load-option "--on-missing-field SUPPLYNULL"
- --log-level STRING
Specify the export log level. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.
- --logs STRING
Specify a directory where ybrelay logs will be exported, including a summary of the end of the ybload job.
- --map FILE
Specify a column mapping file that contains simple column mappings (destination column=source column). For example, specify --map column-mappings.txt, where column-mappings.txt is a Java properties-style file with:
destination_column_one=source_column_one
In this case, a column called destination_column_one would be given input data from source_column_one.
- --pre-sql, --post-sql
Specify a SQL statement that is invoked by the Spark driver before or after the load job. Alternatively, you can use a file reference by prefixing the file name with file:. For example, --pre-sql file:prepare-job.sql assumes that the file prepare-job.sql contains a SQL template to be executed before the job itself. Template values can be inserted using the MVEL template syntax. The following template variables are provided:
- transactionId: the identifier of this load job
- sparkSession: see Spark session
- sparkContext: see Spark context
- config: the supplied and parsed configuration for the export. Included values are config.ybTableName, config.ybHost, and so on.
For example:
--pre-sql "insert into job_log(start_time, finish_time, transaction_id) values(current_timestamp, null, '@{transactionId}')"
--post-sql "update job_log set finish_time = current_timestamp where transaction_id = '@{transactionId}'"
- --queue-depth NUMBER
Queue depth for outbound network packets to relay. Default: 16.
- --read-options STRING
Specify read options (such as basePath=) in an external properties file and provide the path to the file. Some of the formats for export require such options to optimize the read process or change its behavior. For example, you can manage the export of a single partition directory structure for all formats by using basePath:
--export-directory /user/hive/warehouse/my_table/part1=foo/part2=bar ... --read-options read-options.properties
read-options.properties:
basePath=/user/hive/warehouse/my_table
If the basePath property is not specified, Spark will export the data in the given export directory but will not export the partition keys. For read options that require a unicode or escape code (for example, when specifying an exotic delimiter setting), use the unicode escaping syntax as follows:
read-options.properties:
separator=\u0001
- --read-schema FILE
Specify a schema, in a Spark-specific format, that feeds the read process for certain files that do not contain schema information, such as header-less CSV and ORC files. Specify the name of a file that contains the schema in the correct format (a sample schema file is sketched after this list). For example:
--read-schema schema_for_table.json
- --read-timeout-seconds SECONDS
Communications timeout for reads, specified in seconds. Default: 0.
- --spark-log-level STRING
Specify the Apache Spark log level setting. Valid values: ERROR, WARN, INFO, DEBUG, TRACE.
- --task-failures NUMBER
Specify the maximum number of tolerated errors for any Spark task. Default: 1. For details, see Error Tolerance. (See also the spark.task.maxFailures property.)
- --truncate-prefix STRING, --truncate-suffix STRING
Specify a string that will be prepended or appended to data that exceeds the declared length of target columns.
- --write-timeout-seconds SECONDS
Communications timeout for writes, specified in seconds. Default: 0.
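As noted for --read-schema above, the schema file must be in a Spark-specific format. A minimal sketch of such a file, assuming it holds the JSON representation produced by Spark's StructType.json() and describing a hypothetical three-column header-less CSV source, might look like this:
schema_for_table.json:
{
  "type": "struct",
  "fields": [
    { "name": "event_id",   "type": "long",      "nullable": true, "metadata": {} },
    { "name": "event_name", "type": "string",    "nullable": true, "metadata": {} },
    { "name": "event_ts",   "type": "timestamp", "nullable": true, "metadata": {} }
  ]
}
To confirm the exact type names and structure expected by your Spark version, compare against the output of schema.json() on a DataFrame that matches your source data.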
Parent topic: Setting up and Running a Spark Job