Examples of Spark Jobs
These example job submissions are for a single standalone Spark cluster, reading from a local file system using the file://
URI prefix. Other well-known protocols (such as S3, NFS, and HDFS) can also be used.
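For example, the --export-directory argument could point at a remote filesystem instead of a local path. The bucket and host names below are hypothetical; the exact URI scheme depends on the connectors configured in your Spark cluster:

```shell
# Local file system (as used in the examples below)
--export-directory file:///opt/s3/data/data/People.csv

# Amazon S3, via the s3a connector (hypothetical bucket name)
--export-directory s3a://my-bucket/data/People.csv

# HDFS (hypothetical NameNode host and port)
--export-directory hdfs://namenode:8020/data/People.csv
```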
This example submits a job that exports data from a CSV file with a header:
/opt/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
--class io.yellowbrick.spark.cli.SparkExport \
--master local \
--executor-memory 2G \
--num-executors 1 \
--executor-cores 1 \
/opt/ybtools/integrations/spark/relay-integration-spark-*-shaded.jar \
--format csv \
--export-directory file:///opt/s3/data/data/People.csv \
--read-options /opt/spark/spark-2.4.3-bin-hadoop2.7/read-options-csv.properties \
--logs /opt/s3/data/logs \
--yb-relay-host localhost \
--yb-relay-port 21212 \
--yb-host mydemo.yellowbrick.io \
--yb-port 5432 \
--yb-user username \
--yb-password password \
--yb-database ybspark_demo \
--yb-table people2
For Spark to read the header line successfully, the referenced read-options-csv.properties
file must contain:
header=true
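Other Spark CSV reader options can be set in the same properties file, one key=value pair per line. A sketch of a fuller file (the values here are illustrative, not required):

```
header=true
# Field delimiter; comma is the Spark default
sep=,
# Let Spark infer column types instead of reading everything as strings
inferSchema=true
# Treat this token as SQL NULL
nullValue=NA
```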
The following example exports data from a JSON file:
/opt/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
--class io.yellowbrick.spark.cli.SparkExport \
--master local \
--executor-memory 2G \
--num-executors 1 \
--executor-cores 1 \
/opt/ybtools/integrations/spark/relay-integration-spark-*-shaded.jar \
--format json \
--export-directory file:///opt/s3/data/data/people_json.json \
--read-options /opt/spark/spark-2.4.3-bin-hadoop2.7/read-options-json.properties \
--logs /opt/s3/data/logs \
--yb-relay-host localhost \
--yb-relay-port 21212 \
--yb-host mydemo.yellowbrick.io \
--yb-port 5432 \
--yb-user username \
--yb-password password \
--yb-database ybspark_demo \
--yb-table people2
Depending on how the JSON document is formatted, some manipulation might be needed. If each record spans multiple lines, as in the sample below, create a read-options-json.properties
file that contains:
multiLine=true
The incoming JSON data in this case looks like this:
[{
"playerID" : "aardsda01",
"birthYear" : "1981",
...
"finalGame" : "2015-08-23"
}, {
"playerID" : "aaronha01",
"birthYear" : "1934",
...
"finalGame" : "1976-10-03"
}]
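If setting multiLine=true is not an option, data in the shape above can instead be pre-converted to JSON Lines (one complete object per line), which is the layout Spark's JSON reader expects by default. A minimal sketch using only Python's standard library; the record fields are taken from the sample above:

```python
import json

def to_json_lines(records):
    """Serialize a list of records as JSON Lines: one object per line."""
    return "\n".join(json.dumps(rec) for rec in records)

# Sample records matching the JSON array shown above (abridged fields)
records = [
    {"playerID": "aardsda01", "birthYear": "1981", "finalGame": "2015-08-23"},
    {"playerID": "aaronha01", "birthYear": "1934", "finalGame": "1976-10-03"},
]

print(to_json_lines(records))
```

Each output line is a self-contained JSON object, so Spark can split the file by line without multiLine=true.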