Examples of Spark Jobs
These example job submissions are for a single standalone Spark cluster, reading from a local file system using the file://
URI prefix. Other well-known protocols (such as S3, NFS, and HDFS) can also be used.
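For example, the --export-directory argument could point at a remote filesystem instead of a local path. The bucket and host names below are hypothetical; the exact URI scheme depends on the connectors configured in your Spark cluster:

```shell
# Local file system (as used in the examples below)
--export-directory file:///opt/s3/data/data/People.csv

# Amazon S3, via the s3a connector (hypothetical bucket name)
--export-directory s3a://my-bucket/data/People.csv

# HDFS (hypothetical NameNode host and port)
--export-directory hdfs://namenode:8020/data/People.csv
```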
This example submits a job that exports data from a CSV file with a header:
/opt/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
--class io.yellowbrick.spark.cli.SparkExport \
--master local \
--executor-memory 2G \
--num-executors 1 \
--executor-cores 1 \
/opt/ybtools/integrations/spark/relay-integration-spark-*-shaded.jar \
--format csv \
--export-directory file:///opt/s3/data/data/People.csv \
--read-options /opt/spark/spark-2.4.3-bin-hadoop2.7/read-options-csv.properties \
--logs /opt/s3/data/logs \
--yb-relay-host localhost \
--yb-relay-port 21212 \
--yb-host mydemo.yellowbrick.io \
--yb-port 5432 \
--yb-user username \
--yb-password password \
--yb-database ybspark_demo \
--yb-table people2
For Spark to read the header line successfully, the referenced read-options-csv.properties
file must contain:
header=true
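Other Spark CSV reader options can be set in the same properties file, one key=value pair per line. A sketch of a fuller file (the values here are illustrative, not required):

```
header=true
# Field delimiter; comma is the Spark default
sep=,
# Let Spark infer column types instead of reading everything as strings
inferSchema=true
# Treat this token as SQL NULL
nullValue=NA
```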
The following example exports data from a JSON file:
/opt/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
--class io.yellowbrick.spark.cli.SparkExport \
--master local \
--executor-memory 2G \
--num-executors 1 \
--executor-cores 1 \
/opt/ybtools/integrations/spark/relay-integration-spark-*-shaded.jar \
--format json \
--export-directory file:///opt/s3/data/data/people_json.json \
--read-options /opt/spark/spark-2.4.3-bin-hadoop2.7/read-options-json.properties \
--logs /opt/s3/data/logs \
--yb-relay-host localhost \
--yb-relay-port 21212 \
--yb-host mydemo.yellowbrick.io \
--yb-port 5432 \
--yb-user username \
--yb-password password \
--yb-database ybspark_demo \
--yb-table people2
Depending on how the JSON document is formatted, some manipulation might be needed. If each record spans multiple lines, as in the sample below, create a read-options-json.properties
file that contains:
multiLine=true
The incoming JSON data in this case looks like this:
[{
"playerID" : "aardsda01",
"birthYear" : "1981",
...
"finalGame" : "2015-08-23"
}, {
"playerID" : "aaronha01",
"birthYear" : "1934",
...
"finalGame" : "1976-10-03"
}]
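If setting multiLine=true is not an option, data in the shape above can instead be pre-converted to JSON Lines (one complete object per line), which is the layout Spark's JSON reader expects by default. A minimal sketch using only Python's standard library; the record fields are taken from the sample above:

```python
import json

def to_json_lines(records):
    """Serialize a list of records as JSON Lines: one object per line."""
    return "\n".join(json.dumps(rec) for rec in records)

# Sample records matching the JSON array shown above (abridged fields)
records = [
    {"playerID": "aardsda01", "birthYear": "1981", "finalGame": "2015-08-23"},
    {"playerID": "aaronha01", "birthYear": "1934", "finalGame": "1976-10-03"},
]

print(to_json_lines(records))
```

Each output line is a self-contained JSON object, so Spark can split the file by line without multiLine=true.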