Loading Tables with Spark

This section describes how to bulk load data from source files that ybload cannot read directly, such as Avro and ORC files. Wherever the source data resides (HDFS, NFS, S3, and so on), you can load it by running Apache Spark jobs that use the Yellowbrick ybrelay service to call ybload. Data exported by the Spark application flows to a server running the ybrelay service and is then loaded into Yellowbrick database tables by ybload operations.

Follow these steps to bulk load Yellowbrick tables via Spark and ybrelay. Subsequent sections explain these steps in detail and provide examples.

1. Install and set up Apache Spark and the ybrelay service.
2. Define the parameters for a spark-submit command:
   - Native Spark options
   - Spark application options:
     - Yellowbrick database connectivity
     - ybrelay connectivity
     - General options
     - ybload options, if needed
3. Run the spark-submit command.
4. Monitor the resulting ybload operation.
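
As a rough illustration of how the parameter groups in the steps above fit together, the following Python sketch assembles a spark-submit command line. It relies only on the standard spark-submit layout: native Spark options come before the application JAR, and everything after the JAR is passed to the Spark application. The JAR name, the class name, the host names, and every application option name (anything starting with `--example-`) are hypothetical placeholders, not the actual Yellowbrick option names; see the detailed topics for the real ones.

```python
import shlex


def build_spark_submit(native_options, app_jar, app_option_groups):
    """Assemble a spark-submit argument list.

    Native Spark options are placed before the application JAR;
    application options are placed after it, group by group.
    """
    command = ["spark-submit"]
    for flag, value in native_options.items():
        command += [flag, value]
    command.append(app_jar)
    for group in app_option_groups:
        for flag, value in group.items():
            command += [flag, value]
    return command


# Native Spark options (--master and --class are real spark-submit options;
# the class name is a placeholder).
native = {"--master": "local[*]", "--class": "com.example.ExportApp"}

# Spark application options. All names below are hypothetical placeholders.
database = {"--example-db-host": "yb.example.com",
            "--example-db-table": "public.orders"}
relay = {"--example-relay-host": "relay.example.com"}
general = {"--example-input-format": "orc"}
ybload = {}  # ybload options are only added if needed

command = build_spark_submit(
    native, "example-app.jar", [database, relay, general, ybload]
)
print(shlex.join(command))
```

Keeping each group in its own dictionary mirrors the list of parameter categories, and leaving the ybload group empty reflects that those options are optional.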
