
Spark Format for Reading Schemas

You can use the --read-schema Spark application option to define the schema that is used when reading files that do not contain schema information, such as header-less CSV and ORC files.

For example, here is the partial schema description for a specific table:

{
  "type": "struct",
  "fields": [
    { "name": "ws_sold_date_sk", "type": "integer", "nullable": true },
    { "name": "ws_sold_time_sk", "type": "integer", "nullable": true },
    { "name": "ws_ship_date_sk", "type": "integer", "nullable": true },
    { "name": "ws_item_sk", "type": "integer", "nullable": false },
    { "name": "ws_wholesale_cost", "type": "decimal(7,2)", "nullable": true },
    ...
  ]
}

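This JSON layout matches the representation Spark itself uses for a StructType, so one way to sanity-check a schema file before passing it to --read-schema is to parse it with Spark's DataType.fromJson. The file name below is hypothetical, and the snippet is only a minimal sketch, not part of the Yellowbrick tooling:

import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source

object ValidateReadSchema {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to the schema file that would be passed via --read-schema.
    val json = Source.fromFile("web_sales_read_schema.json").mkString

    // Parse the JSON shown above into a Spark StructType.
    val schema = DataType.fromJson(json).asInstanceOf[StructType]

    // Print each column with its Spark type and nullability.
    schema.fields.foreach { f =>
      println(s"${f.name}: ${f.dataType.simpleString}, nullable=${f.nullable}")
    }
  }
}
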
In this format, you must map each column of the input schema being read to one of the following Spark types. (The configuration for writes is determined dynamically by ybload and the Spark export process.) Typical mappings to Yellowbrick data types are shown in parentheses:

  • decimal (decimal)
  • double (double precision)
  • float (real)
  • long (bigint)
  • integer (int/integer)
  • short (smallint)
  • byte (tinyint)
  • string (varchar)
  • boolean (bool/boolean)
  • date (date)
  • timestamp (timestamp)

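As an illustration of these mappings, the following sketch builds the partial schema from the earlier example programmatically in Spark. The Yellowbrick types in the comments are the typical counterparts from the list above, and the column names are taken only from that example:

import org.apache.spark.sql.types._

// Spark type on the left; typical Yellowbrick type in the trailing comment.
val webSalesReadSchema = StructType(Seq(
  StructField("ws_sold_date_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_sold_time_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_ship_date_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_item_sk",        IntegerType,       nullable = false),  // int/integer
  StructField("ws_wholesale_cost", DecimalType(7, 2), nullable = true)    // decimal(7,2)
))

// The JSON accepted by --read-schema can also be generated from the StructType itself.
println(webSalesReadSchema.prettyJson)
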
Note: Yellowbrick does not support all the data types that Spark supports. For example, the array type is not supported.
