
Spark Format for Reading Schemas

You can use the --read-schema Spark application option to define the schema that is used when reading files that do not contain schema information, such as header-less CSV and ORC files.

For example, here is the partial schema description for a specific table:

{
  "type": "struct",
  "fields": [
    { "name": "ws_sold_date_sk", "type": "integer", "nullable": true },
    { "name": "ws_sold_time_sk", "type": "integer", "nullable": true },
    { "name": "ws_ship_date_sk", "type": "integer", "nullable": true },
    { "name": "ws_item_sk", "type": "integer", "nullable": false },
    { "name": "ws_wholesale_cost", "type": "decimal(7,2)", "nullable": true },
    ...
  ]
}

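This JSON layout matches the representation Spark itself uses for a StructType, so one way to sanity-check a schema file before passing it to --read-schema is to parse it with Spark's DataType.fromJson. The file name below is hypothetical, and the snippet is only a minimal sketch, not part of the Yellowbrick tooling:

import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source

object ValidateReadSchema {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to the schema file that would be passed via --read-schema.
    val json = Source.fromFile("web_sales_read_schema.json").mkString

    // Parse the JSON shown above into a Spark StructType.
    val schema = DataType.fromJson(json).asInstanceOf[StructType]

    // Print each column with its Spark type and nullability.
    schema.fields.foreach { f =>
      println(s"${f.name}: ${f.dataType.simpleString}, nullable=${f.nullable}")
    }
  }
}
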
In this format, you must map each column of the input schema being read to one of the following Spark types. (The configuration for writes is determined dynamically by ybload and the Spark export process.) Typical mappings to Yellowbrick data types are shown in parentheses:

  • decimal (decimal)
  • double (double precision)
  • float (real)
  • long (bigint)
  • integer (int/integer)
  • short (smallint)
  • byte (tinyint)
  • string (varchar)
  • boolean (bool/boolean)
  • date (date)
  • timestamp (timestamp)

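As an illustration of these mappings, the following sketch builds the partial schema from the earlier example programmatically in Spark. The Yellowbrick types in the comments are the typical counterparts from the list above, and the column names are taken only from that example:

import org.apache.spark.sql.types._

// Spark type on the left; typical Yellowbrick type in the trailing comment.
val webSalesReadSchema = StructType(Seq(
  StructField("ws_sold_date_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_sold_time_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_ship_date_sk",   IntegerType,       nullable = true),   // int/integer
  StructField("ws_item_sk",        IntegerType,       nullable = false),  // int/integer
  StructField("ws_wholesale_cost", DecimalType(7, 2), nullable = true)    // decimal(7,2)
))

// The JSON accepted by --read-schema can also be generated from the StructType itself.
println(webSalesReadSchema.prettyJson)
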
Note: Yellowbrick does not support all the data types that Spark supports. For example, the array type is not supported.
