Loading Tables from Parquet Files

This section explains how to load a table from Apache Parquet source files (a structured, columnar storage format). Certain load options and parameters that work for flat files are not supported for parquet format loads, and a few options are specific to parquet loads.

Parquet Schema Support

Before attempting to load parquet data into a Yellowbrick table:

Download parquet-tools (Apache Parquet command-line tools and utilities) so that you have a convenient way to inspect the schema of specific files, including data types, input column names, and the structure of individual fields.

These tools are not provided by Yellowbrick; you can download them from various sites. For example, for macOS clients, go to: https://formulae.brew.sh/formula/parquet-tools

Make sure that the target column names in the table DDL match the names in the parquet schema. Source fields (input columns) and target table columns (DDL column names) are matched by name.
Check the data types used in the source files and the overall structure of the data. Native parquet data types are automatically mapped (and cast) to the standard set of Yellowbrick data types; you do not need to specify any mapping. See Parquet Schema Mapping and Type Casting for details.

However, if your source files contain unsupported types or a schema that ybload does not recognize, ybload returns errors by default. The following primitive types, logical annotations, and nested structures are not supported:

INTERVAL logical type
Nested fields (nested data is not loaded by default, but can be serialized and loaded). For example:

message schema {
  optional group employee {
    optional int64 age;
  }
}

Nested LIST and MAP logical types

Fields with a repetition level of repeated that occur outside the context of a LIST logical type. For example, the following schema is not supported:

message schema {
    repeated int64 int64_list;
}

However, the following schema is supported:

message schema {
  optional group int64_list (LIST) {
    repeated group list {
      optional int64 item;
    }
  }
}
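
The repetition rule illustrated by the two schemas above can be sketched as a small check. This is an illustrative model only, not a description of how ybload inspects files: the Field class and the unsupported_repeated function are hypothetical, and the schemas are built by hand rather than read from a Parquet file. The check covers only the repeated-outside-LIST rule, not the other unsupported cases listed earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified model of one Parquet schema field."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    logical_type: Optional[str] = None   # for example "LIST"; None if unannotated
    children: List["Field"] = field(default_factory=list)


def unsupported_repeated(fields: List[Field], parent_is_list: bool = False) -> List[str]:
    """Return names of repeated fields that are not direct children of a LIST group."""
    bad = []
    for f in fields:
        if f.repetition == "repeated" and not parent_is_list:
            bad.append(f.name)
        # Children are in a LIST context only if this field carries the annotation.
        bad.extend(unsupported_repeated(f.children, f.logical_type == "LIST"))
    return bad


# The unsupported schema: a bare repeated field.
flat = [Field("int64_list", "repeated")]

# The supported schema: the repeated group sits inside a LIST-annotated group.
wrapped = [
    Field("int64_list", "optional", "LIST", [
        Field("list", "repeated", None, [Field("item", "optional")]),
    ]),
]

print(unsupported_repeated(flat))     # ['int64_list']
print(unsupported_repeated(wrapped))  # []
```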

In addition, the Apache Parquet specification is constantly evolving, and various implementations exist in the field, some of which do not follow the specification strictly. As a result, ybload may not support certain source files.

The --ignore-unsupported-schema option is a means of bypassing errors for unsupported types and schemas. This option takes effect when the Parquet metadata is read, and removes the unsupported fields from the source file schema before any mapping is done. You need to be aware of the implications of this behavior. For example, if a field is ignored in the parquet file, but required (declared NOT NULL) in the table, ybload returns an error.
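
One way to anticipate this error before running a load is to compare the fields that remain after unsupported ones are dropped against the NOT NULL columns in the table. The following is a minimal sketch under stated assumptions: the missing_required_columns function, the field names, and the table definition are all hypothetical, and the lists are typed in by hand rather than read from a file or a database. Names are compared exactly; whether ybload matches names case-sensitively is not covered here.

```python
from typing import Dict, Iterable, List


def missing_required_columns(
    source_fields: Iterable[str],
    ignored_fields: Iterable[str],
    table_columns: Dict[str, bool],
) -> List[str]:
    """Return NOT NULL table columns that have no matching source field.

    table_columns maps a column name to True if the column is declared NOT NULL.
    """
    remaining = set(source_fields) - set(ignored_fields)
    return [
        col for col, not_null in table_columns.items()
        if not_null and col not in remaining
    ]


# Hypothetical example: "period" is an INTERVAL field, so it is ignored.
source = ["id", "name", "period"]
ignored = ["period"]
table = {"id": True, "name": False, "period": True}

print(missing_required_columns(source, ignored, table))  # ['period']
```

An empty result means that every NOT NULL column still has a source field to load from.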

Although nested data is not supported by default, the --serialize-nested-as-json option serializes nested parquet data in JSON format so that it can be loaded. Serialized JSON strings can be loaded into VARCHAR columns.
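
To give a rough idea of what a serialized value looks like, the sketch below converts a nested record that matches the employee schema shown earlier into a JSON string. The record contents are made up for illustration, and the exact formatting that ybload produces (whitespace, key order) may differ from what Python's json module emits.

```python
import json

# Hypothetical record matching the nested "employee" schema shown earlier.
record = {"employee": {"age": 42}}

# Serialize only the nested group. The result is a plain string,
# which is why it fits in a VARCHAR column.
employee_json = json.dumps(record["employee"])

print(employee_json)  # {"age": 42}
```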

Always specify the --format parquet option in the ybload command. No other special parquet format parameters are required.
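
The options described in this section can be put together as follows. This sketch only builds and prints a command line; it does not run ybload. The build_ybload_command function, the database, table, and file names are hypothetical. The use of -d and -t for the database and table, and the treatment of the two schema options as switches without values, are assumptions to verify against the ybload command reference. Connection options such as host and user are omitted.

```python
import shlex
from typing import List


def build_ybload_command(
    database: str,
    table: str,
    source: str,
    ignore_unsupported: bool = False,
    nested_as_json: bool = False,
) -> List[str]:
    """Build an argument list for a Parquet load (the command is not executed)."""
    # Assumption: -d names the database and -t names the target table.
    cmd = ["ybload", "-d", database, "-t", table, "--format", "parquet"]
    if ignore_unsupported:
        cmd.append("--ignore-unsupported-schema")
    if nested_as_json:
        cmd.append("--serialize-nested-as-json")
    cmd.append(source)
    return cmd


cmd = build_ybload_command(
    "mydb", "employees", "/data/employees.parquet", nested_as_json=True
)
print(shlex.join(cmd))
```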
