Loading Tables from Parquet Files
This section explains how to load a table from Apache Parquet source files (a structured, columnar storage format). Certain load options and parameters that work for flat files are not supported for parquet format loads, and a few options are specific to parquet loads.
Parquet Schema Support
Before attempting to load parquet data into a Yellowbrick table:
- Download parquet-tools (Apache Parquet command-line tools and utilities) so that you have a convenient way to inspect the schema of specific files, including data types, input column names, and the structure of individual fields.
These tools are not provided by Yellowbrick; you can download them from various sites. For example, for macOS clients, go to: https://formulae.brew.sh/formula/parquet-tools
- Make sure that the target column names in the table DDL match the names in the parquet schema. Source fields (input columns) and target table columns (DDL column names) are matched by name.
- Check the data types used in the source files and the overall structure of the data. Native parquet data types are automatically mapped (and cast) to the standard set of Yellowbrick data types; you do not need to specify any mapping. See Parquet Schema Mapping and Type Casting for details.
However, if your source files contain unsupported types or a schema that ybload does not recognize, ybload returns an error by default. Certain primitive types, logical annotations, and nested structures are not supported:
  - INTERVAL logical type
  - Nested fields (nested data is not loaded by default, but can be serialized and loaded). For example:
    message schema {
      optional group employee {
        optional int64 age;
      }
    }
  - Nested LIST and MAP logical types
  - Fields with a repetition level of repeated that occurs outside the context of a LIST logical type. For example, the following schema is not supported:
    message schema {
      repeated int64 int64_list;
    }
    However, the following schema is supported:
    message schema {
      optional group int64_list (LIST) {
        repeated group list {
          optional int64 item;
        }
      }
    }
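The rules in the list above can be sketched as a small pre-load check. This is an illustrative model only, not how ybload inspects a file: the Field class and the unsupported_reason function are hypothetical names, the schema is typed in by hand (for example, from parquet-tools output), and the sketch assumes that "nested LIST and MAP" means a LIST or MAP that appears inside another LIST or MAP. It does not attempt to cover every type that ybload may reject.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for one Parquet schema field."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    physical: Optional[str] = None     # e.g. "int64"; None for a group
    logical: Optional[str] = None      # e.g. "LIST", "MAP", "INTERVAL"
    children: List["Field"] = field(default_factory=list)


def _contains_list_or_map(f: Field) -> bool:
    """True if this field or any descendant is annotated LIST or MAP."""
    return f.logical in ("LIST", "MAP") or any(
        _contains_list_or_map(c) for c in f.children
    )


def unsupported_reason(f: Field) -> Optional[str]:
    """Return why a top-level field is unsupported, or None if no rule applies."""
    if f.logical == "INTERVAL":
        return "INTERVAL logical type"
    if f.repetition == "repeated":
        return "repeated field outside a LIST"
    if f.logical in ("LIST", "MAP"):
        # Assumption: "nested" means a LIST or MAP somewhere inside this one.
        if any(_contains_list_or_map(c) for c in f.children):
            return "nested LIST or MAP"
        return None
    if f.children:
        return "nested field (plain group)"
    return None


# The three example schemas from the list above, written out by hand.
employee = Field("employee", "optional",
                 children=[Field("age", "optional", "int64")])
bad_list = Field("int64_list", "repeated", "int64")
good_list = Field(
    "int64_list", "optional", logical="LIST",
    children=[Field("list", "repeated",
                    children=[Field("item", "optional", "int64")])],
)

for f in (employee, bad_list, good_list):
    print(f.name, "->", unsupported_reason(f) or "supported")
```

Running a check like this before a load gives you a list of fields to fix, serialize, or ignore, rather than discovering them one error at a time.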
In addition, the Apache Parquet specification is constantly evolving, and various implementations exist in the field, some of which do not follow the specification strictly. As a result, ybload may not support certain source files.
The --ignore-unsupported-schema option lets you bypass errors for unsupported types and schemas. This option takes effect when the Parquet metadata is read, and removes the unsupported fields from the source file schema before any mapping is done. You need to be aware of the implications of this behavior. For example, if a field is ignored in the parquet file, but required (declared NOT NULL) in the table, ybload returns an error.
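The interaction between ignored fields and NOT NULL columns can be modeled in a few lines. This is a sketch of the behavior described above, not ybload's implementation: the plan_load function, the field names, and the table definition are all hypothetical, and the model only flags the NOT NULL case that the text calls out.

```python
def plan_load(source_fields, unsupported, table_columns):
    """Model the effect of dropping unsupported fields before name matching.

    source_fields: field names found in the parquet file
    unsupported:   names flagged as unsupported (these are removed first)
    table_columns: mapping of target column name -> True if declared NOT NULL
    Returns (kept_fields, not_null_columns_left_without_a_source_field).
    """
    # Step 1: unsupported fields are removed from the source schema.
    kept = [name for name in source_fields if name not in unsupported]
    # Step 2: remaining fields are matched to table columns by name;
    # a NOT NULL column with no matching source field is a problem.
    missing_required = sorted(
        col for col, not_null in table_columns.items()
        if not_null and col not in kept
    )
    return kept, missing_required


# Hypothetical file and table: "employee" is a nested group, so it is ignored.
kept, missing = plan_load(
    source_fields=["id", "name", "employee"],
    unsupported={"employee"},
    table_columns={"id": True, "name": False, "employee": True},
)
print("fields loaded:", kept)
print("NOT NULL columns with no source field:", missing)
```

In this example the load would be expected to fail because the employee column is declared NOT NULL but its source field was removed from the schema.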
Although nested data is not supported by default, the --serialize-nested-as-json option serializes nested parquet data in JSON format so that it can be loaded. Serialized JSON strings can be loaded into VARCHAR columns.
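To get a feel for what a serialized value looks like and whether it fits the target column, you can reproduce the idea with the Python standard library. This is only an approximation: the exact string that ybload emits (key order, whitespace) is not specified here, and serialize_nested, fits_varchar, and the column width used below are hypothetical.

```python
import json


def serialize_nested(value) -> str:
    """Serialize a nested value to a compact JSON string (illustrative format)."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True)


def fits_varchar(text: str, max_bytes: int) -> bool:
    """Check the UTF-8 byte length against a hypothetical VARCHAR width."""
    return len(text.encode("utf-8")) <= max_bytes


# One record from the "employee" example schema shown earlier.
row = {"employee": {"age": 42}}
employee_json = serialize_nested(row["employee"])

print(employee_json)
print("fits VARCHAR(100):", fits_varchar(employee_json, 100))
```

Sizing the VARCHAR column for the largest serialized value you expect avoids surprises when deeply nested records are loaded.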
- Always specify the --format parquet option in the ybload command. No other special parquet format parameters are required.
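If you generate load commands from a script, a helper like the following keeps --format parquet from being forgotten. It is a sketch under assumptions: build_ybload_command is a hypothetical helper, the database, table, and file path are made up, and the -d and -t options for the database and table should be checked against the ybload options reference for your version. The command is only printed, not executed.

```python
import shlex


def build_ybload_command(table, files, database=None, extra_options=()):
    """Assemble a ybload argument list that always includes --format parquet."""
    argv = ["ybload"]
    if database:
        argv += ["-d", database]          # assumed database option; verify locally
    argv += ["-t", table]                 # assumed table option; verify locally
    argv += ["--format", "parquet"]       # required for parquet loads
    argv += list(extra_options)           # e.g. "--serialize-nested-as-json"
    argv += list(files)
    return argv


# Hypothetical database, table, and source file.
cmd = build_ybload_command(
    table="employees",
    files=["/data/employees.parquet"],
    database="hr",
    extra_options=["--serialize-nested-as-json"],
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means file names with spaces are quoted correctly if the list is later passed to a process launcher.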