Parquet Load Examples
This section shows a few examples of ybload commands that load from parquet source files.
Simple parquet load from the local file system
The following command loads the match table from a local file called match.snappy.parquet:
$ ybload -d premdb --username yellowbrick -t match -W --format parquet
--logfile /tmp/match.log --logfile-log-level info /tmp/match.snappy.parquet
Password for user yellowbrick:
...
10:15:40.089 [ INFO] Logfile written to /tmp/match.log
10:15:40.142 [ INFO] Gathering metadata on input files
10:15:40.658 [ INFO] Bad rows will be written to /tmp/match.snappy.parquet.20211215101540.bad
...
10:15:41.151 [ INFO] Flushing last 8606 rows (of 8606 total) in transaction #1 for match
10:15:41.159 [ INFO] Committing 8606 rows into transaction #1 for match ...
10:15:41.186 [ INFO] Committed transaction #1 after a total of 292604 bytes and 8606 good rows for match
10:15:41.201 [ INFO] READ:41.07KB(17.49KB/s). ROWS G/B: 8606/0( 3.58K/s). WRITE:285.7KB(121.7KB/s). TIME E/R: 0:00:02/ --:--:--
10:15:41.202 [ INFO] SUCCESSFUL BULK LOAD: Loaded 8606 good rows from 1 source(s) in 0:00:02 (READ: 17.49KB/s WRITE: 121.7KB/s)
Loading a table when source and target column names do not align
This example demonstrates a potential problem (and a solution) when the parquet schema and the DDL for the target table do not have identical column names. Here is the schema of a table called newmatch:
premdb=# \d newmatch
Table "public.newmatch"
Column | Type | Modifiers
-------------+-----------------------------+-----------
seasonid | smallint |
newmatchday | timestamp without time zone |
htid | smallint |
atid | smallint |
ftscore | character(3) |
htscore | character(3) |
Distribution: Hash (seasonid)
Here is the schema of the match.snappy.parquet file:
% parquet-tools schema match.snappy.parquet
message spark_schema {
optional int32 seasonid;
optional binary matchday (STRING);
optional int32 htid;
optional int32 atid;
optional binary ftscore (STRING);
optional binary htscore (STRING);
}
Note that the second column in the table, newmatchday, is called matchday in the parquet file. Consequently, when you load newmatch from this file, the load completes but puts NULL values into the newmatchday column:
premdb=# select * from newmatch where seasonid=12 order by 1 limit 5;
seasonid | newmatchday | htid | atid | ftscore | htscore
----------+-------------+------+------+---------+---------
12 | [NULL] | 42 | 75 | 1-2 | 0-2
12 | [NULL] | 42 | 76 | 0-0 | 0-0
12 | [NULL] | 42 | 77 | 1-0 | 0-0
12 | [NULL] | 42 | 81 | 4-3 | 2-1
12 | [NULL] | 42 | 86 | 1-3 | 0-2
(5 rows)
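This kind of mismatch is easy to spot before loading by comparing the two name lists. The following sketch hard-codes the field and column names copied from the schemas above (it does not read the parquet file or the database catalog):

```python
# Field names as reported by parquet-tools for match.snappy.parquet
source_fields = ["seasonid", "matchday", "htid", "atid", "ftscore", "htscore"]

# Column names from the newmatch DDL
target_columns = ["seasonid", "newmatchday", "htid", "atid", "ftscore", "htscore"]

# Target columns with no identically named source field receive NULL values
# (or fail the load if they are declared NOT NULL without a default).
unmatched = [col for col in target_columns if col not in source_fields]
print(unmatched)  # ['newmatchday']
```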
The solution in this case is to use the --field-defs option to override the default field names specified by the source file. You must specify all of the source fields, but you can rename matchday as newmatchday so that the target table accepts it.
$ ybload -d premdb --username bobr -W -t newmatch --format parquet
--field-defs seasonid,newmatchday,htid,atid,ftscore,htscore
/home/brumsby/premdb-snappy-parquet/match.snappy.parquet
Password for user bobr:
...
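The --field-defs value is a positional rename of the source fields: every field is listed in its original order, with new names substituted where needed. As a sketch, the option value used above can be built like this (the field list is copied from the parquet schema shown earlier):

```python
# Source fields in parquet order, and the renames the target table needs
source_fields = ["seasonid", "matchday", "htid", "atid", "ftscore", "htscore"]
renames = {"matchday": "newmatchday"}  # source field name -> target column name

# All source fields must appear, in order, renamed where applicable
field_defs = ",".join(renames.get(f, f) for f in source_fields)
print(field_defs)  # seasonid,newmatchday,htid,atid,ftscore,htscore
```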
Parquet load with nested data inserted as a JSON string
The following example demonstrates how nested data structures in parquet data can be serialized into a single column in the target table. The source file has this schema:
% parquet-tools schema person_data_1.parquet
message schema {
optional binary name (STRING);
optional int64 age;
optional group languages {
optional binary mother_language (STRING);
optional group other_languages (LIST) {
repeated group list {
optional binary item (STRING);
}
}
}
}
The target table for the load, person_lang_not_null_default, has the following DDL:
premdb=# \d person_lang_not_null_default;
Table "public.person_lang_not_null_default"
Column | Type | Modifiers
-----------+------------------------+----------------------------
name | character varying(25) |
age | smallint |
languages | character varying(100) | not null default 'English'
Note that the third column, languages, is declared NOT NULL and has a default value. This column is intended to map to the nested languages structure in the parquet file, which consists of the mother_language and other_languages values.
The following ybload command uses the --serialize-nested-as-json option to load the nested data as a JSON string into a single VARCHAR column:
$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr
--format parquet --serialize-nested-as-json /home/bobr/pqdata/person_data_1.parquet
Password for user bobr:
...
15:46:02.550 [ INFO] Committed transaction #1 after a total of 260 bytes and 3 good rows for person_lang_not_null_default
15:46:02.562 [ INFO] READ: 1.80KB( 1.63KB/s). ROWS G/B: 3/0( 2.73 /s). WRITE:260.0 B(236.2 B/s). TIME E/R: 0:00:01/ --:--:--
15:46:02.562 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in 0:00:01 (READ: 1.63KB/s WRITE: 236.2 B/s)
After the load, the table contains these three rows:
premdb=# select * from person_lang_not_null_default;
name | age | languages
------+-----+-------------------------------------------------------------------------------
Tom | 35 | {"mother_language":"Spanish","other_languages":["English","German","French"]}
Bob | 25 | {"mother_language":"English","other_languages":["German","French"]}
Ruby | 30 | {"mother_language":"English","other_languages":["German","French"]}
(3 rows)
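The serialized value in each row is an ordinary JSON document, so any JSON-aware consumer can parse it after the load. As a sketch, the value loaded for the first row can be reproduced with Python's standard json module (the dictionary below mirrors the nested structure in person_data_1.parquet):

```python
import json

# Nested languages structure from the first row of the source file
languages = {
    "mother_language": "Spanish",
    "other_languages": ["English", "German", "French"],
}

# Compact serialization, matching the value loaded into the VARCHAR column
serialized = json.dumps(languages, separators=(",", ":"))
print(serialized)
# {"mother_language":"Spanish","other_languages":["English","German","French"]}

# The stored string parses back into a structure for downstream use
assert json.loads(serialized)["other_languages"][0] == "English"
```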
The same ybload command without the --serialize-nested-as-json option would cause the load to fail:
$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr
--format parquet /home/bobr/pqdata/person_data_1.parquet
Password for user bobr:
...
15:54:54.851 [ INFO] Gathering metadata on input files
15:54:55.235 [FATAL] Parquet Error: Nested structure not supported, at field:
optional group languages {
optional binary mother_language (STRING);
optional group other_languages (LIST) {
repeated group list {
optional binary item (STRING);
}
}
}
See the following example for details about working around this issue by ignoring the unsupported nested schema.
Parquet load with nested data, ignoring unsupported data types
This example uses the same source file and target table as the previous example. In this case, the --ignore-unsupported-schema option is used to ignore the nested data and instead load the third column of the table with its declared DEFAULT value.
$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr
--format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr:
...
16:32:19.326 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in 0:00:01 (READ: 1.59KB/s WRITE: 40.65 B/s)
The following rows are loaded:
premdb=# select * from person_lang_not_null_default;
name | age | languages
------+-----+-----------
Bob | 25 | English
Ruby | 30 | English
Tom | 35 | English
(3 rows)
If the languages column does not have a DEFAULT value and cannot be NULL, the same load fails:
premdb=# \d person_lang_not_null
Table "public.person_lang_not_null"
Column | Type | Modifiers
-----------+------------------------+-----------
name | character varying(25) |
age | smallint |
languages | character varying(100) | not null
...
$ ./ybload -d premdb -t person_lang_not_null -W --username bobr --format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr:
...
16:43:27.146 [ INFO] Gathering metadata on input files
16:43:27.539 [FATAL] No source field found for the following destination columns:
languages
Source Fields:
name
age
Destination Columns:
name
age
languages
If the languages column is nullable, the same load succeeds, and NULL values are loaded into the languages column:
premdb=# \d person_lang
Table "public.person_lang"
Column | Type | Modifiers
-----------+------------------------+-----------
name | character varying(25) |
age | smallint |
languages | character varying(100) |
...
$ ./ybload -d premdb -t person_lang -W --username bobr
--format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr:
...
16:50:24.463 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in 0:00:01 (READ: 1.64KB/s WRITE: 41.95 B/s)
...
premdb=# select * from person_lang;
name | age | languages
------+-----+-----------
Tom | 35 | [NULL]
Bob | 25 | [NULL]
Ruby | 30 | [NULL]
(3 rows)
Parent topic: Loading Tables from Parquet Files