
Parquet Load Examples

This section shows a few examples of ybload commands that load from parquet source files.

Simple parquet load from the local file system

The following command loads the match table from a local file called match.snappy.parquet:

$ ybload -d premdb --username yellowbrick -t match -W --format parquet \
--logfile /tmp/match.log --logfile-log-level info /tmp/match.snappy.parquet
Password for user yellowbrick: 
...
10:15:40.089 [ INFO] Logfile written to /tmp/match.log
10:15:40.142 [ INFO] Gathering metadata on input files
10:15:40.658 [ INFO] Bad rows will be written to /tmp/match.snappy.parquet.20211215101540.bad
...
10:15:41.151 [ INFO] Flushing last 8606 rows (of 8606 total) in transaction #1 for match
10:15:41.159 [ INFO] Committing 8606 rows into transaction #1 for match ...
10:15:41.186 [ INFO] Committed transaction #1 after a total of 292604 bytes and 8606 good rows for match
10:15:41.201 [ INFO] READ:41.07KB(17.49KB/s). ROWS G/B: 8606/0( 3.58K/s). WRITE:285.7KB(121.7KB/s).  TIME E/R:   0:00:02/ --:--:--
10:15:41.202 [ INFO] SUCCESSFUL BULK LOAD: Loaded 8606 good rows from 1 source(s) in   0:00:02 (READ: 17.49KB/s WRITE: 121.7KB/s)

Loading a table when source and target column names do not align

This example demonstrates a potential problem (and a solution) when the parquet schema and the DDL for the target table do not have identical column names. Here is the schema of a table called newmatch:

premdb=# \d newmatch
               Table "public.newmatch"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 seasonid    | smallint                    | 
 newmatchday | timestamp without time zone | 
 htid        | smallint                    | 
 atid        | smallint                    | 
 ftscore     | character(3)                | 
 htscore     | character(3)                | 

Distribution: Hash (seasonid)

Here is the schema of the match.snappy.parquet file:

% parquet-tools schema match.snappy.parquet
message spark_schema {
  optional int32 seasonid;
  optional binary matchday (STRING);
  optional int32 htid;
  optional int32 atid;
  optional binary ftscore (STRING);
  optional binary htscore (STRING);
}

Note that the second column in the table, newmatchday, is called matchday in the parquet file. Consequently, when you load newmatch from this file, the load completes but puts NULL values into the newmatchday column:

premdb=# select * from newmatch where seasonid=12 order by 1 limit 5;
 seasonid | newmatchday | htid | atid | ftscore | htscore 
----------+-------------+------+------+---------+---------
       12 | [NULL]      |   42 |   75 | 1-2     | 0-2
       12 | [NULL]      |   42 |   76 | 0-0     | 0-0
       12 | [NULL]      |   42 |   77 | 1-0     | 0-0
       12 | [NULL]      |   42 |   81 | 4-3     | 2-1
       12 | [NULL]      |   42 |   86 | 1-3     | 0-2
(5 rows)
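The NULL values appear because ybload matches parquet fields to target columns by name, and no source field is named newmatchday. That matching behavior can be sketched roughly as follows (a simplified illustration in Python, not ybload's actual implementation):

```python
# Simplified sketch of name-based field matching (illustration only;
# not ybload's actual implementation).
def match_row(source_fields, target_columns, row_values):
    """Map one row of source values to target columns by field name.

    A target column with no same-named source field receives None (NULL).
    """
    source = dict(zip(source_fields, row_values))
    return {col: source.get(col) for col in target_columns}

# Parquet file fields vs. target table columns from this example:
fields = ["seasonid", "matchday", "htid", "atid", "ftscore", "htscore"]
columns = ["seasonid", "newmatchday", "htid", "atid", "ftscore", "htscore"]

# Hypothetical row values, for illustration:
row = match_row(fields, columns, [12, "2011-08-13 15:00:00", 42, 75, "1-2", "0-2"])
# "matchday" matches no target column name, so newmatchday gets None,
# while the identically named columns map through unchanged.
```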

The solution in this case is to use the --field-defs option to override the default field names specified by the source file. You must specify all of the source fields, but you can rename matchday as newmatchday so that the target table accepts it.

$ ybload -d premdb --username bobr -W -t newmatch --format parquet \
--field-defs seasonid,newmatchday,htid,atid,ftscore,htscore \
/home/brumsby/premdb-snappy-parquet/match.snappy.parquet
Password for user bobr: 
...

Parquet load with nested data inserted as a JSON string

The following example loads a source file with the following schema to demonstrate how nested data structures in parquet data can be serialized into a single column in the target table:

% parquet-tools schema person_data_1.parquet
message schema {
  optional binary name (STRING);
  optional int64 age;
  optional group languages {
   optional binary mother_language (STRING);
   optional group other_languages (LIST) {
     repeated group list {
       optional binary item (STRING);
     }
   }
  }
}

The target table for the load, person_lang_not_null_default, has the following DDL:

premdb=# \d person_lang_not_null_default;
          Table "public.person_lang_not_null_default"
  Column   |          Type          |         Modifiers          
-----------+------------------------+----------------------------
 name      | character varying(25)  | 
 age       | smallint               | 
 languages | character varying(100) | not null default 'English'

Note that the third column, languages, is declared NOT NULL and has a default value. This column is intended to map to the nested languages structure in the parquet file, which consists of languages, mother_language, and other_languages values.

The following ybload command uses the --serialize-nested-as-json option to load the nested data as a JSON string into a single VARCHAR column:

$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr \
--format parquet --serialize-nested-as-json /home/bobr/pqdata/person_data_1.parquet
Password for user bobr: 
...
15:46:02.550 [ INFO] Committed transaction #1 after a total of 260 bytes and 3 good rows for person_lang_not_null_default
15:46:02.562 [ INFO] READ: 1.80KB( 1.63KB/s). ROWS G/B: 3/0( 2.73 /s). WRITE:260.0 B(236.2 B/s).  TIME E/R:   0:00:01/ --:--:--
15:46:02.562 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in   0:00:01 (READ:  1.63KB/s WRITE: 236.2 B/s)

After the load, the table contains these three rows:

premdb=# select * from person_lang_not_null_default;
 name | age |                                   languages                                   
------+-----+-------------------------------------------------------------------------------
 Tom  |  35 | {"mother_language":"Spanish","other_languages":["English","German","French"]}
 Bob  |  25 | {"mother_language":"English","other_languages":["German","French"]}
 Ruby |  30 | {"mother_language":"English","other_languages":["German","French"]}
(3 rows)
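The strings in the languages column are ordinary compact JSON serializations of the nested group. A minimal Python sketch reproduces the value shown for the first row (for illustration only; the compact separators match the output above):

```python
import json

# Nested value from the parquet "languages" group for the first row:
languages = {
    "mother_language": "Spanish",
    "other_languages": ["English", "German", "French"],
}

# Compact separators (no spaces) reproduce the string shown in the
# languages column of the target table.
serialized = json.dumps(languages, separators=(",", ":"))
print(serialized)
# {"mother_language":"Spanish","other_languages":["English","German","French"]}
```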

The same ybload command without the --serialize-nested-as-json option would cause the load to fail:

$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr \
--format parquet /home/bobr/pqdata/person_data_1.parquet
Password for user bobr: 
...
15:54:54.851 [ INFO] Gathering metadata on input files
15:54:55.235 [FATAL] Parquet Error: Nested structure not supported, at field: 
optional group languages {
  optional binary mother_language (STRING);
  optional group other_languages (LIST) {
   repeated group list {
     optional binary item (STRING);
   }
  }
}

See the following example for details about working around this issue by ignoring the unsupported nested schema.

Parquet load with nested data, ignoring unsupported data types

This example uses the same source file and target table as the previous example. In this case, the --ignore-unsupported-schema option is used to ignore the nested data and instead load the third column of the table with its declared DEFAULT value.

$ ./ybload -d premdb -t person_lang_not_null_default -W --username bobr \
--format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr: 
...
16:32:19.326 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in   0:00:01 (READ:  1.59KB/s WRITE: 40.65 B/s)

The following rows are loaded:

premdb=# select * from person_lang_not_null_default;
 name | age | languages 
------+-----+-----------
 Bob  |  25 | English
 Ruby |  30 | English
 Tom  |  35 | English
(3 rows)

If the languages column does not have a DEFAULT value and cannot be NULL, the same load fails:

premdb=# \d person_lang_not_null
     Table "public.person_lang_not_null"
  Column   |          Type          | Modifiers 
-----------+------------------------+-----------
 name      | character varying(25)  | 
 age       | smallint               | 
 languages | character varying(100) | not null

...
$ ./ybload -d premdb -t person_lang_not_null -W --username bobr --format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr: 
...
16:43:27.146 [ INFO] Gathering metadata on input files
16:43:27.539 [FATAL] No source field found for the following destination columns:
   languages
Source Fields:
   name
   age
Destination Columns:
   name
   age
   languages

If the languages column is nullable, the same load succeeds, and NULL values are loaded into the languages column:

premdb=# \d person_lang
          Table "public.person_lang"
  Column   |          Type          | Modifiers 
-----------+------------------------+-----------
 name      | character varying(25)  | 
 age       | smallint               | 
 languages | character varying(100) | 

...
$ ./ybload -d premdb -t person_lang -W --username bobr \
--format parquet --ignore-unsupported-schema /home/bobr/pqdata/person_data_1.parquet
Password for user bobr: 
...
16:50:24.463 [ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s) in   0:00:01 (READ:  1.64KB/s WRITE: 41.95 B/s)
...
premdb=# select * from person_lang;
 name | age | languages 
------+-----+-----------
 Tom  |  35 | [NULL]
 Bob  |  25 | [NULL]
 Ruby |  30 | [NULL]
(3 rows)
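The three outcomes shown with --ignore-unsupported-schema (DEFAULT value applied, load rejected, NULL loaded) follow from how a target column with no usable source field is resolved. A simplified Python sketch of that decision (illustration only, not ybload's actual logic):

```python
# Simplified sketch of how a target column with no matching source field
# is resolved (illustration only; not ybload's actual logic).
def resolve_unmatched(column):
    """Return the value loaded into an unmatched column, or raise."""
    if column.get("default") is not None:
        return column["default"]   # DEFAULT value is loaded
    if column.get("nullable", True):
        return None                # nullable column receives NULL
    raise ValueError(
        "No source field found for the following destination columns: "
        + column["name"]
    )

# The three variants of the languages column from these examples:
with_default = {"name": "languages", "nullable": False, "default": "English"}
not_null     = {"name": "languages", "nullable": False, "default": None}
nullable     = {"name": "languages", "nullable": True,  "default": None}

print(resolve_unmatched(with_default))  # English
print(resolve_unmatched(nullable))      # None
# resolve_unmatched(not_null) raises an error, analogous to the FATAL
# message in the NOT NULL example above.
```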

Parent topic: Loading Tables from Parquet Files