Appearance
Handling Duplicate Rows
When the --write-op
option is set to insert
, update
, or upsert
, a decision has to be made about matching rows when duplicates are found in the source files. If duplicates exist, ybload
matches on one of them in order to insert, update, or upsert the row in question.
Yellowbrick tables do not enforce primary key and uniqueness constraints; therefore, duplicate rows are generally allowed to be bulk-loaded or inserted. For all loads, duplicate handling applies to new source rows; existing duplicate rows in the table are not affected. For update
and upsert
loads, rows in the source file may be deduplicated, depending on the duplicate-handling method that you choose.
The --duplicate-handler
option provides different ways to manage duplicates:
order by
(SQL clause): source rows are sorted, and the first matching row that is found is chosen to update each row in the target table.none
: source rows are assumed to be unique; theybload
behavior is undefined if duplicate source rows exist. Sometimes the load will fail if duplicate rows are found in the source file. This option is non-deterministic and not recommended if you know the source file has duplicates. If you know that the source file cannot have duplicates,none
is recommended because the load will run faster.random
: non-deterministic; a random matching source row is used to update each row in the target table.
The following tables describe the behavior when different duplicate-handling methods are used for each type of load.
INSERT Loads
Number of matching rows in source file | Behavior |
---|---|
0 | No rows inserted. |
1 | Single row inserted. |
>1 and --duplicate-handler none | Each matching source row is inserted. |
>1 and `--duplicate-handler order by | random` |
UPDATE Loads
Number of matching rows in source file | Number of matching rows in target table | Behavior |
---|---|---|
0 | 0, 1, or >1 | No updates. |
1 | 0 | No updates. |
1 | 1 | The target row is updated with column values from the source row. |
1 | >1 | Each matching target row is updated with column values from the source row. |
>1 and --duplicate-handler none | 0 | No updates. |
>1 and --duplicate-handler none | 1 or >1 | Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - ybload operation fails with an error such as "attempted to update a row multiple times." |
>1 and `--duplicate-handler order by | random` | 0 |
>1 and `--duplicate-handler order by | random` | 1 |
>1 and `--duplicate-handler order by | random` | >1 |
UPSERT Loads
Number of matching rows in source file | Number of matching rows in target table | Behavior |
---|---|---|
0 | 0, 1, or >1 | No updates or inserts. |
1 | 0 | 1 target row inserted. |
1 | 1 | The target row is updated with column values from the source row. |
1 | >1 | Each matching target row is updated with column values from the source row. |
>1 and --duplicate-handler none | 0 | Multiple target rows are inserted (one for each matching source row). |
>1 and --duplicate-handler none | 1 or >1 | Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - ybload operation fails with an error such as "attempted to update a row multiple times." |
>1 and `--duplicate-handler order by | random` | 0 |
>1 and `--duplicate-handler order by | random` | 1 |
>1 and `--duplicate-handler order by | random` | >1 |
DELETE Loads
The --duplicate-handler
option is ignored for delete
loads.
Number of matching rows in source file | Number of matching rows in target table | Behavior |
---|---|---|
0 | 0, 1, or >1 | No deletes. |
1 | 0 | No deletes. |
1 or >1 | 1 or >1 | All matching target rows are deleted. |
>1 | 0 | No deletes. |
>1 | 1 or >1 | All matching target rows are deleted. |