Appearance
Handling Duplicate Rows
When the --write-op option is set to insert, update, or upsert, a decision has to be made about matching rows when duplicates are found in the source files. If duplicates exist, ybload matches on one of them in order to insert, update, or upsert the row in question.
Yellowbrick tables do not enforce primary key and uniqueness constraints; therefore, duplicate rows are generally allowed to be bulk-loaded or inserted. For all loads, duplicate handling applies to new source rows; existing duplicate rows in the table are not affected. For update and upsert loads, rows in the source file may be deduplicated, depending on the duplicate-handling method that you choose.
The --duplicate-handler option provides different ways to manage duplicates:
order by(SQL clause): source rows are sorted, and the first matching row that is found is chosen to update each row in the target table.none: source rows are assumed to be unique; theybloadbehavior is undefined if duplicate source rows exist. Sometimes the load will fail if duplicate rows are found in the source file. This option is non-deterministic and not recommended if you know the source file has duplicates. If you know that the source file cannot have duplicates,noneis recommended because the load will run faster.random: non-deterministic; a random matching source row is used to update each row in the target table.
The following tables describe the behavior when different duplicate-handling methods are used for each type of load.
INSERT Loads
| Number of matching rows in source file | Behavior |
|---|---|
| 0 | No rows inserted. |
| 1 | Single row inserted. |
>1 and --duplicate-handler none | Each matching source row is inserted. |
| >1 and `--duplicate-handler order by | random` |
UPDATE Loads
| Number of matching rows in source file | Number of matching rows in target table | Behavior |
|---|---|---|
| 0 | 0, 1, or >1 | No updates. |
| 1 | 0 | No updates. |
| 1 | 1 | The target row is updated with column values from the source row. |
| 1 | >1 | Each matching target row is updated with column values from the source row. |
>1 and --duplicate-handler none | 0 | No updates. |
>1 and --duplicate-handler none | 1 or >1 | Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - ybload operation fails with an error such as "attempted to update a row multiple times." |
| >1 and `--duplicate-handler order by | random` | 0 |
| >1 and `--duplicate-handler order by | random` | 1 |
| >1 and `--duplicate-handler order by | random` | >1 |
UPSERT Loads
| Number of matching rows in source file | Number of matching rows in target table | Behavior |
|---|---|---|
| 0 | 0, 1, or >1 | No updates or inserts. |
| 1 | 0 | 1 target row inserted. |
| 1 | 1 | The target row is updated with column values from the source row. |
| 1 | >1 | Each matching target row is updated with column values from the source row. |
>1 and --duplicate-handler none | 0 | Multiple target rows are inserted (one for each matching source row). |
>1 and --duplicate-handler none | 1 or >1 | Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - ybload operation fails with an error such as "attempted to update a row multiple times." |
| >1 and `--duplicate-handler order by | random` | 0 |
| >1 and `--duplicate-handler order by | random` | 1 |
| >1 and `--duplicate-handler order by | random` | >1 |
DELETE Loads
The --duplicate-handler option is ignored for delete loads.
| Number of matching rows in source file | Number of matching rows in target table | Behavior |
|---|---|---|
| 0 | 0, 1, or >1 | No deletes. |
| 1 | 0 | No deletes. |
| 1 or >1 | 1 or >1 | All matching target rows are deleted. |
| >1 | 0 | No deletes. |
| >1 | 1 or >1 | All matching target rows are deleted. |