Skip to content

Handling Duplicate Rows

When the --write-op option is set to insert, update, or upsert, a decision has to be made about matching rows when duplicates are found in the source files. If duplicates exist, ybload matches on one of them in order to insert, update, or upsert the row in question.

Yellowbrick tables do not enforce primary key and uniqueness constraints; therefore, duplicate rows are generally allowed to be bulk-loaded or inserted. For all loads, duplicate handling applies to new source rows; existing duplicate rows in the table are not affected. For update and upsert loads, rows in the source file may be deduplicated, depending on the duplicate-handling method that you choose.

The --duplicate-handler option provides different ways to manage duplicates:

  • order by (SQL clause): source rows are sorted, and the first matching row that is found is chosen to update each row in the target table.
  • none: source rows are assumed to be unique; the ybload behavior is undefined if duplicate source rows exist. Sometimes the load will fail if duplicate rows are found in the source file. This option is non-deterministic and not recommended if you know the source file has duplicates. If you know that the source file cannot have duplicates, none is recommended because the load will run faster.
  • random: non-deterministic; a random matching source row is used to update each row in the target table.

The following tables describe the behavior when different duplicate-handling methods are used for each type of load.

INSERT Loads

Number of matching rows in source fileBehavior
0No rows inserted.
1Single row inserted.
>1 and --duplicate-handler noneEach matching source row is inserted.
>1 and `--duplicate-handler order byrandom`

UPDATE Loads

Number of matching rows in source fileNumber of matching rows in target tableBehavior
00, 1, or >1No updates.
10No updates.
11The target row is updated with column values from the source row.
1>1Each matching target row is updated with column values from the source row.
>1 and --duplicate-handler none0No updates.
>1 and --duplicate-handler none1 or >1Not recommended, non-deterministic. Possible outcomes:

- Matching target rows get updated with column values from one of the source rows (but which source row is unknown).
- ybload operation fails with an error such as "attempted to update a row multiple times."

>1 and `--duplicate-handler order byrandom`0
>1 and `--duplicate-handler order byrandom`1
>1 and `--duplicate-handler order byrandom`>1

UPSERT Loads

Number of matching rows in source fileNumber of matching rows in target tableBehavior
00, 1, or >1No updates or inserts.
101 target row inserted.
11The target row is updated with column values from the source row.
1>1Each matching target row is updated with column values from the source row.
>1 and --duplicate-handler none0Multiple target rows are inserted (one for each matching source row).
>1 and --duplicate-handler none1 or >1Not recommended, non-deterministic. Possible outcomes:

- Matching target rows get updated with column values from one of the source rows (but which source row is unknown).
- ybload operation fails with an error such as "attempted to update a row multiple times."

>1 and `--duplicate-handler order byrandom`0
>1 and `--duplicate-handler order byrandom`1
>1 and `--duplicate-handler order byrandom`>1

DELETE Loads

The --duplicate-handler option is ignored for delete loads.

Number of matching rows in source fileNumber of matching rows in target tableBehavior
00, 1, or >1No deletes.
10No deletes.
1 or >11 or >1All matching target rows are deleted.
>10No deletes.
>11 or >1All matching target rows are deleted.