Handling Duplicate Rows

When the --write-op option is set to insert, update, or upsert, a decision has to be made about matching rows when duplicates are found in the source files. If duplicates exist, ybload matches on one of them in order to insert, update, or upsert the row in question.

Yellowbrick tables do not enforce primary key and uniqueness constraints; therefore, duplicate rows are generally allowed to be bulk-loaded or inserted. For all loads, duplicate handling applies to new source rows; existing duplicate rows in the table are not affected. For update and upsert loads, rows in the source file may be deduplicated, depending on the duplicate-handling method that you choose.

The --duplicate-handler option provides different ways to manage duplicates:

order by (SQL clause): source rows are sorted, and the first matching row that is found is chosen to update each row in the target table.
none: source rows are assumed to be unique; the ybload behavior is undefined if duplicate source rows exist. Sometimes the load will fail if duplicate rows are found in the source file. This option is non-deterministic and not recommended if you know the source file has duplicates. If you know that the source file cannot have duplicates, none is recommended because the load will run faster.
random: non-deterministic; a random matching source row is used to update each row in the target table.

The following tables describe the behavior when different duplicate-handling methods are used for each type of load.

INSERT Loads

Number of matching rows in source file	Behavior
0	No rows inserted.
1	Single row inserted.
>1 and `--duplicate-handler none`	Each matching source row is inserted.
>1 and `--duplicate-handler order by	random`

UPDATE Loads

Number of matching rows in source file	Number of matching rows in target table	Behavior
0	0, 1, or >1	No updates.
1	0	No updates.
1	1	The target row is updated with column values from the source row.
1	>1	Each matching target row is updated with column values from the source row.
>1 and `--duplicate-handler none`	0	No updates.
>1 and `--duplicate-handler none`	1 or >1	Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - `ybload` operation fails with an error such as "attempted to update a row multiple times."
>1 and `--duplicate-handler order by	random`	0
>1 and `--duplicate-handler order by	random`	1
>1 and `--duplicate-handler order by	random`	>1

UPSERT Loads

Number of matching rows in source file	Number of matching rows in target table	Behavior
0	0, 1, or >1	No updates or inserts.
1	0	1 target row inserted.
1	1	The target row is updated with column values from the source row.
1	>1	Each matching target row is updated with column values from the source row.
>1 and `--duplicate-handler none`	0	Multiple target rows are inserted (one for each matching source row).
>1 and `--duplicate-handler none`	1 or >1	Not recommended, non-deterministic. Possible outcomes: - Matching target rows get updated with column values from one of the source rows (but which source row is unknown). - `ybload` operation fails with an error such as "attempted to update a row multiple times."
>1 and `--duplicate-handler order by	random`	0
>1 and `--duplicate-handler order by	random`	1
>1 and `--duplicate-handler order by	random`	>1

DELETE Loads

The --duplicate-handler option is ignored for delete loads.

Number of matching rows in source file	Number of matching rows in target table	Behavior
0	0, 1, or >1	No deletes.
1	0	No deletes.
1 or >1	1 or >1	All matching target rows are deleted.
>1	0	No deletes.
>1	1 or >1	All matching target rows are deleted.

Configuring SSL/TLS for Tools and Drivers

Secure Connections for ODBC/JDBC Clients and ybsql

sys.lock

Bulk Load Examples

Running a Bulk Load

Loading Tables from Parquet Files

ybload Command

Loading from Amazon S3

Loading from Azure Blob Storage

Setting up and Running a Spark Job

Setting Up the ybrelay Service

LDAP Authentication

Synchronizing Users and Groups

Appliance: Disk Encryption

Setting Up Encrypted Drives

Remote Diagnostics

System Alerts

Creating an Alert Endpoint

Using the System Management Console

ybcli Reference

ybcli: config

Vanity DNS

Yellowbrick Manager

SQL-Based Loads from External Storage

CLI Install Instructions

Permissions

Private Install Instructions

Azure Reference Terraform

Public Install Instructions

CREATE EXTERNAL FORMAT

CREATE EXTERNAL TABLE

CREATE TABLE

GRANT

SELECT

GROUP BY Clause

Subqueries

Data Type Casting

DECIMAL

SQL String Constants

Aggregate Functions

Conditional Expressions

Datetime Functions

Formatting Functions

JSON Functions

Mathematical Functions

Network Address Functions

Pattern Matching

Regular Expression Details

SQL Operators and Pattern Matching Functions

SQL Conditions

SQL User Defined Function (UDF)

SQL UDF Create Function

String Functions

ENCRYPT_KS

System Functions

Type-Safe Casting Functions

Window Functions

Creating WLM Resource Pools

Creating WLM Rules

Rule Examples

Handling Duplicate Rows ​

INSERT Loads ​

UPDATE Loads ​

UPSERT Loads ​

DELETE Loads ​

Handling Duplicate Rows

INSERT Loads

UPDATE Loads

UPSERT Loads

DELETE Loads