Using Airbyte with Yellowbrick

Airbyte is an open-source platform for extracting and loading data from a wide variety of sources and destinations. Yellowbrick is supported as a destination, which means you can load data from any of the supported sources into Yellowbrick. The Yellowbrick connector supports all modes of synchronization:

  • Full Refresh
  • Incremental Append
  • Incremental Append and Deduplication

In this tutorial you will learn how to configure the Yellowbrick Airbyte connector to load data from a source into Yellowbrick.

Prerequisites

  • Access to Airbyte Cloud, or to a self-managed instance of Airbyte. See the how-to guide for details on installing and running Airbyte on the AWS EKS cluster that also runs Yellowbrick.
  • Read and write access to an AWS S3 bucket, which acts as the data source.
  • Access to the aws command line client.
  • AWS Access Key ID and AWS Secret Access Key for the S3 source bucket. Ensure that the credentials are associated with a user that has permission to list and read objects from the bucket.
  • A Yellowbrick user that has been configured to enable the JSON type and functions and has permission to create a database.
  • Yellowbrick version 7 or greater, with support for JSON/JSONB.

Overview

You will perform the following steps:

  1. Copy a data file to the S3 source bucket
  2. Configure the S3 source connector to read from the bucket
  3. Configure the Yellowbrick destination to write to a database
  4. Synchronize data across the connection
  5. Append and deduplicate data from the source to Yellowbrick

Part 1: Preparing the Source

Save the following JSONL data to a file part1.jsonl:

jsonl
{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", "metering_hour": "2025-04-09T17:00:00", "total_vcpu_seconds": 49886}

This represents fictitious telemetry data for a cloud service. Copy the file into your S3 bucket:

sh
aws s3 cp part1.jsonl s3://airbyte-demo-1234/
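If you are scripting the setup, the sample file can also be generated programmatically. A minimal Python sketch (the record values mirror the JSONL sample above):

```python
import json

# Fictitious telemetry record, matching the JSONL sample above
record = {
    "id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c",
    "metering_hour": "2025-04-09T17:00:00",
    "total_vcpu_seconds": 49886,
}

# JSONL is one JSON object per line
with open("part1.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Upload the resulting file with aws s3 cp, as shown above.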

Part 2: Configuring the Airbyte S3 Source Connector

In the Airbyte UI, navigate to Sources and enter "S3" into the search dialog. Select the S3 connector. Enter the S3 bucket name, in this case airbyte-demo-1234. Select the Add button next to "The list of streams to sync". Change the format to JSONL and give the stream the name "telemetry". Leave the rest of the visible dialogs at their defaults and select Optional Fields. Complete the AWS Access Key ID and AWS Secret Access Key fields to give the connector access to your bucket. Scroll to the bottom of the page and select Set up source.
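For reference, the settings entered above can be pictured as a configuration object. This is an illustrative shape only: the field names mirror the UI form, not necessarily Airbyte's exact API schema, and the bucket name and credentials are placeholders.

```python
# Illustrative shape of the S3 source settings entered in the UI;
# field names mirror the form, not necessarily Airbyte's API schema.
s3_source_config = {
    "bucket": "airbyte-demo-1234",
    "streams": [
        {"name": "telemetry", "format": "jsonl"},
    ],
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",          # placeholder
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",  # placeholder
}
```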

Part 3: Configuring the Airbyte Yellowbrick Destination Connector

On successfully setting up the S3 source connection, Airbyte prompts you to create a connection. Select Create a Connection and in the Search Airbyte Connectors dialog, enter "Yellowbrick". Select the Marketplace tab in the search results and select the Yellowbrick connector. Set the Host field to the address of your Yellowbrick instance. You can obtain this from Yellowbrick Manager > Instance Management > Host/Port.

While in Yellowbrick Manager, create a database to receive the data from Airbyte. The database must be UTF-8 encoded:

sql
CREATE DATABASE airbyte_test ENCODING=UTF8;

In the Airbyte console, continue to complete the configuration of the Yellowbrick destination connector by entering the name of the database created previously in the DB Name field.

Add your Yellowbrick user name in the User field. Then, under Optional fields enter the password associated with the user. Finally, select Set up connection.
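Because Yellowbrick is PostgreSQL-compatible, the Host, DB Name, User, and Password fields entered above map onto a standard libpq-style connection string, which you can use to sanity-check the same credentials with any Postgres client. A minimal sketch (the host and password are placeholder values; 5432 is Yellowbrick's default port):

```python
def libpq_dsn(host: str, dbname: str, user: str, password: str, port: int = 5432) -> str:
    """Build a libpq-style connection string from the destination fields."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

# Placeholder values -- substitute your own instance details
dsn = libpq_dsn("yb.example.com", "airbyte_test", "airbyte_user", "secret")
```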

Part 4: Synchronize Data Between Source and Destination

The Airbyte console will prompt you to select the mode of synchronization. In this instance, under the Schema pane, select Full Refresh | Overwrite from the sync mode dropdown. Select Next to proceed. Finally select Finish & Sync.

Airbyte will proceed to copy data from the S3 bucket into a table that it creates in the airbyte_test database. The table name matches the name of the source stream defined in the S3 connector: telemetry.

Examine the data in the telemetry table in Yellowbrick Manager by executing SELECT * FROM telemetry; in the airbyte_test database:

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
metering_hour: 2025-04-09T17:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part1.jsonl
_ab_source_file_last_modified: 2025-04-16T12:24:45.000000Z
_airbyte_raw_id: 5dc603de-8292-4539-a13e-bf6dff9a0bb3
_airbyte_extracted_at: 2025-04-16T12:58:01.169Z
_airbyte_meta: {"changes":[]}

You will see that the JSONL source data has been flattened into a relational schema by the destination connector. You will also see that in addition to the three fields in our source, other metadata fields have been added by Airbyte.
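Conceptually, the destination connector turns each JSONL object into one row whose columns are the source fields, then appends Airbyte's bookkeeping columns. An illustrative Python sketch of that flattening (not the connector's actual implementation):

```python
import json
import uuid
from datetime import datetime, timezone

def to_row(jsonl_line: str, source_file: str) -> dict:
    """Flatten one JSONL record into a relational row plus Airbyte metadata."""
    row = json.loads(jsonl_line)  # each source field becomes a column
    row["_ab_source_file_url"] = source_file
    row["_airbyte_raw_id"] = str(uuid.uuid4())
    row["_airbyte_extracted_at"] = datetime.now(timezone.utc).isoformat()
    row["_airbyte_meta"] = {"changes": []}
    return row

row = to_row('{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", '
             '"metering_hour": "2025-04-09T17:00:00", "total_vcpu_seconds": 49886}',
             "part1.jsonl")
```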

Part 5: Appending and Deduping Data

The Yellowbrick connector supports appending new data, in addition to fully refreshing the data in the table. To configure appends, navigate to Connections > Schema in Airbyte. Change the sync mode to Incremental | Append + Dedupe. You must also set a primary key, which Airbyte needs in order to avoid adding duplicate data to the table. Select id and metering_hour as the compound primary key and save the changes. Select Sync now at the top of the Airbyte window. After the sync has completed, you will notice that the telemetry table still contains only a single record. In Append mode, Airbyte tracks the last file loaded from S3 and does not reload previously loaded files.
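Append + Dedupe keeps at most one row per primary-key value. The effect of the compound key (id, metering_hour) can be sketched in Python; this is a simplification of Airbyte's actual dedup logic, shown only to illustrate why two records with the same key collapse to one:

```python
def dedupe(rows, keys=("id", "metering_hour")):
    """Keep the last-seen row for each compound primary-key value."""
    latest = {}
    for row in rows:
        latest[tuple(row[k] for k in keys)] = row
    return list(latest.values())

rows = [
    {"id": "a", "metering_hour": "17:00", "total_vcpu_seconds": 100},
    {"id": "a", "metering_hour": "17:00", "total_vcpu_seconds": 120},  # same key: replaces the first
    {"id": "a", "metering_hour": "18:00", "total_vcpu_seconds": 90},   # new key: kept as a second row
]
deduped = dedupe(rows)
```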

Create a new file part5.jsonl containing a new telemetry record:

jsonl
{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", "metering_hour": "2025-04-09T18:00:00", "total_vcpu_seconds": 49886, "num_vcpus": 16}

Notice that, in addition to an updated metering hour, the data contains a new field, num_vcpus. Copy this file into your S3 source bucket. In Airbyte, navigate to Connections > Schema and refresh the source schema. Select Sync now once more.

Once the sync has finished, examine the contents of the telemetry table:

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
num_vcpus: 16
metering_hour: 2025-04-09T18:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part5.jsonl
_ab_source_file_last_modified: 2025-04-16T14:18:02.000000Z
_airbyte_raw_id: 0087fe05-260d-4639-87c4-737642a9913b
_airbyte_extracted_at: 2025-04-16T14:18:43.959Z
_airbyte_meta: {"changes":[]}

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
num_vcpus: (null)
metering_hour: 2025-04-09T17:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part1.jsonl
_ab_source_file_last_modified: 2025-04-16T12:24:45.000000Z
_airbyte_raw_id: 83f3f293-3899-4d5e-a34c-7aee40a84df0
_airbyte_extracted_at: 2025-04-16T14:13:26.781Z
_airbyte_meta: {"changes":[]}

You will see that, in addition to a second record in the table, the table also has a new column, num_vcpus. Airbyte has altered the schema of the destination table in Yellowbrick to match the new source column automatically.
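This schema change can be pictured as the connector diffing incoming fields against existing columns and issuing an ALTER TABLE for anything new. A hedged sketch of that idea (the type mapping here is deliberately simplified and does not reflect the connector's real type rules):

```python
def schema_migrations(existing_columns, record, table="telemetry"):
    """Emit ALTER TABLE statements for fields not yet present as columns."""
    statements = []
    for field, value in record.items():
        if field not in existing_columns:
            # Simplified type mapping for illustration only
            col_type = "BIGINT" if isinstance(value, int) else "VARCHAR(64000)"
            statements.append(f"ALTER TABLE {table} ADD COLUMN {field} {col_type};")
    return statements

existing = {"id", "metering_hour", "total_vcpu_seconds"}
new_record = {"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c",
              "metering_hour": "2025-04-09T18:00:00",
              "total_vcpu_seconds": 49886,
              "num_vcpus": 16}
stmts = schema_migrations(existing, new_record)
```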

See the Airbyte documentation for information on setting up scheduled synchronization, and connecting Yellowbrick to other Airbyte sources.