Using Airbyte with Yellowbrick

Airbyte is an open-source platform for extracting and loading data from a wide variety of sources and destinations. Yellowbrick is supported as a destination, which means you can load data from any of the supported sources into Yellowbrick. The Yellowbrick connector supports all modes of synchronization:

  • Full Refresh
  • Incremental Append
  • Incremental Append and Deduplication

In this tutorial you will learn how to configure the Yellowbrick Airbyte connector to load data from a source into Yellowbrick.

Prerequisites

  • Access to Airbyte Cloud, or to a self-managed instance of Airbyte. See the how-to guide for details on installing and running Airbyte on the AWS EKS cluster that also runs Yellowbrick.
  • Read and write access to an AWS S3 bucket, which acts as the data source.
  • Access to the aws command line client.
  • AWS Access Key ID and AWS Secret Access Key for the S3 source bucket. Ensure that the credentials are associated with a user that has permission to list and read objects from the bucket.
  • A Yellowbrick user that has been configured to enable the JSON type and functions and has permission to create a database.
  • Yellowbrick version 7 or greater, with support for JSON/JSONB.

Overview

You will perform the following steps:

  1. Copy a data file to the S3 source bucket
  2. Configure the S3 source connector to read from the bucket
  3. Configure the Yellowbrick destination to write to a database
  4. Synchronize data across the connection
  5. Append and deduplicate data from the source to Yellowbrick

Part 1: Preparing the Source

Save the following JSONL data to a file part1.jsonl:

jsonl
{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", "metering_hour": "2025-04-09T17:00:00", "total_vcpu_seconds": 49886}

This represents fictitious telemetry data for a cloud service. Copy the file into your S3 bucket:

sh
aws s3 cp part1.jsonl s3://airbyte-demo-1234/
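If you are scripting the setup, the sample file can also be generated programmatically. A minimal Python sketch (the record values mirror the JSONL sample above):

```python
import json

# Fictitious telemetry record, matching the JSONL sample above
record = {
    "id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c",
    "metering_hour": "2025-04-09T17:00:00",
    "total_vcpu_seconds": 49886,
}

# JSONL is one JSON object per line
with open("part1.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Upload the resulting file with aws s3 cp, as shown above.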

Part 2: Configuring the Airbyte S3 Source Connector

In the Airbyte UI, navigate to Sources and enter "S3" into the search dialog. Select the S3 connector. Enter the S3 bucket name, in this case airbyte-demo-1234. Select the Add button next to "The list of streams to sync". Change the format to JSONL and give the stream the name "telemetry". Leave the rest of the visible dialogs at their defaults and select Optional Fields. Complete the AWS Access Key ID and AWS Secret Access Key fields to give the connector access to your bucket. Scroll to the bottom of the page and select Set up source.
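For reference, the settings entered above can be pictured as a configuration object. This is an illustrative shape only: the field names mirror the UI form, not necessarily Airbyte's exact API schema, and the bucket name and credentials are placeholders.

```python
# Illustrative shape of the S3 source settings entered in the UI;
# field names mirror the form, not necessarily Airbyte's API schema.
s3_source_config = {
    "bucket": "airbyte-demo-1234",
    "streams": [
        {"name": "telemetry", "format": "jsonl"},
    ],
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",          # placeholder
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",  # placeholder
}
```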

Part 3: Configuring the Airbyte Yellowbrick Destination Connector

On successfully setting up the S3 source connection, Airbyte prompts you to create a connection. Select Create a Connection and in the Search Airbyte Connectors dialog, enter "Yellowbrick". Select the Marketplace tab in the search results and select the Yellowbrick connector. Set the Host field to the address of your Yellowbrick instance. You can obtain this from Yellowbrick Manager > Instance Management > Host/Port.

While in Yellowbrick Manager, create a database to receive the data from Airbyte. The database must be UTF-8 encoded:

sql
CREATE DATABASE airbyte_test ENCODING=UTF8;

In the Airbyte console, continue to complete the configuration of the Yellowbrick destination connector by entering the name of the database created previously in the DB Name field.

Add your Yellowbrick user name in the User field. Then, under Optional fields enter the password associated with the user. Finally, select Set up connection.
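Because Yellowbrick is PostgreSQL-compatible, the Host, DB Name, User, and Password fields entered above map onto a standard libpq-style connection string, which you can use to sanity-check the same credentials with any Postgres client. A minimal sketch (the host and password are placeholder values; 5432 is Yellowbrick's default port):

```python
def libpq_dsn(host: str, dbname: str, user: str, password: str, port: int = 5432) -> str:
    """Build a libpq-style connection string from the destination fields."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

# Placeholder values -- substitute your own instance details
dsn = libpq_dsn("yb.example.com", "airbyte_test", "airbyte_user", "secret")
```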

Part 4: Synchronize Data Between Source and Destination

The Airbyte console will prompt you to select the mode of synchronization. In this instance, under the Schema pane, select Full Refresh | Overwrite from the sync mode dropdown. Select Next to proceed. Finally select Finish & Sync.

Airbyte will proceed to copy data from the S3 bucket into a table that it creates in the airbyte_test database. The table name matches the name of the source stream defined in the S3 connector: telemetry.

Examine the data in the telemetry table in Yellowbrick Manager by executing SELECT * FROM telemetry; in the airbyte_test database:

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
metering_hour: 2025-04-09T17:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part1.jsonl
_ab_source_file_last_modified: 2025-04-16T12:24:45.000000Z
_airbyte_raw_id: 5dc603de-8292-4539-a13e-bf6dff9a0bb3
_airbyte_extracted_at: 2025-04-16T12:58:01.169Z
_airbyte_meta: {"changes":[]}

You will see that the JSONL source data has been flattened into a relational schema by the destination connector. You will also see that in addition to the three fields in our source, other metadata fields have been added by Airbyte.
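Conceptually, the destination connector turns each JSONL object into one row whose columns are the source fields, then appends Airbyte's bookkeeping columns. An illustrative Python sketch of that flattening (not the connector's actual implementation):

```python
import json
import uuid
from datetime import datetime, timezone

def to_row(jsonl_line: str, source_file: str) -> dict:
    """Flatten one JSONL record into a relational row plus Airbyte metadata."""
    row = json.loads(jsonl_line)  # each source field becomes a column
    row["_ab_source_file_url"] = source_file
    row["_airbyte_raw_id"] = str(uuid.uuid4())
    row["_airbyte_extracted_at"] = datetime.now(timezone.utc).isoformat()
    row["_airbyte_meta"] = {"changes": []}
    return row

row = to_row('{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", '
             '"metering_hour": "2025-04-09T17:00:00", "total_vcpu_seconds": 49886}',
             "part1.jsonl")
```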

Part 5: Appending and Deduping Data

The Yellowbrick connector supports appending new data, in addition to fully refreshing the data in the table. To configure appends, navigate to Connections > Schema in Airbyte. Change the sync mode to Incremental | Append + Dedupe. You must also set a primary key, which Airbyte needs in order to avoid adding duplicate data to the table. Select id and metering_hour as the compound primary key and save the changes. Select Sync now at the top of the Airbyte window. After the sync has completed, you will notice that the telemetry table still contains only a single record. In Append mode, Airbyte tracks the last file loaded from S3 and does not reload previously loaded files.
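Append + Dedupe keeps at most one row per primary-key value. The effect of the compound key (id, metering_hour) can be sketched in Python; this is a simplification of Airbyte's actual dedup logic, shown only to illustrate why two records with the same key collapse to one:

```python
def dedupe(rows, keys=("id", "metering_hour")):
    """Keep the last-seen row for each compound primary-key value."""
    latest = {}
    for row in rows:
        latest[tuple(row[k] for k in keys)] = row
    return list(latest.values())

rows = [
    {"id": "a", "metering_hour": "17:00", "total_vcpu_seconds": 100},
    {"id": "a", "metering_hour": "17:00", "total_vcpu_seconds": 120},  # same key: replaces the first
    {"id": "a", "metering_hour": "18:00", "total_vcpu_seconds": 90},   # new key: kept as a second row
]
deduped = dedupe(rows)
```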

Create a new file part5.jsonl containing a new telemetry record:

jsonl
{"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c", "metering_hour": "2025-04-09T18:00:00", "total_vcpu_seconds": 49886, "num_vcpus": 16}

Notice that, in addition to an updated metering hour, the data contains a new field, num_vcpus. Copy this file into your S3 source bucket. In Airbyte, navigate to Connections > Schema and refresh the source schema. Select Sync now once more.

Once the sync has finished, examine the contents of the telemetry table:

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
num_vcpus: 16
metering_hour: 2025-04-09T18:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part5.jsonl
_ab_source_file_last_modified: 2025-04-16T14:18:02.000000Z
_airbyte_raw_id: 0087fe05-260d-4639-87c4-737642a9913b
_airbyte_extracted_at: 2025-04-16T14:18:43.959Z
_airbyte_meta: {"changes":[]}

id: 707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c
num_vcpus: (null)
metering_hour: 2025-04-09T17:00:00
total_vcpu_seconds: 49886
_ab_source_file_url: part1.jsonl
_ab_source_file_last_modified: 2025-04-16T12:24:45.000000Z
_airbyte_raw_id: 83f3f293-3899-4d5e-a34c-7aee40a84df0
_airbyte_extracted_at: 2025-04-16T14:13:26.781Z
_airbyte_meta: {"changes":[]}

You will see that, in addition to a second record in the table, the table also has a new column, num_vcpus. Airbyte has altered the schema of the destination table in Yellowbrick to match the new source column automatically.
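This schema change can be pictured as the connector diffing incoming fields against existing columns and issuing an ALTER TABLE for anything new. A hedged sketch of that idea (the type mapping here is deliberately simplified and does not reflect the connector's real type rules):

```python
def schema_migrations(existing_columns, record, table="telemetry"):
    """Emit ALTER TABLE statements for fields not yet present as columns."""
    statements = []
    for field, value in record.items():
        if field not in existing_columns:
            # Simplified type mapping for illustration only
            col_type = "BIGINT" if isinstance(value, int) else "VARCHAR(64000)"
            statements.append(f"ALTER TABLE {table} ADD COLUMN {field} {col_type};")
    return statements

existing = {"id", "metering_hour", "total_vcpu_seconds"}
new_record = {"id": "707754c0-2d1e-4e8a-abd8-0ed4b4a02a5c",
              "metering_hour": "2025-04-09T18:00:00",
              "total_vcpu_seconds": 49886,
              "num_vcpus": 16}
stmts = schema_migrations(existing, new_record)
```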

See the Airbyte documentation for information on setting up scheduled synchronization, and connecting Yellowbrick to other Airbyte sources.