Stream Data From Apache Kafka to Yellowbrick

Description

This tutorial explains how to set up an integration between Apache Kafka and Yellowbrick. This integration enables streaming from a Kafka topic directly into a Yellowbrick table.

Learning Objectives

By completing this tutorial, you will understand how to do the following:

  • Configure Kafka Connect for Yellowbrick.
  • Simulate a real-time data stream.
  • Route messages from a Kafka topic directly into a Yellowbrick table.

Solution Overview

Apache Kafka is an open source, distributed event streaming platform used to create high-performance data pipelines, perform streaming analytics, and integrate data for mission-critical applications.

Data Sources Connecting Apache Kafka and Yellowbrick via Kafka Connect

Kafka Connect includes the following two types of connectors:

  • Source Connector: Imports data from source databases and streams CDC (Change Data Capture) updates, time-series data, and other real-time updates to Kafka topics. It can also collect metrics from application servers for low-latency stream processing.

Flow From Source to Kafka via Kafka Connect

  • Sink Connector: Delivers data from Kafka topics into target platforms, such as Yellowbrick for downstream analytics and OLAP workloads.

Flow From Kafka to Yellowbrick via Kafka Connect

Prerequisites

To complete this tutorial, ensure you have the following prerequisites:

  • Apache Kafka (this tutorial covers downloading and starting it if it is not already available on your machine).

Note: If you already have Kafka or are using a managed platform such as Confluent Cloud, skip directly to Set Up Kafka Connect to Stream Source Data to Kafka.

  • A Linux environment is present.

Note: If you are using Windows without the WSL (Windows Subsystem for Linux), replace the .sh commands with their PowerShell or .bat equivalents.

  • Access to a Yellowbrick environment.

Step-By-Step Guide

Set Up Kafka

  1. Download Apache Kafka
    Download the latest Kafka binaries from the Apache Kafka Downloads Page.

Note: This tutorial refers to the installation directory as $KAFKA_HOME throughout.
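For convenience, you can set $KAFKA_HOME as an environment variable for your shell session. The path below is an assumed example for the 3.7.0 Scala 2.13 build; adjust it to wherever you extracted the archive:

```shell
# Hypothetical install path; adjust to where you extracted the Kafka binaries
export KAFKA_HOME="$HOME/kafka_2.13-3.7.0"
echo "$KAFKA_HOME"
```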

  2. Start Services
    a. Open two terminal windows and navigate to $KAFKA_HOME in each. Both services run in the foreground, so run one of the following commands per window:

    bash
    # Terminal 1: start Zookeeper first
    bin/zookeeper-server-start.sh config/zookeeper.properties

    # Terminal 2: start the Kafka broker once Zookeeper is up
    bin/kafka-server-start.sh config/server.properties

    b. Verify that both Zookeeper and Kafka services start without error messages.

  3. Test the Kafka Service

    Use the following bash commands in the terminal to test the Kafka service:

    bash
    # Create a new test topic
    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
    
    # List topics to verify creation
    bin/kafka-topics.sh --list --bootstrap-server localhost:9092
    
    # Produce messages to the Kafka topic (type the lines below, then press Ctrl+C to exit)
    bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
    Hello, World
    It Is a Bright New Day
    Time to Innovate and Make a Change
    
    # Consume messages from the Kafka topic (in a separate terminal)
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
  4. Expected Output

The consumer should print the messages you produced:

Hello, World
It Is a Bright New Day
Time to Innovate and Make a Change

Once verified, you can close the producer and consumer terminals. Leave the Zookeeper and Kafka terminals running; the remaining steps depend on both services.
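The round trip above can also be scripted non-interactively. The sketch below assumes a broker on localhost:9092 and that $KAFKA_HOME/bin is on your PATH; it prints a hint instead of hanging when the CLI scripts are absent:

```shell
# Non-interactive smoke test (assumes a broker on localhost:9092);
# --max-messages makes the consumer exit instead of waiting forever
if command -v kafka-console-producer.sh >/dev/null 2>&1; then
  printf 'Hello, World\n' | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test \
    --from-beginning --max-messages 1
else
  echo "Kafka CLI scripts not on PATH; add \$KAFKA_HOME/bin first"
fi
```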

Set Up Kafka Connect to Stream Source Data to Kafka

Perform the following steps in the terminal:

  1. Configure the Source Connector
    Edit connect-file-source.properties in $KAFKA_HOME/config:

    properties
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=ybd-source-json.txt
    topic=connect-test
  2. Download and Set Up Configuration Files
    Download the required files from the Yellowbrick GitHub Repository and extract them into $KAFKA_HOME/configYB.

  3. Verify the Plugin Path
    In connect-standalone.properties, ensure that plugin.path points to the file-connector JAR (adjust the version to match your Kafka release):

    properties
    plugin.path=libs/connect-file-3.7.0.jar
  4. Prepare the Source Data
    Download your-file.txt and place it in $KAFKA_HOME:

    bash
    cat your-file.txt >> ybd-source-json.txt
  5. Start Kafka Connect

    bash
    bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
  6. Verify Message Reception

    bash
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
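As an additional check, Kafka Connect exposes a REST API (on port 8083 by default) that lists registered connectors. A minimal probe, assuming Connect runs locally, might look like this:

```shell
# Query the Kafka Connect REST API for registered connectors (default port
# 8083; a later step moves it to 8085 for the sink pipeline). Prints a hint
# instead of failing when Connect is not running.
curl -sf http://localhost:8083/connectors || echo "Kafka Connect REST API not reachable on port 8083"
```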

Configure Kafka Connect for Yellowbrick

Perform the following steps in the terminal:

  1. Install YB Tools
    Download and install YBTools.
    Copy the Kafka connector JAR to $KAFKA_HOME/libs:

    bash
    cp /ybtools/integration/kafka/kafka-connect-yellowbrick-6.9.0-SNAPSHOT-shaded.jar $KAFKA_HOME/libs
  2. Create a Target Table in Yellowbrick
    In the Yellowbrick SQL Editor, run:

    sql
    CREATE SCHEMA kafka;
    CREATE TABLE kafka.kafka_ybd_source_load (
        col1 VARCHAR(30),
        col2 VARCHAR(30),
        tstmp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    ) DISTRIBUTE RANDOM;
  3. Configure the Sink Connector
    Edit configYB/connect-YB-sink_t2.properties with your Yellowbrick connection details:

    properties
    yb.hostname=myinstance.aws.yellowbrickcloud.com
    yb.port=5432
    yb.database=odl_user_XXXXXX_db
    yb.username=odl_user_XXXXXX
    yb.password=<your password>
    yb.table=kafka_ybd_source_load
    yb.schema=kafka
    yb.columns=col1,col2
  4. Update the Kafka Connect Port
    To avoid a conflict with the source pipeline started earlier, edit configYB/connect-standalone_t2.properties:

    properties
    listeners=http://0.0.0.0:8085
  5. Start the Sink Connector

    bash
    bin/connect-standalone.sh configYB/connect-standalone_t2.properties configYB/connect-YB-sink_t2.properties
  6. Verify Data Ingestion
    In the Yellowbrick SQL Editor, run:

    sql
    SELECT * FROM kafka.kafka_ybd_source_load;
  7. Test Continuous Streaming
    Add more data to the source file:

    bash
    cat your-file.txt >> ybd-source-json.txt

    Query the Yellowbrick table to confirm new data gets inserted.
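If you prefer the shell over the SQL Editor, a similar check can be run with ybsql, the command-line client installed with YBTools. The sketch below reuses the placeholder connection values from the sink configuration and must be replaced with your own:

```shell
# Hypothetical shell-side verification using ybsql (installed with YBTools);
# host, user, and database are the placeholders from the sink configuration.
if command -v ybsql >/dev/null 2>&1; then
  ybsql -h myinstance.aws.yellowbrickcloud.com -U odl_user_XXXXXX -d odl_user_XXXXXX_db \
        -c "SELECT COUNT(*) FROM kafka.kafka_ybd_source_load;"
else
  echo "ybsql not found; run the query in the Yellowbrick SQL Editor instead"
fi
```

Re-running the count after appending to the source file is a quick way to confirm that streaming is continuous.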

Clean Up

  1. End Kafka Connect processes for both source and sink pipelines.

  2. Stop Kafka and Zookeeper services.

  3. Remove temporary files using the following bash commands:

    bash
    rm /tmp/connect.offsets
    rm -r /tmp/kafka-logs
    rm -r /tmp/zookeeper
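The cleanup steps above can be combined into one sketch, using the kafka-server-stop.sh and zookeeper-server-stop.sh scripts that ship with the Kafka distribution; the -f and -rf flags make the removals safe to repeat:

```shell
# Stop the services (skipped if $KAFKA_HOME is unset or missing),
# then remove the temporary state left by Kafka, Zookeeper, and Connect
if [ -d "${KAFKA_HOME:-}" ]; then
  "$KAFKA_HOME/bin/kafka-server-stop.sh"
  "$KAFKA_HOME/bin/zookeeper-server-stop.sh"
fi
rm -f /tmp/connect.offsets
rm -rf /tmp/kafka-logs /tmp/zookeeper
```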

✅ You have successfully demonstrated data streaming from Kafka into a Yellowbrick table.