Stream Data From Apache Kafka to Yellowbrick

Description

This tutorial explains how to set up an integration between Apache Kafka and Yellowbrick. This integration enables streaming from a Kafka topic directly into a Yellowbrick table.

Learning Objectives

By completing this tutorial, you will understand how to do the following:

  • Configure Kafka Connect for Yellowbrick.
  • Simulate a real-time data stream.
  • Route messages from a Kafka topic directly into a Yellowbrick table.

Solution Overview

Apache Kafka is an open source, distributed event streaming platform used to create high-performance data pipelines, perform streaming analytics, and integrate data for mission-critical applications.

Data Sources Connecting Apache Kafka and Yellowbrick via Kafka Connect

Kafka Connect includes the following two types of connectors:

  • Source Connector: Imports data from source databases and streams CDC (Change Data Capture) updates, time-series data, and other real-time updates to Kafka topics. It can also collect metrics from application servers for low-latency stream processing.

Flow From Source to Kafka via Kafka Connect

  • Sink Connector: Delivers data from Kafka topics into target platforms, such as Yellowbrick for downstream analytics and OLAP workloads.

Flow From Kafka to Yellowbrick via Kafka Connect

Prerequisites

To complete this tutorial, ensure you have the following prerequisites:

  • Apache Kafka (this tutorial covers downloading and starting it if it is not already available on your machine).

Note: If you already have Kafka or are using a managed platform such as Confluent Cloud, skip directly to Set Up Kafka Connect to Stream Source Data to Kafka.

  • A Linux environment is present.

Note: If you are using Windows without the WSL (Windows Subsystem for Linux), replace the .sh commands with their PowerShell or .bat equivalents.

  • Access to a Yellowbrick environment.

Step-By-Step Guide

Set Up Kafka

  1. Download Apache Kafka
    Download the latest Kafka binaries from the Apache Kafka Downloads Page.

Note: This tutorial refers to the installation directory as $KAFKA_HOME throughout.
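For convenience, you can set $KAFKA_HOME as an environment variable for your shell session. The path below is an assumed example for the 3.7.0 Scala 2.13 build; adjust it to wherever you extracted the archive:

```shell
# Hypothetical install path; adjust to where you extracted the Kafka binaries
export KAFKA_HOME="$HOME/kafka_2.13-3.7.0"
echo "$KAFKA_HOME"
```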

  2. Start Services
    a. Open two terminal windows and navigate to $KAFKA_HOME in each. Both services run in the foreground, so run one of the following commands per window:

    bash
    # Terminal 1: start Zookeeper first
    bin/zookeeper-server-start.sh config/zookeeper.properties

    # Terminal 2: start the Kafka broker once Zookeeper is up
    bin/kafka-server-start.sh config/server.properties

    b. Verify that both Zookeeper and Kafka services start without error messages.

  3. Test the Kafka Service

    Use the following bash commands in the terminal to test the Kafka service:

    bash
    # Create a new test topic
    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
    
    # List topics to verify creation
    bin/kafka-topics.sh --list --bootstrap-server localhost:9092
    
    # Produce messages to the Kafka topic (type the lines below, then press Ctrl+C to exit)
    bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
    Hello, World
    It Is a Bright New Day
    Time to Innovate and Make a Change
    
    # Consume messages from the Kafka topic (in a separate terminal)
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
  4. Expected Output

The consumer should print the messages you produced:

Hello, World
It Is a Bright New Day
Time to Innovate and Make a Change

Once verified, you can close the producer and consumer terminals. Leave the Zookeeper and Kafka terminals running; the remaining steps depend on both services.
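The round trip above can also be scripted non-interactively. The sketch below assumes a broker on localhost:9092 and that $KAFKA_HOME/bin is on your PATH; it prints a hint instead of hanging when the CLI scripts are absent:

```shell
# Non-interactive smoke test (assumes a broker on localhost:9092);
# --max-messages makes the consumer exit instead of waiting forever
if command -v kafka-console-producer.sh >/dev/null 2>&1; then
  printf 'Hello, World\n' | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test \
    --from-beginning --max-messages 1
else
  echo "Kafka CLI scripts not on PATH; add \$KAFKA_HOME/bin first"
fi
```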

Set Up Kafka Connect to Stream Source Data to Kafka

Perform the following steps in the terminal:

  1. Configure the Source Connector
    Edit connect-file-source.properties in $KAFKA_HOME/config:

    properties
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=ybd-source-json.txt
    topic=connect-test
  2. Download and Set Up Configuration Files
    Download the required files from the Yellowbrick GitHub Repository and extract them into $KAFKA_HOME/configYB.

  3. Verify the Plugin Path
    In connect-standalone.properties, ensure that plugin.path points to the file-connector JAR (adjust the version to match your Kafka release):

    properties
    plugin.path=libs/connect-file-3.7.0.jar
  4. Prepare the Source Data
    Download your-file.txt and place it in $KAFKA_HOME:

    bash
    cat your-file.txt >> ybd-source-json.txt
  5. Start Kafka Connect

    bash
    bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
  6. Verify Message Reception

    bash
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
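As an additional check, Kafka Connect exposes a REST API (on port 8083 by default) that lists registered connectors. A minimal probe, assuming Connect runs locally, might look like this:

```shell
# Query the Kafka Connect REST API for registered connectors (default port
# 8083; a later step moves it to 8085 for the sink pipeline). Prints a hint
# instead of failing when Connect is not running.
curl -sf http://localhost:8083/connectors || echo "Kafka Connect REST API not reachable on port 8083"
```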

Configure Kafka Connect for Yellowbrick

Perform the following steps in the terminal:

  1. Install YB Tools
    Download and install YBTools.
    Copy the Kafka connector JAR to $KAFKA_HOME/libs:

    bash
    cp /ybtools/integration/kafka/kafka-connect-yellowbrick-6.9.0-SNAPSHOT-shaded.jar $KAFKA_HOME/libs
  2. Create a Target Table in Yellowbrick
    In the Yellowbrick SQL Editor, run:

    sql
    CREATE SCHEMA kafka;
    CREATE TABLE kafka.kafka_ybd_source_load (
        col1 VARCHAR(30),
        col2 VARCHAR(30),
        tstmp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    ) DISTRIBUTE RANDOM;
  3. Configure the Sink Connector
    Edit configYB/connect-YB-sink_t2.properties with your Yellowbrick connection details:

    properties
    yb.hostname=myinstance.aws.yellowbrickcloud.com
    yb.port=5432
    yb.database=odl_user_XXXXXX_db
    yb.username=odl_user_XXXXXX
    yb.password=<your password>
    yb.table=kafka_ybd_source_load
    yb.schema=kafka
    yb.columns=col1,col2
  4. Update the Kafka Connect Port
    To avoid a conflict with the source pipeline started earlier, edit configYB/connect-standalone_t2.properties:

    properties
    listeners=http://0.0.0.0:8085
  5. Start the Sink Connector

    bash
    bin/connect-standalone.sh configYB/connect-standalone_t2.properties configYB/connect-YB-sink_t2.properties
  6. Verify Data Ingestion
    In the Yellowbrick SQL Editor, run:

    sql
    SELECT * FROM kafka.kafka_ybd_source_load;
  7. Test Continuous Streaming
    Add more data to the source file:

    bash
    cat your-file.txt >> ybd-source-json.txt

    Query the Yellowbrick table to confirm new data gets inserted.
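If you prefer the shell over the SQL Editor, a similar check can be run with ybsql, the command-line client installed with YBTools. The sketch below reuses the placeholder connection values from the sink configuration and must be replaced with your own:

```shell
# Hypothetical shell-side verification using ybsql (installed with YBTools);
# host, user, and database are the placeholders from the sink configuration.
if command -v ybsql >/dev/null 2>&1; then
  ybsql -h myinstance.aws.yellowbrickcloud.com -U odl_user_XXXXXX -d odl_user_XXXXXX_db \
        -c "SELECT COUNT(*) FROM kafka.kafka_ybd_source_load;"
else
  echo "ybsql not found; run the query in the Yellowbrick SQL Editor instead"
fi
```

Re-running the count after appending to the source file is a quick way to confirm that streaming is continuous.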

Clean Up

  1. End Kafka Connect processes for both source and sink pipelines.

  2. Stop Kafka and Zookeeper services.

  3. Remove temporary files using the following bash commands:

    bash
    rm /tmp/connect.offsets
    rm -r /tmp/kafka-logs
    rm -r /tmp/zookeeper
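The cleanup steps above can be combined into one sketch, using the kafka-server-stop.sh and zookeeper-server-stop.sh scripts that ship with the Kafka distribution; the -f and -rf flags make the removals safe to repeat:

```shell
# Stop the services (skipped if $KAFKA_HOME is unset or missing),
# then remove the temporary state left by Kafka, Zookeeper, and Connect
if [ -d "${KAFKA_HOME:-}" ]; then
  "$KAFKA_HOME/bin/kafka-server-stop.sh"
  "$KAFKA_HOME/bin/zookeeper-server-stop.sh"
fi
rm -f /tmp/connect.offsets
rm -rf /tmp/kafka-logs /tmp/zookeeper
```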

✅ You have successfully demonstrated data streaming from Kafka into a Yellowbrick table.