Setting up Yellowbrick as a Source in Databricks Unity Catalog

Overview

Databricks Unity Catalog can connect to external data sources such as Yellowbrick, so users can query Yellowbrick directly from Databricks notebooks or SQL Warehouses. A common use case is for data scientists to bring cleansed and vetted "Gold" data from Yellowbrick into Databricks to build predictive models or run AI/ML pipelines.

Limitations

DDL and DML statements (e.g., INSERT, UPDATE, DELETE, CREATE, DROP) are not permitted; the connection is read-only.
Joins are not pushed down to Yellowbrick. They are executed within Databricks, which requires the Yellowbrick data to be transferred to Databricks first. Best practice is to create a view in Yellowbrick that performs the join and then reference that view from Databricks.
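
The view-based workaround can be sketched as follows. This is a minimal illustration under stated assumptions, not a documented procedure: the schema, table, column, view, and catalog names (sales.orders, sales.customers, sales.orders_with_customers, yb_catalog) are all hypothetical. The first statement would be run in Yellowbrick (for example with ybsql); the second is the kind of query Databricks would then issue against the pre-joined view.

```python
# All object names below are hypothetical; adjust them to your own schema.
YB_VIEW = "sales.orders_with_customers"


def yellowbrick_view_ddl(view: str = YB_VIEW) -> str:
    """Return DDL to run in Yellowbrick so that the join executes there."""
    return (
        f"CREATE VIEW {view} AS\n"
        "SELECT o.order_id, o.order_total, c.customer_name\n"
        "FROM sales.orders AS o\n"
        "JOIN sales.customers AS c ON c.customer_id = o.customer_id;"
    )


def databricks_query(catalog: str, view: str = YB_VIEW) -> str:
    """Return a Databricks query that reads the pre-joined view."""
    return f"SELECT * FROM {catalog}.{view}"


print(yellowbrick_view_ddl())
print(databricks_query("yb_catalog"))
```

Because Databricks sees the view as a single foreign table, it has no join of its own to execute; only the already-joined rows are transferred.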

Prerequisites

If Yellowbrick is installed in a private VPC, the VPCs for Yellowbrick and Databricks must be peered. See VPC Peering for more information.
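
One way to sanity-check the network path before creating the connection is a plain TCP probe run from a Databricks notebook. This is a minimal sketch, not part of the Yellowbrick or Databricks tooling: the helper name can_reach is hypothetical, and port 5432 is taken from the default mentioned in the steps below. It only confirms that the port accepts connections, not that credentials are valid.

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling can_reach with your Yellowbrick FQDN from the same compute that will use the connection should return True once peering is in place. A False result suggests checking peering, routing, or security group rules first.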

Detailed Steps

From the Catalog sidebar in the Databricks console, click +, then select Create a Connection.
On the Connection Basics tab:
- Name: [Name for your connection]
- Connection Type: PostgreSQL
- Click Next.
On the Authentication tab:
- Host: [Your FQDN for Yellowbrick Instance, e.g., yb_prod.elb.us-east-1.amazonaws.com]
- Port: 5432 (or your custom port if changed)
- User: [Username]
- Password: [Password]
- Click Next.
On the Catalog Basics tab:
- Catalog Name: [Name for the new Databricks catalog representing the connection]
- Database: [Yellowbrick Database Name]
- Click Test:
  - Select an existing compute or start a serverless instance.
  - Ensure the connection is successful.
- Click Create Catalog.
On the Access tab:
- Keep the defaults or modify as needed.
- Click Next.
On the Metadata tab:
- Add metadata for Databricks documentation if needed.
- Click Save.
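
The same setup can likely be scripted instead of clicked through. The sketch below builds the two Databricks SQL statements (CREATE CONNECTION and CREATE FOREIGN CATALOG) that correspond to the Authentication and Catalog Basics tabs. It assumes the PostgreSQL connection syntax from Databricks Lakehouse Federation, which the steps above do not show, so verify it against the current Databricks documentation. The connection, catalog, user, and secret names are hypothetical, and the password is read from a Databricks secret rather than written inline.

```python
def create_connection_sql(name: str, host: str, port: int, user: str,
                          secret_scope: str, secret_key: str) -> str:
    """Build a CREATE CONNECTION statement for a PostgreSQL-type connection."""
    return (
        f"CREATE CONNECTION {name} TYPE postgresql\n"
        "OPTIONS (\n"
        f"  host '{host}',\n"
        f"  port '{port}',\n"
        f"  user '{user}',\n"
        f"  password secret('{secret_scope}', '{secret_key}')\n"
        ")"
    )


def create_foreign_catalog_sql(catalog: str, connection: str, database: str) -> str:
    """Build a CREATE FOREIGN CATALOG statement bound to the connection."""
    return (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog}\n"
        f"USING CONNECTION {connection}\n"
        f"OPTIONS (database '{database}')"
    )


# Hypothetical values for illustration only.
conn_sql = create_connection_sql(
    "yb_conn", "yb_prod.elb.us-east-1.amazonaws.com", 5432,
    "dbx_reader", "yb_scope", "yb_password",
)
cat_sql = create_foreign_catalog_sql("yb_catalog", "yb_conn", "gold")
print(conn_sql)
print(cat_sql)
```

In a Databricks notebook, each string would be passed to spark.sql(...) by a user with permission to create connections and catalogs.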

Using the New Catalog

The catalog can now be used as a read-only data source. Its tables can be referenced in notebooks or SQL queries like any other Unity Catalog table. When a query references this catalog, Databricks routes the corresponding queries to the Yellowbrick compute cluster.
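
As a brief illustration, tables in the new catalog are addressed with Unity Catalog's three-level name (catalog.schema.table). The sketch below builds such a name and a read-only query; the catalog, schema, and table names (yb_catalog, public, gold_sales) and the helper qualified_name are hypothetical.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a backtick-quoted three-level Unity Catalog name."""
    parts = (catalog, schema, table)
    # Double any embedded backtick so each identifier stays a single token.
    return ".".join(f"`{part.replace('`', '``')}`" for part in parts)


name = qualified_name("yb_catalog", "public", "gold_sales")
query = f"SELECT * FROM {name} LIMIT 10"
print(query)

# In a Databricks notebook, where `spark` is predefined:
# df = spark.sql(query)
```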
