Setting up Yellowbrick as a Source in Databricks Unity Catalog
Overview
Databricks Unity Catalog allows users to map connections to external data sources, such as Yellowbrick. This enables queries from a Databricks notebook or SQL Warehouse into the Yellowbrick database. One of the most common uses for this mapping is for data scientists to bring cleansed and vetted "Gold" data from Yellowbrick back into Databricks to generate predictive models or run AI/ML pipelines.
Limitations
- DDL and DML statements are not permitted (e.g., INSERT, UPDATE, DELETE, CREATE, DROP).
- Joins are not pushed down to Yellowbrick. All data involved in joins is moved to Databricks and joined there.
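Because write statements are rejected, it can help to screen statements on the client side before sending them to the Yellowbrick-backed catalog. The following is a minimal sketch, not part of Databricks or Yellowbrick: the helper names `first_keyword` and `is_allowed` are hypothetical, and the check is a simple first-keyword heuristic covering only the five statement types listed above.

```python
import re

# Statement types named in the limitation above. Other write statements
# may also be rejected; this set is deliberately limited to the listed ones.
BLOCKED_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "CREATE", "DROP"}


def first_keyword(sql: str) -> str:
    """Return the first word of a statement, upper-cased, skipping comments."""
    # Remove /* ... */ block comments, then -- line comments.
    stripped = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    stripped = re.sub(r"--[^\n]*", " ", stripped)
    match = re.search(r"[A-Za-z]+", stripped)
    return match.group(0).upper() if match else ""


def is_allowed(sql: str) -> bool:
    """Heuristic: reject empty statements and those starting with a blocked keyword."""
    keyword = first_keyword(sql)
    return keyword != "" and keyword not in BLOCKED_KEYWORDS
```

This only inspects the leading keyword, so it is a convenience check rather than a security boundary; the connection itself still enforces the restriction.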
Prerequisites
- If Yellowbrick is installed in a private VPC, the VPCs for Yellowbrick and Databricks must be peered. See VPC Peering for more information.
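One way to sanity-check the network path before creating the connection is a plain TCP probe against the Yellowbrick host and port. This is a sketch using only the Python standard library; the function name `can_reach` is hypothetical, and the probe is only meaningful when run from the same network as the Databricks compute (for example, from a notebook cell).

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (placeholder host taken from the steps below):
# can_reach("yb_prod.elb.us-east-1.amazonaws.com", 5432)
```

A `False` result does not say why the connection failed; peering, security groups, and DNS are all possible causes.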
Detailed Steps
From the Catalog sidebar in the Databricks console, click the + button, then select Create a Connection.
On the Connection Basics tab:
- Name: [Name for your connection]
- Connection Type: PostgreSQL
- Click Next.
On the Authentication tab:
- Host: [Your FQDN for Yellowbrick Instance, e.g., yb_prod.elb.us-east-1.amazonaws.com]
- Port: 5432 (or your custom port if changed)
- User: [Username]
- Password: [Password]
- Click Next.
On the Catalog Basics tab:
- Catalog Name: [Name for the new DB Catalog representing the connection]
- Database: [Yellowbrick Database Name]
- Click Test:
  - Choose your compute if one exists, or start a serverless compute instance.
  - Ensure the connection is successful.
- Click Create Catalog.
On the Access tab:
- Keep the defaults or modify as needed.
- Click Next.
On the Metadata tab:
- Add metadata for Databricks documentation if needed.
- Click Save.
Using the New Catalog
The catalog can now be used as a read-only data source. Its tables can be referenced from notebooks or SQL queries like any other table in Databricks Unity Catalog. As queries hit the new catalog, you should see the corresponding queries executing on the Yellowbrick compute cluster.
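As an illustration, tables in the new catalog are addressed with Unity Catalog's three-level `catalog.schema.table` naming. The sketch below builds such a query; the helper names and the example catalog, schema, table, and column names are hypothetical.

```python
from typing import Optional, Sequence


def quote_identifier(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog table name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


def select_sql(catalog: str, schema: str, table: str,
               columns: Sequence[str], where: Optional[str] = None) -> str:
    """Render a SELECT over a table in the Yellowbrick-backed catalog."""
    column_list = ", ".join(quote_identifier(c) for c in columns)
    sql = f"SELECT {column_list} FROM {qualified_name(catalog, schema, table)}"
    if where:
        # The predicate is passed through as-is; the caller is responsible for it.
        sql += f" WHERE {where}"
    return sql


# In a Databricks notebook, where `spark` is predefined:
# df = spark.sql(select_sql("yb_gold", "public", "sales",
#                           ["region", "amount"], "amount > 100"))
```

Since joins run on the Databricks side, selecting only the columns and rows needed before joining may reduce the amount of data moved out of Yellowbrick.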