
Copying and Syncing Databricks to Yellowbrick

Overview

The Databricks notebook DB2YB can be used to copy tables from Databricks into Yellowbrick and keep them synchronized. This utility is designed to simplify using Yellowbrick as the Gold layer in the Databricks medallion architecture, as detailed here: Gold Layer. This document provides a usage tutorial. Detailed usage instructions and source code can be found on the GitHub page: DB2YB GitHub.

Given a list of table name patterns (wildcard matching via regular expressions), it can:

  • Read the Databricks table definitions
  • Transform the DDL into Yellowbrick format
  • Create the tables in Yellowbrick
  • Move the data in the table to Yellowbrick
    • Optionally using S3 for high-performance parallel loads
    • Limit and Where clauses can be specified
    • Append mode is available for syncing tables
  • Validate that the data transferred properly (by matching row counts)
  • Optionally create a shadow schema with execution history and Databricks lineage information
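
For example, a single call can copy every matching table, restrict rows with a predicate, and stage the transfer through S3. A minimal sketch (the table patterns and predicate are placeholders; it assumes the notebook's DB2YB function has already been defined):

# Copy all tables whose names start with "silver_", plus "big_table";
# the predicate is applied on both Databricks and Yellowbrick, and the
# data is staged through S3 (the default) for a parallel load.
DB2YB(table_patterns=['silver_.*', 'big_table'], where='[Hour] < 12', useS3=True)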

Prerequisites

  • The Notebook is designed to run in the Databricks environment
  • If Yellowbrick is installed in a private VPC, the VPCs for Yellowbrick and Databricks must be peered. See VPC Peering for more information.
  • An S3 bucket set up with an API access key is required if useS3 is true. Using S3 speeds up large data transfers by more than 20x.

NOTE: You must use a compute cluster (not serverless) to run the notebook. The cluster should be:

  • Dedicated
  • Runtime Version: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
  • "Use Photon" enabled
  • Compute node type: Depends on source data size and performance requirements

Running the Code Interactively

The notebook is divided into separate code blocks. This section explains each block.

Code Block 1: Initialization Info

Edit this section to include connection info for Databricks, the S3 bucket for temporary storage, and the Yellowbrick database instance.

  • Lines 1-14: Edit with your connection information (examples are provided in comments).
  • Lines 16-17: Typically computed automatically from the values entered above.
  • Line 19: Controls the number of CSV files created during the extract. A value of 20 is typically good for ~100 million rows. Adjust as needed.
Constant             Description                              Example
YB_HOST              Fully qualified Yellowbrick host name    ybhost.elb.us-east-2.amazonaws.com
YB_USER              Yellowbrick username                     JoeDBA@yellowbrickcloud.com
YB_PASSWORD          Yellowbrick password                     SuperS3cret!
YB_DATABASE          Yellowbrick database name                prod_db
YB_SCHEMA            Yellowbrick schema                       gold
DB_DATABASE          Databricks database/catalog name         corp
DB_SCHEMA            Databricks schema                        silver
DB_LINEAGE_SCHEMA    Databricks lineage schema                yb_gold_lineage
AWS_REGION           AWS region name                          us-east-1
AWS_ACCESS_KEY_ID    AWS S3 access key ID                     AVIAGI4VJJXF34LCFSN2
AWS_SECRET_KEY       AWS S3 secret access key                 0Ps+NeWy1uf5cXfzg8qZoABdwv9oBbJh2Q0n2pB4
BUCKET_NAME          S3 bucket name                           my_bucket
SPARK_PARTITIONS     Number of Spark partitions               Increase if using more nodes
MIN_S3_ROWS          Minimum number of rows to use S3         Using S3 has a startup cost; for small tables, Spark writes are faster
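
For reference, the edited constants at the top of Code Block 1 might look roughly like the sketch below. This is not the notebook's actual code: the values are placeholders taken from the examples above, and the dbutils.secrets calls (with hypothetical scope and key names) are only a suggestion for keeping credentials out of the notebook:

YB_HOST           = "ybhost.elb.us-east-2.amazonaws.com"
YB_USER           = "JoeDBA@yellowbrickcloud.com"
YB_PASSWORD       = dbutils.secrets.get(scope="yb", key="password")        # avoid hard-coding secrets
YB_DATABASE       = "prod_db"
YB_SCHEMA         = "gold"
DB_DATABASE       = "corp"
DB_SCHEMA         = "silver"
DB_LINEAGE_SCHEMA = "yb_gold_lineage"
AWS_REGION        = "us-east-1"
AWS_ACCESS_KEY_ID = dbutils.secrets.get(scope="aws", key="access_key_id")
AWS_SECRET_KEY    = dbutils.secrets.get(scope="aws", key="secret_key")
BUCKET_NAME       = "my_bucket"
SPARK_PARTITIONS  = 20          # number of CSV files per extract; ~20 suits ~100 million rows
MIN_S3_ROWS       = 1_000_000   # placeholder: below this row count, plain Spark writes are used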

Code Block 2: DB2YB Code

This block defines the DB2YB function but does not execute it.

Code Blocks 3-7: Examples

Sample calls to DB2YB. You can edit and run these examples based on your specific requirements.

Main Command: DB2YB Function

The notebook defines a DB2YB function that accepts named parameters to control operation.

Arguments

Argument         Description                                                              Example
table_patterns   A Python list of regular expression patterns for table matching         table_patterns=['wid.*','big_table']
limit            Limits the number of rows transferred to the number specified           limit=20
where            Adds a WHERE clause to limit rows (see Predicate Specification below)   where='[Hour]<12'
write_ddl        Writes DDLs to the specified directory; the directory must exist        write_ddl='./MyDDLs'
read_ddl         Reads DDLs from the specified directory                                 read_ddl='./MyDDLs'
append           Activates append mode (details below)                                   append='[Hour]>=12'
useS3            Defaults to True; falls back to Spark writes for tables that are too small. Set to False to force Spark writes   useS3=False
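
For instance, write_ddl and read_ddl can be combined into a two-pass flow when you want to inspect or hand-edit the generated Yellowbrick DDL before loading. A sketch (the directory and table pattern are placeholders):

# Pass 1: write the generated Yellowbrick DDL for matching tables to an existing directory.
DB2YB(table_patterns=['dim_.*'], write_ddl='./MyDDLs')

# Pass 2: read the (possibly hand-edited) DDL back from that directory instead of regenerating it.
DB2YB(table_patterns=['dim_.*'], read_ddl='./MyDDLs')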

Predicate Specification

Predicates must run in both Yellowbrick and Databricks, so:

  • Column names should be enclosed in []
  • Special characters can be escaped using URL encoding (for example, %22 for a double quote)

Important: The same predicate is applied on both sides (Databricks and Yellowbrick). Ensure it is correct, especially in append mode.

Examples:

python
DB2YB(table_patterns=['HourlySummary','HourlyDetail'], where='[hour] between 4 and 10')

DB2YB(table_patterns=['.*'], limit=1000000)

Append Mode

Append mode is used to insert only new or updated data into existing Yellowbrick tables. Key behaviors:

  • Does not drop or recreate the table
  • Extracts only data matching the specified WHERE clause from Databricks
  • Deletes matching rows in Yellowbrick to prevent duplicates
  • Loads the new data
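
For example, re-syncing a single day might look like the sketch below (the table and column names are placeholders). Because the same predicate drives both the Databricks extract and the Yellowbrick delete, the selected slice is effectively replaced:

# Rows matching the predicate are deleted from the Yellowbrick table and then
# reloaded from Databricks, so the day is refreshed without creating duplicates.
DB2YB(table_patterns=['HourlyDetail'], append="[EventDate] = '2024-06-01'")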

Running as a Workflow (Non-interactively)

To run the notebook non-interactively, delete the sample code blocks (3-7) and replace them with the DB2YB command most relevant to your workflow. The example below finds the most recent completed hour and appends only that hour of data to Yellowbrick. Replace blocks 3-7 with this code:

# Determine the most recent completed hour and append only that hour's rows.
from datetime import datetime, timedelta
Hour = datetime.now() - timedelta(hours=1)
DB2YB(table_patterns=['HourlyDetails'], append=f"[Hour]={Hour.hour}")
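
If the job needs to be parameterized, one option (not part of the published notebook; the widget name here is hypothetical) is to read a Databricks widget and pass it through:

# Optional: let the job supply the table pattern as a task parameter via a widget.
dbutils.widgets.text("table_pattern", "HourlyDetails")
pattern = dbutils.widgets.get("table_pattern")
DB2YB(table_patterns=[pattern], append=f"[Hour]={Hour.hour}")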

Notes and Tips

  • For large tables, consider increasing the SPARK_PARTITIONS constant for better performance.
  • Default file format is CSV.
  • Always double-check schema and database targeting before running destructive operations.
  • Monitor both S3 and Yellowbrick after completion to validate successful ingestion.

Troubleshooting

  • S3 Errors: Check IAM permissions and AWS region settings.
  • Database Errors: Verify Yellowbrick IP address access rules, login credentials, and schema permissions. Refer to the Yellowbrick VPC Peering guide if needed.
  • Large Table Memory Errors: Repartition the Spark DataFrame into more pieces.
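
For large-table memory errors, the usual fix is to split the work into more, smaller partitions before writing. A minimal standalone PySpark sketch of the idea (not the notebook's actual code; the table name, partition count, and S3 path are placeholders):

# Read the source table and split it into more partitions so each task writes
# a smaller CSV file to the staging bucket.
df = spark.table("corp.silver.big_table").repartition(200)
df.write.mode("overwrite").option("header", "true").csv("s3://my_bucket/tmp/big_table/")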