
Copying and Syncing Databricks to Yellowbrick

Overview

The Databricks notebook DB2YB can be used to copy tables from Databricks into Yellowbrick and keep them synchronized. This utility is designed to simplify using Yellowbrick as the Gold layer in the Databricks medallion architecture, as detailed here: Gold Layer. This document provides a usage tutorial. Detailed usage instructions and source code can be found on the GitHub page: DB2YB GitHub.

Given a list of table name patterns (wildcard matching via regular expressions), it can:

  • Read the Databricks table definitions
  • Transform the DDL into Yellowbrick format
  • Create the tables in Yellowbrick
  • Move the data in the table to Yellowbrick
    • Optionally using S3 for high-performance parallel loads
    • Limit and Where clauses can be specified
    • Append mode is available for syncing tables
  • Validate that the data transferred properly (by matching row counts)
  • Optionally create a shadow schema with execution history and Databricks lineage information
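
For example, a single call can copy every matching table, restrict rows with a predicate, and stage the transfer through S3. A minimal sketch (the table patterns and predicate are placeholders; it assumes the notebook's DB2YB function has already been defined):

# Copy all tables whose names start with "silver_", plus "big_table";
# the predicate is applied on both Databricks and Yellowbrick, and the
# data is staged through S3 (the default) for a parallel load.
DB2YB(table_patterns=['silver_.*', 'big_table'], where='[Hour] < 12', useS3=True)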

Prerequisites

  • The Notebook is designed to run in the Databricks environment
  • If Yellowbrick is installed in a private VPC, the VPCs for Yellowbrick and Databricks must be peered. See VPC Peering for more information.
  • An S3 bucket set up with an API access key is required if useS3 is true. Using S3 speeds up large data transfers by more than 20x.

NOTE: You must use a compute cluster (not serverless) to run the notebook. The cluster should be:

  • Dedicated
  • Runtime Version: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
  • "Use Photon" enabled
  • Compute node type: Depends on source data size and performance requirements

Running the Code Interactively

The notebook is divided into separate code blocks. This section explains each block.

Code Block 1: Initialization Info

Edit this section to include connection info for Databricks, the S3 bucket for temporary storage, and the Yellowbrick database instance.

  • Lines 1-14: Edit with your connection information (examples are provided in comments).
  • Lines 16-17: Typically computed automatically from the values entered above.
  • Line 19: Controls the number of CSV files created during the extract. A value of 20 is typically good for ~100 million rows. Adjust as needed.
Constant             Description                              Example
YB_HOST              Fully qualified Yellowbrick host name    ybhost.elb.us-east-2.amazonaws.com
YB_USER              Yellowbrick username                     JoeDBA@yellowbrickcloud.com
YB_PASSWORD          Yellowbrick password                     SuperS3cret!
YB_DATABASE          Yellowbrick database name                prod_db
YB_SCHEMA            Yellowbrick schema                       gold
DB_DATABASE          Databricks database/catalog name         corp
DB_SCHEMA            Databricks schema                        silver
DB_LINEAGE_SCHEMA    Databricks lineage schema                yb_gold_lineage
AWS_REGION           AWS region name                          us-east-1
AWS_ACCESS_KEY_ID    AWS S3 access key ID                     AVIAGI4VJJXF34LCFSN2
AWS_SECRET_KEY       AWS S3 secret access key                 0Ps+NeWy1uf5cXfzg8qZoABdwv9oBbJh2Q0n2pB4
BUCKET_NAME          S3 bucket name                           my_bucket
SPARK_PARTITIONS     Number of Spark partitions               Increase if using more nodes
MIN_S3_ROWS          Minimum number of rows to use S3         Using S3 has a startup cost; for small tables, Spark writes are faster
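
For reference, the edited constants at the top of Code Block 1 might look roughly like the sketch below. This is not the notebook's actual code: the values are placeholders taken from the examples above, and the dbutils.secrets calls (with hypothetical scope and key names) are only a suggestion for keeping credentials out of the notebook:

YB_HOST           = "ybhost.elb.us-east-2.amazonaws.com"
YB_USER           = "JoeDBA@yellowbrickcloud.com"
YB_PASSWORD       = dbutils.secrets.get(scope="yb", key="password")        # avoid hard-coding secrets
YB_DATABASE       = "prod_db"
YB_SCHEMA         = "gold"
DB_DATABASE       = "corp"
DB_SCHEMA         = "silver"
DB_LINEAGE_SCHEMA = "yb_gold_lineage"
AWS_REGION        = "us-east-1"
AWS_ACCESS_KEY_ID = dbutils.secrets.get(scope="aws", key="access_key_id")
AWS_SECRET_KEY    = dbutils.secrets.get(scope="aws", key="secret_key")
BUCKET_NAME       = "my_bucket"
SPARK_PARTITIONS  = 20          # number of CSV files per extract; ~20 suits ~100 million rows
MIN_S3_ROWS       = 1_000_000   # placeholder: below this row count, plain Spark writes are used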

Code Block 2: DB2YB Code

This block defines the DB2YB function but does not execute it.

Code Blocks 3-7: Examples

Sample calls to DB2YB. You can edit and run these examples based on your specific requirements.

Main Command: DB2YB Function

The notebook defines a DB2YB function that accepts named parameters to control operation.

Arguments

Argument         Description                                                              Example
table_patterns   A Python list of regular expression patterns for table matching         table_patterns=['wid.*','big_table']
limit            Limits the number of rows transferred to the number specified           limit=20
where            Adds a WHERE clause to limit rows (see Predicate Specification below)   where='[Hour]<12'
write_ddl        Writes DDLs to the specified directory; the directory must exist        write_ddl='./MyDDLs'
read_ddl         Reads DDLs from the specified directory                                 read_ddl='./MyDDLs'
append           Activates append mode (details below)                                   append='[Hour]>=12'
useS3            Defaults to True; falls back to Spark writes for tables that are too small. Set to False to force Spark writes   useS3=False
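
For instance, write_ddl and read_ddl can be combined into a two-pass flow when you want to inspect or hand-edit the generated Yellowbrick DDL before loading. A sketch (the directory and table pattern are placeholders):

# Pass 1: write the generated Yellowbrick DDL for matching tables to an existing directory.
DB2YB(table_patterns=['dim_.*'], write_ddl='./MyDDLs')

# Pass 2: read the (possibly hand-edited) DDL back from that directory instead of regenerating it.
DB2YB(table_patterns=['dim_.*'], read_ddl='./MyDDLs')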

Predicate Specification

Predicates must run in both Yellowbrick and Databricks, so:

  • Column names should be enclosed in []
  • Special characters can be escaped using URL encoding (for example, %22 for a double quote)

Important: The same predicate is applied on both sides (Databricks and Yellowbrick). Ensure it is correct, especially in append mode.

Examples:

python
DB2YB(table_patterns=['HourlySummary','HourlyDetail'], where='[hour] between 4 and 10')

DB2YB(table_patterns=['.*'], limit=1000000)

Append Mode

Append mode is used to insert only new or updated data into existing Yellowbrick tables. Key behaviors:

  • Does not drop or recreate the table
  • Extracts only data matching the specified WHERE clause from Databricks
  • Deletes matching rows in Yellowbrick to prevent duplicates
  • Loads the new data
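
For example, re-syncing a single day might look like the sketch below (the table and column names are placeholders). Because the same predicate drives both the Databricks extract and the Yellowbrick delete, the selected slice is effectively replaced:

# Rows matching the predicate are deleted from the Yellowbrick table and then
# reloaded from Databricks, so the day is refreshed without creating duplicates.
DB2YB(table_patterns=['HourlyDetail'], append="[EventDate] = '2024-06-01'")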

Running as a Workflow (Non-interactively)

To run the notebook non-interactively, delete the sample code blocks (3-7) and replace them with the DB2YB command most relevant to your workflow. The example below finds the most recent completed hour and appends only that hour of data to Yellowbrick. Replace blocks 3-7 with this code:

# Determine the most recent completed hour and append only that hour's rows.
from datetime import datetime, timedelta
Hour = datetime.now() - timedelta(hours=1)
DB2YB(table_patterns=['HourlyDetails'], append=f"[Hour]={Hour.hour}")
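
If the job needs to be parameterized, one option (not part of the published notebook; the widget name here is hypothetical) is to read a Databricks widget and pass it through:

# Optional: let the job supply the table pattern as a task parameter via a widget.
dbutils.widgets.text("table_pattern", "HourlyDetails")
pattern = dbutils.widgets.get("table_pattern")
DB2YB(table_patterns=[pattern], append=f"[Hour]={Hour.hour}")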

Notes and Tips

  • For large tables, consider increasing the SPARK_PARTITIONS constant for better performance.
  • Default file format is CSV.
  • Always double-check schema and database targeting before running destructive operations.
  • Monitor both S3 and Yellowbrick after completion to validate successful ingestion.

Troubleshooting

  • S3 Errors: Check IAM permissions and AWS region settings.
  • Database Errors: Verify Yellowbrick IP address access rules, login credentials, and schema permissions. Refer to the Yellowbrick VPC Peering guide if needed.
  • Large Table Memory Errors: Repartition the Spark DataFrame into more pieces.
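
For large-table memory errors, the usual fix is to split the work into more, smaller partitions before writing. A minimal standalone PySpark sketch of the idea (not the notebook's actual code; the table name, partition count, and S3 path are placeholders):

# Read the source table and split it into more partitions so each task writes
# a smaller CSV file to the staging bucket.
df = spark.table("corp.silver.big_table").repartition(200)
df.write.mode("overwrite").option("header", "true").csv("s3://my_bucket/tmp/big_table/")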