
Using Python with Yellowbrick

In this tutorial, you will learn how to set up a connection to a Yellowbrick database from Python and pull data into a Pandas DataFrame. You will install and use PostgreSQL-compatible libraries for Python to set up the connection and query data in Yellowbrick.

Prerequisites

  • Yellowbrick credentials (host, port, username, password).
  • Pre-loaded NOAA data in your Yellowbrick instance (for example, in the noaa_ghcn_pds.observations table). If you need to load the NOAA sample data, see these instructions.

Overview

We will walk through the following steps:

  1. Installing the required libraries.
  2. Writing a Python script to connect to Yellowbrick and run SQL directly (using psycopg2).
  3. Writing a Python script to query into Pandas (using pd.read_sql).

Part 1: Installing Python and Supporting Libraries

Linux (Ubuntu/Debian)

Install Python 3 and pip:

sudo apt update
sudo apt install python3 python3-pip

Install psycopg2 (PostgreSQL driver):

pip3 install psycopg2

If the build fails because the PostgreSQL development headers are missing, install the pre-built wheel instead:

pip3 install psycopg2-binary

Install Pandas and NumPy:

pip3 install pandas numpy

macOS

Install Python (via Homebrew):

brew install python

Install psycopg2:

pip3 install psycopg2

Install Pandas and NumPy:

pip3 install pandas numpy

Windows

Install Python:

Download from python.org or via the Microsoft Store. Make sure to select the “Add Python to PATH” option.

Install psycopg2:

pip install psycopg2

Install Pandas and NumPy:

pip install pandas numpy

If you already have Python, make sure psycopg2, pandas, and numpy are up to date (pip install --upgrade psycopg2 pandas numpy).
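To confirm the environment is ready before moving on, a quick sanity check using only the standard library can report any missing packages. This is a minimal sketch, not part of the tutorial's required steps:

```python
import importlib.util

# Packages this tutorial relies on
required = ["psycopg2", "pandas", "numpy"]

# find_spec returns None when a package cannot be imported
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Run it with python3; if anything is listed as missing, repeat the pip install step above.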

Part 2: Query Yellowbrick from Python Using Pure SQL

Create a Python script, for example test.py, to connect to Yellowbrick and query the 10 most recent records:

import psycopg2

# Modify these with your actual Yellowbrick connection details
dbname = "yellowbrick_trial"   # Replace with your Yellowbrick database containing NOAA data
host = ""                      # Complete with your Yellowbrick hostname
port = 5432                    # Default Yellowbrick Postgres port
user = ""                      # Complete with your Yellowbrick username
password = ""                  # Complete with your Yellowbrick password

query = """
SELECT *
FROM noaa_ghcn_pds.observations
ORDER BY year_date DESC, id, element
LIMIT 10;
"""

# Connect to Yellowbrick using psycopg2
conn = psycopg2.connect(
    dbname=dbname,
    host=host,
    port=port,
    user=user,
    password=password
)

# Create a cursor and execute the query
cur = conn.cursor()
cur.execute(query)

# Fetch the results
rows = cur.fetchall()

# Get column names
column_names = [desc[0] for desc in cur.description]

# Print output
print("Most recent 10 records:")
print(column_names)
for row in rows:
    print(row)

# Clean up
cur.close()
conn.close()

How to run:

python3 test.py

(On Windows, use python test.py.)

Explanation:

  • We create a connection to Yellowbrick (which speaks the PostgreSQL wire protocol).
  • We execute the SQL query ordering by year_date DESC and limiting to 10.
  • We fetch the results with fetchall().
  • We close the cursor and connection.
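If you want each row labeled by column name rather than accessed by position, the tuples from fetchall() can be zipped with the names taken from cur.description. A minimal sketch, with made-up NOAA-style values standing in for real query results:

```python
def rows_to_dicts(column_names, rows):
    """Pair each row tuple with its column names, yielding one dict per row."""
    return [dict(zip(column_names, row)) for row in rows]

# Hypothetical values shaped like the observations table
columns = ["id", "year_date", "element", "data_value"]
rows = [
    ("USW00094728", "2024-01-02", "TMAX", 61),
    ("USW00094728", "2024-01-02", "TMIN", 28),
]

for record in rows_to_dicts(columns, rows):
    print(record)
```

In the script above, you would call rows_to_dicts(column_names, rows) after the fetchall() step.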

Part 3: Query Yellowbrick into a Pandas DataFrame

While the above approach works fine, you can also leverage Pandas for a simpler DataFrame workflow. Create a second script, for example test_pandas.py:

import psycopg2
import pandas as pd

# Modify these with your actual Yellowbrick connection details
dbname = "yellowbrick_trial"   # Replace with your Yellowbrick database containing NOAA data
host = ""                      # Complete with your Yellowbrick hostname
port = 5432
user = ""                      # Complete with your Yellowbrick username
password = ""                  # Complete with your Yellowbrick password

query = """
SELECT *
FROM noaa_ghcn_pds.observations
ORDER BY year_date DESC, id, element
LIMIT 10;
"""

# Connect using psycopg2
conn = psycopg2.connect(
    dbname=dbname,
    host=host,
    port=port,
    user=user,
    password=password
)

# Load directly into a Pandas DataFrame
df = pd.read_sql(query, conn)

# Print DataFrame
print("Dataframe of the 10 most recent records:")
print(df)

# Clean up connection
conn.close()

How to run:

python3 test_pandas.py

(On Windows, use python test_pandas.py.)

Explanation:

  • pd.read_sql automatically creates a DataFrame from the query results, using the result set's column names.
  • We see the same 10 records, but now in a Pandas DataFrame.
  • Note that recent Pandas versions emit a UserWarning when read_sql is passed a raw DBAPI connection instead of a SQLAlchemy connectable; the query still works, and the optional sqlalchemy approach below avoids the warning.
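Once the results are in a DataFrame, the usual Pandas tools apply. A short sketch using hypothetical rows shaped like the observations table (the station ID and values here are illustrative, not real query output):

```python
import pandas as pd

# Hypothetical miniature result set standing in for the query output
df = pd.DataFrame({
    "id": ["USW00094728", "USW00094728", "USW00094728"],
    "year_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "element": ["TMAX", "TMAX", "TMIN"],
    "data_value": [55, 61, 28],
})

# Filter and aggregate like any other DataFrame
tmax = df[df["element"] == "TMAX"]
print(tmax["data_value"].mean())        # average of the TMAX rows
print(df.groupby("element").size())     # row count per element
```

The same operations work unchanged on the df returned by pd.read_sql in test_pandas.py.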

Optional: sqlalchemy Approach

If you prefer, you can install sqlalchemy (pip3 install sqlalchemy) and use a connection URL like:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    'postgresql://<user>:<password>@<host>:<port>/<dbname>'
)

df = pd.read_sql("SELECT ...", engine)

But psycopg2 + read_sql is perfectly fine for straightforward use cases.
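One caveat with connection URLs: special characters in the password (such as @ or /) must be percent-encoded, or the URL will parse incorrectly. A minimal sketch using only the standard library; the credentials are placeholders:

```python
from urllib.parse import quote_plus

# Placeholder credentials; replace with your own
user = "myuser"
password = "p@ss/word"     # contains characters that would break a raw URL
host = "yb.example.com"
port = 5432
dbname = "yellowbrick_trial"

# Percent-encode the user and password before building the URL
url = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{dbname}"
print(url)   # password appears as p%40ss%2Fword
```

The resulting string can be passed directly to create_engine.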

NOTE

sqlalchemy inspects the server's reported version string, so you must override the Yellowbrick database version before connecting:

ALTER SYSTEM SET override_version='PostgreSQL 9.5.10 on x86_64-redhat-linux-gnu';
SELECT pg_reload_conf();

Conclusion

You’ve successfully:

  1. Installed Python and the PostgreSQL client libraries needed for Yellowbrick.
  2. Connected to Yellowbrick using psycopg2 to run a direct SQL query.
  3. Queried data from Yellowbrick into a Pandas DataFrame for further analysis or visualization.

With these building blocks, you can continue exploring analyses, visualizations (e.g., using matplotlib), and more advanced data manipulations using Python and Yellowbrick! Happy coding!