
Using Python with Yellowbrick

In this tutorial, you will learn how to set up a connection to a Yellowbrick database from Python and pull data into a Pandas DataFrame. You will install and use PostgreSQL-compatible libraries for Python to set up the connection and query data in Yellowbrick.

Prerequisites

  • Yellowbrick credentials (host, port, username, password).
  • Pre-loaded NOAA data in your Yellowbrick instance (for example, in the noaa_ghcn_pds.observations table). If you need to load the NOAA sample data, see these instructions.

Overview

We will walk through the following steps:

  1. Installing the required libraries.
  2. Writing a Python script to connect to Yellowbrick and run SQL directly (using psycopg2).
  3. Writing a Python script to query into Pandas (using pd.read_sql).

Part 1: Installing Python and Supporting Libraries

Linux (Ubuntu/Debian)

Install Python 3 and pip:

sudo apt update
sudo apt install python3 python3-pip

Install psycopg2 (PostgreSQL driver):

pip3 install psycopg2

If the build fails because the PostgreSQL development headers are missing, install the pre-built wheel instead:

pip3 install psycopg2-binary

Install Pandas and NumPy:

pip3 install pandas numpy

macOS

Install Python (via Homebrew):

brew install python

Install psycopg2:

pip3 install psycopg2

Install Pandas and NumPy:

pip3 install pandas numpy

Windows

Install Python:

Download from python.org or via the Microsoft Store. Make sure to select the “Add Python to PATH” option.

Install psycopg2:

pip install psycopg2

Install Pandas and NumPy:

pip install pandas numpy

If you already have Python, make sure psycopg2, pandas, and numpy are up to date (pip install --upgrade psycopg2 pandas numpy).
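To confirm the environment is ready before moving on, a quick sanity check using only the standard library can report any missing packages. This is a minimal sketch, not part of the tutorial's required steps:

```python
import importlib.util

# Packages this tutorial relies on
required = ["psycopg2", "pandas", "numpy"]

# find_spec returns None when a package cannot be imported
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Run it with python3; if anything is listed as missing, repeat the pip install step above.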

Part 2: Query Yellowbrick from Python Using Pure SQL

Create a Python script, for example test.py, to connect to Yellowbrick and query the 10 most recent records:

import psycopg2

# Modify these with your actual Yellowbrick connection details
dbname = "yellowbrick_trial"   # Replace with your Yellowbrick database containing NOAA data
host = ""                      # Complete with your Yellowbrick hostname
port = 5432                    # Default Yellowbrick Postgres port
user = ""                      # Complete with your Yellowbrick username
password = ""                  # Complete with your Yellowbrick password

query = """
SELECT *
FROM noaa_ghcn_pds.observations
ORDER BY year_date DESC, id, element
LIMIT 10;
"""

# Connect to Yellowbrick using psycopg2
conn = psycopg2.connect(
    dbname=dbname,
    host=host,
    port=port,
    user=user,
    password=password
)

# Create a cursor and execute the query
cur = conn.cursor()
cur.execute(query)

# Fetch the results
rows = cur.fetchall()

# Get column names
column_names = [desc[0] for desc in cur.description]

# Print output
print("Most recent 10 records:")
print(column_names)
for row in rows:
    print(row)

# Clean up
cur.close()
conn.close()

How to run:

python3 test.py

(On Windows, use python test.py.)

Explanation:

  • We create a connection to Yellowbrick (which speaks the PostgreSQL wire protocol).
  • We execute the SQL query ordering by year_date DESC and limiting to 10.
  • We fetch the results with fetchall().
  • We close the cursor and connection.
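If you want each row labeled by column name rather than accessed by position, the tuples from fetchall() can be zipped with the names taken from cur.description. A minimal sketch, with made-up NOAA-style values standing in for real query results:

```python
def rows_to_dicts(column_names, rows):
    """Pair each row tuple with its column names, yielding one dict per row."""
    return [dict(zip(column_names, row)) for row in rows]

# Hypothetical values shaped like the observations table
columns = ["id", "year_date", "element", "data_value"]
rows = [
    ("USW00094728", "2024-01-02", "TMAX", 61),
    ("USW00094728", "2024-01-02", "TMIN", 28),
]

for record in rows_to_dicts(columns, rows):
    print(record)
```

In the script above, you would call rows_to_dicts(column_names, rows) after the fetchall() step.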

Part 3: Query Yellowbrick into a Pandas DataFrame

While the above approach works fine, you can also leverage Pandas for a simpler DataFrame workflow. Create a second script, for example test_pandas.py:

import psycopg2
import pandas as pd

# Modify these with your actual Yellowbrick connection details
dbname = "yellowbrick_trial"   # Replace with your Yellowbrick database containing NOAA data
host = ""                      # Complete with your Yellowbrick hostname
port = 5432
user = ""                      # Complete with your Yellowbrick username
password = ""                  # Complete with your Yellowbrick password

query = """
SELECT *
FROM noaa_ghcn_pds.observations
ORDER BY year_date DESC, id, element
LIMIT 10;
"""

# Connect using psycopg2
conn = psycopg2.connect(
    dbname=dbname,
    host=host,
    port=port,
    user=user,
    password=password
)

# Load directly into a Pandas DataFrame
df = pd.read_sql(query, conn)

# Print DataFrame
print("Dataframe of the 10 most recent records:")
print(df)

# Clean up connection
conn.close()

How to run:

python3 test_pandas.py

(On Windows, use python test_pandas.py.)

Explanation:

  • pd.read_sql automatically creates a DataFrame from the query results, using the result set's column names.
  • We see the same 10 records, but now in a Pandas DataFrame.
  • Note that recent Pandas versions emit a UserWarning when read_sql is passed a raw DBAPI connection instead of a SQLAlchemy connectable; the query still works, and the optional sqlalchemy approach below avoids the warning.
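Once the results are in a DataFrame, the usual Pandas tools apply. A short sketch using hypothetical rows shaped like the observations table (the station ID and values here are illustrative, not real query output):

```python
import pandas as pd

# Hypothetical miniature result set standing in for the query output
df = pd.DataFrame({
    "id": ["USW00094728", "USW00094728", "USW00094728"],
    "year_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "element": ["TMAX", "TMAX", "TMIN"],
    "data_value": [55, 61, 28],
})

# Filter and aggregate like any other DataFrame
tmax = df[df["element"] == "TMAX"]
print(tmax["data_value"].mean())        # average of the TMAX rows
print(df.groupby("element").size())     # row count per element
```

The same operations work unchanged on the df returned by pd.read_sql in test_pandas.py.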

Optional: sqlalchemy Approach

If you prefer, you can install sqlalchemy (pip3 install sqlalchemy) and use a connection URL like:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    'postgresql://<user>:<password>@<host>:<port>/<dbname>'
)

df = pd.read_sql("SELECT ...", engine)

But psycopg2 + read_sql is perfectly fine for straightforward use cases.
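One caveat with connection URLs: special characters in the password (such as @ or /) must be percent-encoded, or the URL will parse incorrectly. A minimal sketch using only the standard library; the credentials are placeholders:

```python
from urllib.parse import quote_plus

# Placeholder credentials; replace with your own
user = "myuser"
password = "p@ss/word"     # contains characters that would break a raw URL
host = "yb.example.com"
port = 5432
dbname = "yellowbrick_trial"

# Percent-encode the user and password before building the URL
url = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{dbname}"
print(url)   # password appears as p%40ss%2Fword
```

The resulting string can be passed directly to create_engine.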

NOTE

sqlalchemy inspects the server's reported version string, so you must override the Yellowbrick database version before connecting:

ALTER SYSTEM SET override_version='PostgreSQL 9.5.10 on x86_64-redhat-linux-gnu';
SELECT pg_reload_conf();

Conclusion

You’ve successfully:

  1. Installed Python and the PostgreSQL client libraries needed for Yellowbrick.
  2. Connected to Yellowbrick using psycopg2 to run a direct SQL query.
  3. Queried data from Yellowbrick into a Pandas DataFrame for further analysis or visualization.

With these building blocks, you can continue exploring analyses, visualizations (e.g., using matplotlib), and more advanced data manipulations using Python and Yellowbrick! Happy coding!