Using R with Yellowbrick

In this tutorial, you will learn how to set up a connection to a Yellowbrick database from a script in R, and pull data into a DataFrame. You will install and use the PostgreSQL libraries for R to set up the connection and access data.

Prerequisites

Credentials to connect to a running Yellowbrick instance by using the command line tools.
Pre-loaded NOAA data. To load the NOAA data, go to page Using Ask TK to Generate SQL, Part 2: Loading the Sample Data and follow the steps to load the data using Yellowbrick Manager.

Overview

In this tutorial you will accomplish the following tasks:

Install the required libraries.
Write a script in R to directly query Yellowbrick using SQL.
Write a script in R that uses dplyr to query Yellowbrick.

Part 1: Installing R and Supporting Libraries

If installing R on a Linux client (e.g. Ubuntu), install the following:

Install R: sudo apt install r-base
Install libpq: sudo apt install libpq-dev

If installing R on MacOS, install the following:

Install R: brew install r
Install libpq: brew install libpq

If installing R on Windows, install the following:

Install R: See instructions
Install PostgreSQL CLI tools

Part 2: Query Yellowbrick from R using SQL

Create an R script, test.r, that will connect to your Yellowbrick instance and extract the most recent 10 records into a DataFrame:

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("RPostgres")
install.packages("DBI")

library(DBI)
library(RPostgres)

dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password
query <- "SELECT * FROM noaa_ghcn_pds.observations ORDER BY year_date DESC, id, element LIMIT 10;"

con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = password,
                 sslmode = "require")

result <- dbGetQuery(con, query)

print(result)

dbDisconnect(con)

Run the script as follows:

Rscript test.r

Note: You might have to run sudo Rscript test.r for the first time to download and install the packages system-wide on Linux. You can comment out the options and install lines in the script for subsequent executions.

You will see output of the form:

            id year_date element data_value m_flag q_flag s_flag obs_time
1  BF1CS000002  20250223    PRCP          0   <NA>   <NA>      N     <NA>
2  BF1CS000002  20250223    SNOW          0   <NA>   <NA>      N     <NA>
3  BF1FP000001  20250223    PRCP          0   <NA>   <NA>      N     <NA>
4  BF1FP000001  20250223    SNOW          0   <NA>   <NA>      N     <NA>
5  BF1NP000001  20250223    PRCP          0   <NA>   <NA>      N     <NA>
6  BF1NP000001  20250223    SNOW          0   <NA>   <NA>      N     <NA>
7  BF1NP000011  20250223    PRCP          0   <NA>   <NA>      N     <NA>
8  BF1NP000011  20250223    SNOW          0   <NA>   <NA>      N     <NA>
9  BF1NP000013  20250223    PRCP          0   <NA>   <NA>      N     <NA>
10 BF1NP000013  20250223    SNOW          0   <NA>   <NA>      N     <NA>

Part 3: Query Yellowbrick from R using `dplyr`

Yellowbrick also supports dplyr as a SQL abstraction layer in R. Create a file test2.r containing the following:

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("dplyr")
install.packages("dbplyr")
install.packages("DBI")
install.packages("RPostgres")

library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)

dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password

con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = password,
                 sslmode = "require")

observations <- tbl(con, in_schema("noaa_ghcn_pds", "observations"))

query_result <- observations %>%
  arrange(desc(year_date), id, element) %>%
  head(10) %>%
  collect()

print(query_result)

dbDisconnect(con)

You will see output in the following form:

   id          year_date element data_value m_flag q_flag s_flag obs_time
   <chr>           <int> <chr>        <dbl> <chr>  <chr>  <chr>  <chr>
 1 BF1CS000002  20250223 PRCP             0 NA     NA     N      NA
 2 BF1CS000002  20250223 SNOW             0 NA     NA     N      NA
 3 BF1FP000001  20250223 PRCP             0 NA     NA     N      NA
 4 BF1FP000001  20250223 SNOW             0 NA     NA     N      NA
 5 BF1NP000001  20250223 PRCP             0 NA     NA     N      NA
 6 BF1NP000001  20250223 SNOW             0 NA     NA     N      NA
 7 BF1NP000011  20250223 PRCP             0 NA     NA     N      NA
 8 BF1NP000011  20250223 SNOW             0 NA     NA     N      NA
 9 BF1NP000013  20250223 PRCP             0 NA     NA     N      NA
10 BF1NP000013  20250223 SNOW             0 NA     NA     N      NA

Now, the R script extracts the most recent 10 records from the Yellowbrick table, this time by using dplyr.

Conclusion

You have learned how to install the R environment and connect to Yellowbrick to extract data into an R DataFrame from a table in Yellowbrick.

Happy calculating!

Bulk Loading Tables

Bulk Load Examples

Running a Bulk Load

Loading Tables from Parquet Files

ybload Command

Load Data with SQL

Loading Data from Object Storage

Loading from Amazon S3

Loading from Azure Blob Storage

Loading Tables with Spark

Setting up and Running a Spark Job

Setting Up the ybrelay Service

Trickle Loading Data via JDBC

Unloading Data to Object Storage

Unloading Data to Parquet Files

ybunload Command

Installing ybtools

Setting Up a Database Connection

Configuring SSL/TLS for Tools and Drivers

Secure Connections for ODBC/JDBC Clients and ybsql

Appliance

Appliance: Disk Encryption

Setting Up Encrypted Drives

Remote Diagnostics

System Alerts

Creating an Alert Endpoint

Using the System Management Console

ybcli Reference

ybcli: config

Cloud

Configuring

Vanity DNS

Yellowbrick Manager

Installing

CLI Install Instructions

Public Install Instructions

Private Install Instructions

Self-Managed Install Instructions

Permissions

Kubernetes Guides

Databases

Backup & Restore

Overview

ybbackup Commands

ybbackupctl Commands

ybrestore Commands

Database Replication

Managing Replication

Setting Up Replication

Encrypting Sensitive Data

LDAP Integration

LDAP Authentication

Synchronizing Users and Groups

Metering

System Views

sys.lock

Workload Management

How WLM Works

Compatibility Parameters

Data Processing and Formatting

Feature Enablement

General

Tuning

Yellowbrick Row Store (YRS) Alerting Parameters

ybsql \copy Command

ybsql Properties and Variables

SQL Commands

CREATE EXTERNAL FORMAT

CREATE EXTERNAL TABLE

CREATE TABLE

GRANT

Plan Hinting

SELECT

GROUP BY Clause

Subqueries

SQL Data Types

Data Type Casting

DECIMAL

JSON

JSONB

Using R with Yellowbrick

Prerequisites

Overview

Part 1: Installing R and Supporting Libraries

Part 2: Query Yellowbrick from R using SQL

Part 3: Query Yellowbrick from R using `dplyr`

Conclusion