Appearance
Using R with Yellowbrick
In this tutorial, you will learn how to set up a connection to a Yellowbrick database from a script in R, and pull data into a DataFrame. You will install and use the PostgreSQL libraries for R to set up the connection and access data.
Prerequisites
- Credentials to connect to a running Yellowbrick instance by using the command line tools.
- Pre-loaded NOAA data. To load the NOAA data, go to page Using Ask TK to Generate SQL, Part 2: Loading the Sample Data and follow the steps to load the data using Yellowbrick Manager.
Overview
In this tutorial you will accomplish the following tasks:
- Install the required libraries.
- Write a script in R to directly query Yellowbrick using SQL.
- Write a script in R that uses
dplyr
to query Yellowbrick.
Part 1: Installing R and Supporting Libraries
If installing R on a Linux client (e.g. Ubuntu), install the following:
- Install R:
sudo apt install r-base
- Install libpq:
sudo apt install libpq-dev
If installing R on MacOS, install the following:
- Install R:
brew install r
- Install libpq:
brew install libpq
If installing R on Windows, install the following:
- Install R: See instructions
- Install PostgreSQL CLI tools
Part 2: Query Yellowbrick from R using SQL
Create an R script, test.r
, that will connect to your Yellowbrick instance and extract the most recent 10 records into a DataFrame:
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("RPostgres")
install.packages("DBI")
library(DBI)
library(RPostgres)
dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password
query <- "SELECT * FROM noaa_ghcn_pds.observations ORDER BY year_date DESC, id, element LIMIT 10;"
con <- dbConnect(Postgres(),
dbname = dbname,
host = host,
port = port,
user = user,
password = password,
sslmode = "require")
result <- dbGetQuery(con, query)
print(result)
dbDisconnect(con)
Run the script as follows:
Rscript test.r
Note: You might have to run
sudo Rscript test.r
for the first time to download and install the packages system-wide on Linux. You can comment out theoptions
andinstall
lines in the script for subsequent executions.
You will see output of the form:
id year_date element data_value m_flag q_flag s_flag obs_time
1 BF1CS000002 20250223 PRCP 0 <NA> <NA> N <NA>
2 BF1CS000002 20250223 SNOW 0 <NA> <NA> N <NA>
3 BF1FP000001 20250223 PRCP 0 <NA> <NA> N <NA>
4 BF1FP000001 20250223 SNOW 0 <NA> <NA> N <NA>
5 BF1NP000001 20250223 PRCP 0 <NA> <NA> N <NA>
6 BF1NP000001 20250223 SNOW 0 <NA> <NA> N <NA>
7 BF1NP000011 20250223 PRCP 0 <NA> <NA> N <NA>
8 BF1NP000011 20250223 SNOW 0 <NA> <NA> N <NA>
9 BF1NP000013 20250223 PRCP 0 <NA> <NA> N <NA>
10 BF1NP000013 20250223 SNOW 0 <NA> <NA> N <NA>
Part 3: Query Yellowbrick from R using dplyr
Yellowbrick also supports dplyr
as a SQL abstraction layer in R. Create a file test2.r
containing the following:
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("dplyr")
install.packages("dbplyr")
install.packages("DBI")
install.packages("RPostgres")
library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)
dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password
con <- dbConnect(Postgres(),
dbname = dbname,
host = host,
port = port,
user = user,
password = password,
sslmode = "require")
observations <- tbl(con, in_schema("noaa_ghcn_pds", "observations"))
query_result <- observations %>%
arrange(desc(year_date), id, element) %>%
head(10) %>%
collect()
print(query_result)
dbDisconnect(con)
You will see output in the following form:
id year_date element data_value m_flag q_flag s_flag obs_time
<chr> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
1 BF1CS000002 20250223 PRCP 0 NA NA N NA
2 BF1CS000002 20250223 SNOW 0 NA NA N NA
3 BF1FP000001 20250223 PRCP 0 NA NA N NA
4 BF1FP000001 20250223 SNOW 0 NA NA N NA
5 BF1NP000001 20250223 PRCP 0 NA NA N NA
6 BF1NP000001 20250223 SNOW 0 NA NA N NA
7 BF1NP000011 20250223 PRCP 0 NA NA N NA
8 BF1NP000011 20250223 SNOW 0 NA NA N NA
9 BF1NP000013 20250223 PRCP 0 NA NA N NA
10 BF1NP000013 20250223 SNOW 0 NA NA N NA
Now, the R script extracts the most recent 10 records from the Yellowbrick table, this time by using dplyr
.
Conclusion
You have learned how to install the R environment and connect to Yellowbrick to extract data into an R DataFrame from a table in Yellowbrick.
Happy calculating!