Skip to content

Using R with Yellowbrick

In this tutorial, you will learn how to set up a connection to a Yellowbrick database from a script in R, and pull data into a DataFrame. You will install and use the PostgreSQL libraries for R to set up the connection and access data.

Prerequisites

Overview

In this tutorial you will accomplish the following tasks:

  1. Install the required libraries.
  2. Write a script in R to directly query Yellowbrick using SQL.
  3. Write a script in R that uses dplyr to query Yellowbrick.

Part 1: Installing R and Supporting Libraries

If installing R on a Linux client (e.g. Ubuntu), install the following:

  • Install R: sudo apt install r-base
  • Install libpq: sudo apt install libpq-dev

If installing R on MacOS, install the following:

  • Install R: brew install r
  • Install libpq: brew install libpq

If installing R on Windows, install the following:

Part 2: Query Yellowbrick from R using SQL

Create an R script, test.r, that will connect to your Yellowbrick instance and extract the most recent 10 records into a DataFrame:

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("RPostgres")
install.packages("DBI")

library(DBI)
library(RPostgres)

dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password
query <- "SELECT * FROM noaa_ghcn_pds.observations ORDER BY year_date DESC, id, element LIMIT 10;"

con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = password,
                 sslmode = "require")

result <- dbGetQuery(con, query)

print(result)

dbDisconnect(con)

Run the script as follows:

Rscript test.r

Note: You might have to run sudo Rscript test.r for the first time to download and install the packages system-wide on Linux. You can comment out the options and install lines in the script for subsequent executions.

You will see output of the form:

            id year_date element data_value m_flag q_flag s_flag obs_time
1  BF1CS000002  20250223    PRCP          0   <NA>   <NA>      N     <NA>
2  BF1CS000002  20250223    SNOW          0   <NA>   <NA>      N     <NA>
3  BF1FP000001  20250223    PRCP          0   <NA>   <NA>      N     <NA>
4  BF1FP000001  20250223    SNOW          0   <NA>   <NA>      N     <NA>
5  BF1NP000001  20250223    PRCP          0   <NA>   <NA>      N     <NA>
6  BF1NP000001  20250223    SNOW          0   <NA>   <NA>      N     <NA>
7  BF1NP000011  20250223    PRCP          0   <NA>   <NA>      N     <NA>
8  BF1NP000011  20250223    SNOW          0   <NA>   <NA>      N     <NA>
9  BF1NP000013  20250223    PRCP          0   <NA>   <NA>      N     <NA>
10 BF1NP000013  20250223    SNOW          0   <NA>   <NA>      N     <NA>

Part 3: Query Yellowbrick from R using dplyr

Yellowbrick also supports dplyr as a SQL abstraction layer in R. Create a file test2.r containing the following:

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("dplyr")
install.packages("dbplyr")
install.packages("DBI")
install.packages("RPostgres")

library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)

dbname <- "yellowbrick_trial" # Replace with the database containing the NOAA data
host <- "" # Complete with your Yellowbrick hostname
port <- 5432
user <- "" # Complete with your Yellowbrick username
password <- "" # Complete with your Yellowbrick password

con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = password,
                 sslmode = "require")

observations <- tbl(con, in_schema("noaa_ghcn_pds", "observations"))

query_result <- observations %>%
  arrange(desc(year_date), id, element) %>%
  head(10) %>%
  collect()

print(query_result)

dbDisconnect(con)

You will see output in the following form:

   id          year_date element data_value m_flag q_flag s_flag obs_time
   <chr>           <int> <chr>        <dbl> <chr>  <chr>  <chr>  <chr>
 1 BF1CS000002  20250223 PRCP             0 NA     NA     N      NA
 2 BF1CS000002  20250223 SNOW             0 NA     NA     N      NA
 3 BF1FP000001  20250223 PRCP             0 NA     NA     N      NA
 4 BF1FP000001  20250223 SNOW             0 NA     NA     N      NA
 5 BF1NP000001  20250223 PRCP             0 NA     NA     N      NA
 6 BF1NP000001  20250223 SNOW             0 NA     NA     N      NA
 7 BF1NP000011  20250223 PRCP             0 NA     NA     N      NA
 8 BF1NP000011  20250223 SNOW             0 NA     NA     N      NA
 9 BF1NP000013  20250223 PRCP             0 NA     NA     N      NA
10 BF1NP000013  20250223 SNOW             0 NA     NA     N      NA

Now, the R script extracts the most recent 10 records from the Yellowbrick table, this time by using dplyr.

Conclusion

You have learned how to install the R environment and connect to Yellowbrick to extract data into an R DataFrame from a table in Yellowbrick.

Happy calculating!