Calling ybload from AWS Lambda

This advanced tutorial explains how to create an AWS Lambda function that loads data into Yellowbrick every time a file is written to an S3 bucket. The Lambda function that you will create uses the ybload command-line tool to perform the data load. The related page AWS Lambda Data Loader explains how to load data with AWS Lambda by using the LOAD TABLE command instead. The ybload tool supports more options than the LOAD TABLE command (e.g., the ability to perform upserts), so it is a useful approach when dealing with more complex load requirements.

Prerequisites

  • An AWS account with permissions to create and deploy resources to AWS using the AWS CLI.
  • Credentials to connect to a running Yellowbrick instance by using the command line tools.
  • The ybtools tarball that contains ybload. The easiest way to obtain this is to download the "generic" package from Yellowbrick Manager.
  • Git installed, to be able to clone the tutorial code.
  • OpenJDK installed, to be able to compile the tutorial code.
  • Maven installed, to be able to build the Lambda function package.

Overview

In this tutorial you will accomplish the following tasks:

  1. Clone the tutorial sample code from GitHub.
  2. Configure the deploy.sh script in the sample code with your AWS and Yellowbrick credentials.
  3. Run deploy.sh to create and deploy the AWS Lambda function along with its dependencies.
  4. Create a Yellowbrick table to be able to receive data from the Lambda function.
  5. Load data from a test CSV file into Yellowbrick by copying the file to an S3 bucket.
  6. Observe the AWS CloudWatch logs to see the progress of the load.
  7. Upsert data into the same Yellowbrick table.
  8. Understand the strengths and limitations of this approach.

Part 1: Clone the Sample Code Repository

Clone the sample code in GitHub using:

bash
git clone git@github.com:markcusack/yellowbrick-learn.git

Part 2: Configure deploy.sh

Go to the yellowbrick-learn/learn/lambda-ybload folder. In a text editor, open deploy.sh, provide values for the unset variables below, and change the others as needed:

bash
YBTOOLS_TGZ_NAME="ybtools-7.1.2-66379.e33141f2.generic.noarch.tar.gz"
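# Derived from YBTOOLS_TGZ_NAME by substituting "ybload" for "ybtools"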
YBLOAD_TGZ_NAME="${YBTOOLS_TGZ_NAME/ybtools/ybload}"

# S3 buckets for the Lambda zip and for the CSV/Parquet files that are dropped in
ZIP_BUCKET=
LANDING_BUCKET=

# Change to suit your AWS account
ACCESS_KEY_ID=
SECRET_ACCESS_KEY=
SESSION_TOKEN=
REGION=
AWS_ACCOUNT=

# Yellowbrick database settings and target table
YB_HOST=
YB_USER=
YB_PASSWORD=
YB_DATABASE=
YB_TABLE=
YBLOAD_EXTRA_ARGS="--disable-trust"

# No need to change these unless there is a conflict
YBLOAD_FUNCTION="YBLoadFunction"
ROLE_NAME="lambda-s3-execution-role"
LAMBDA_JAVA_HOME="/var/lang"
JAR_NAME="ybload-lambda-1.0-SNAPSHOT.jar"
ZIP_NAME="ybload-lambda.zip"

On the first line, set or confirm the name of the ybtools tarball. This should match the name of the file you downloaded as part of the prerequisites.

You must create the following two buckets in AWS S3:

  • One that the Lambda function will load source data from.
  • One that will store the Java classes for the Lambda function you will create.

Create these buckets by using the AWS CLI or through the AWS Console, and set their names (e.g., ybload-source and ybload-zip) in deploy.sh.
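
For example, assuming the bucket names used in this tutorial, you can create them with the AWS CLI:

bash
aws s3 mb s3://ybload-source
aws s3 mb s3://ybload-zip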

Next, you must configure the AWS IDs and secrets that the Lambda function will use to read from S3. Create an AWS IAM user named ybload and obtain AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN values for that user. Apply these values in deploy.sh, and also set the AWS region and AWS account in which you will deploy the Lambda function.
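
If you prefer the CLI, a minimal sketch of creating the user and its credentials might look like the following (the managed policy shown is an assumption for brevity; your account may require a narrower policy, and session tokens, if needed, come from aws sts get-session-token):

bash
# Create the user and give it read access to S3
aws iam create-user --user-name ybload
aws iam attach-user-policy --user-name ybload \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Generate an access key pair; note the AccessKeyId and SecretAccessKey in the output
aws iam create-access-key --user-name ybload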

Set the Yellowbrick credentials. YB_HOST can be obtained from Yellowbrick Manager -> Instance Management -> Host/Port. Provide the username and password that ybload will use to connect to the Yellowbrick instance. You might need to create a new user, which can be done through Yellowbrick Manager -> Instance Management -> Access Control -> User.

Specify the database and table where data will be loaded. Use learn_ybload for the database name and public.target for the table name.
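
With the example names used in this tutorial, the Yellowbrick section of deploy.sh would then look something like this (the host, user, and password values are placeholders):

bash
YB_HOST=yb.example.com      # placeholder; copy from Yellowbrick Manager
YB_USER=ybload_user         # placeholder
YB_PASSWORD=change-me       # placeholder
YB_DATABASE=learn_ybload
YB_TABLE=public.target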

Part 3: Run deploy.sh

Staying in the same folder, create your Lambda function:

bash
./deploy.sh

This script will perform the following:

  • Extract ybload and its dependencies from the ybtools tarball
  • Compile the Lambda function in the file YBLoadLambda.java
  • Package the function and ybload into a zip file and upload it to the ybload-zip bucket
  • Create an IAM role and permissions for the Lambda function
  • Create and configure the function
  • Set up a trigger to call the function when a file lands in the ybload-source bucket (a sketch of this step follows the list)
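
For reference, a minimal sketch of how such a trigger can be wired up manually with the AWS CLI (deploy.sh may do this differently; the statement ID and the REGION/ACCOUNT placeholders are assumptions):

bash
# Allow S3 to invoke the Lambda function
aws lambda add-permission \
    --function-name YBLoadFunction \
    --statement-id s3-invoke \
    --action lambda:InvokeFunction \
    --principal s3.amazonaws.com \
    --source-arn arn:aws:s3:::ybload-source

# Notify the function whenever an object is created in the source bucket
aws s3api put-bucket-notification-configuration \
    --bucket ybload-source \
    --notification-configuration '{
      "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT:function:YBLoadFunction",
        "Events": ["s3:ObjectCreated:*"]
      }]
    }'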

Check the output of the script for errors and correct any before proceeding.

Part 4: Configure the Yellowbrick Target Database and Table

After setting up the Lambda function, create a target database and table to receive the data loaded by the Lambda function. For this, use the Query Editor in Yellowbrick Manager, ybsql, or your preferred SQL editor.

sql
CREATE DATABASE learn_ybload;

In the new database, create the target table:

sql
CREATE TABLE public.target(id INT PRIMARY KEY, value VARCHAR(128));

Part 5: Load Some Data

Create a file named test.csv containing the following data:

1, "row 1"
2, "row 2"
3, "row 3"

Copy the file to your S3 source bucket to trigger a data load:

bash
aws s3 cp test.csv s3://ybload-source/

Part 6: Examine the Output of Lambda Function in CloudWatch

The Lambda function logs data loads in AWS CloudWatch. You can access the logs from the command line by using the following command:

bash
aws logs tail /aws/lambda/YBLoadFunction --follow

You should see a log output that culminates in the following:

[ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s)

This indicates that the data has been successfully loaded into Yellowbrick.
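
You can also confirm the load from the database side by querying the target table in Yellowbrick Manager or ybsql:

sql
SELECT * FROM target ORDER BY id;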

Part 7: Upserting Data

A powerful feature of ybload is its ability to perform upserts (that is, insert new records or update existing ones while avoiding duplicates). To demonstrate this, create a new CSV file test2.csv with the following entries:

1, "row 1"
2, "row 2 altered"
4, "row 4"

Loading this file should result in the following contents in the target table:

1, "row 1"
2, "row 2 altered"
3, "row 3"
4, "row 4"

Loading this file leaves the duplicated first record unchanged, updates the second record, and adds the fourth record. Before you can enable this upsert behavior, you must reconfigure the Lambda function.

Edit deploy.sh and set the following:

bash
YBLOAD_EXTRA_ARGS="--disable-trust --write-op upsert --key-field-names id"

This tells ybload to perform an upsert write operation, using the id field as the key so that duplicate records are not loaded. Run ./deploy.sh once more to update the Lambda function, then copy test2.csv into the ybload-source S3 bucket and observe the CloudWatch logs. Check that the load completes without error.
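
Putting those steps together on the command line:

bash
./deploy.sh
aws s3 cp test2.csv s3://ybload-source/
aws logs tail /aws/lambda/YBLoadFunction --follow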

In Yellowbrick Manager or ybsql, run:

sql
SELECT * FROM target ORDER BY id;

The contents of the table should be:

1, "row 1"
2, "row 2 altered"
3, "row 3"
4, "row 4"

You have successfully upserted data into the table.

Part 8: Understanding the Strengths and Limitations

Using AWS Lambda with ybload provides a convenient, scalable, and robust approach to loading data into Yellowbrick while avoiding duplicate records. There are limitations to this approach, however. A Lambda function has a maximum runtime of 15 minutes, so the Lambda function (and your data load) will be stopped if the loading process takes longer than 15 minutes, resulting in no data being loaded. As a rule of thumb, assume that data will load from S3 at a maximum rate of about 70 MB/s on a compute cluster with one small-v2 node. At that rate, the largest file the function can process within the Lambda runtime limit is around 60 GB (70 MB/s × 900 s ≈ 63 GB).

Note: In practice, limit the maximum size of a single file to around 50 GB for this configuration.

Even given this limitation, it pays to exploit the horizontal scalability of AWS Lambda by landing many smaller files in the S3 source bucket more often. A separate Lambda function is then invoked to process each file, greatly increasing load throughput and reducing the time it takes for data to land in Yellowbrick.
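
As a hypothetical illustration, a large CSV can be split into chunks before upload so that each chunk triggers its own invocation (GNU split and the file names here are assumptions):

bash
# Split big.csv into ~1M-row chunks named chunk_aa.csv, chunk_ab.csv, ...
split -l 1000000 --additional-suffix=.csv big.csv chunk_

# Upload the chunks; S3 invokes a separate Lambda function for each object
aws s3 cp . s3://ybload-source/ --recursive --exclude "*" --include "chunk_*.csv"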

Note: The number of load tasks that can execute at the same time is dictated by Yellowbrick's WLM rules and the network bandwidth available to the compute cluster that performs the load.

If Lambda load tasks get queued, there is a chance that the Lambda function will be halted before ybload can run or complete its load. Consider creating a dedicated cluster with a simple WLM profile to process loads from your Lambda function. This WLM profile should have a single pool with equal minimum and maximum concurrency. As a guide, for a cluster with a single small-v1 compute node, a minimum and maximum WLM concurrency of 8 will saturate the I/O to and from S3. Increase the size of the cluster to increase throughput.

The Lambda function you created can load not only CSV data but also compressed CSV and Parquet files. Drop compressed CSV files with a .csv.gz suffix, or Parquet files with a .parquet suffix, into the S3 source bucket to try it out.
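
For example, to try a compressed load with the test file from earlier (gzip's -k flag keeps the original file):

bash
gzip -k test.csv
aws s3 cp test.csv.gz s3://ybload-source/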