Calling ybload from AWS Lambda

This advanced tutorial explains how to create an AWS Lambda function that loads data into Yellowbrick every time a file is written to an S3 bucket. The Lambda function that you will create uses the ybload command-line tool to perform the data load. The related page AWS Lambda Data Loader explains how to load data with AWS Lambda by using the LOAD TABLE command instead. The ybload tool supports more options than the LOAD TABLE command (e.g., the ability to perform upserts), so it is a useful approach when dealing with more complex load requirements.

Prerequisites

  • An AWS account with permissions to create and deploy resources to AWS using the AWS CLI.
  • Credentials to connect to a running Yellowbrick instance by using the command line tools.
  • The ybtools tarball that contains ybload. The easiest way to obtain this is to download the "generic" package from Yellowbrick Manager.
  • Git installed, to be able to clone the tutorial code.
  • OpenJDK installed, to be able to compile the tutorial code.
  • Maven installed, to be able to build the Lambda function package.

Overview

In this tutorial you will accomplish the following tasks:

  1. Clone the tutorial sample code from GitHub.
  2. Configure the deploy.sh script in the sample code with your AWS and Yellowbrick credentials.
  3. Run deploy.sh to create and deploy the AWS Lambda function along with its dependencies.
  4. Create a Yellowbrick table to be able to receive data from the Lambda function.
  5. Load data from a test CSV file into Yellowbrick by copying the file to an S3 bucket.
  6. Observe the AWS CloudWatch logs to see the progress of the load.
  7. Upsert data into the same Yellowbrick table.
  8. Understand the strengths and limitations of this approach.

Part 1: Clone the Sample Code Repository

Clone the sample code in GitHub using:

bash
git clone git@github.com:markcusack/yellowbrick-learn.git

Part 2: Configure deploy.sh

Go to the yellowbrick-learn/learn/lambda-ybload folder. In a text editor, open deploy.sh, provide values for the unset variables below, and change the others as needed:

bash
YBTOOLS_TGZ_NAME="ybtools-7.1.2-66379.e33141f2.generic.noarch.tar.gz"
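# Derived from YBTOOLS_TGZ_NAME by substituting "ybload" for "ybtools"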
YBLOAD_TGZ_NAME="${YBTOOLS_TGZ_NAME/ybtools/ybload}"

# S3 buckets for the Lambda zip and for the CSV/Parquet files that are dropped in
ZIP_BUCKET=
LANDING_BUCKET=

# Change to suit your AWS account
ACCESS_KEY_ID=
SECRET_ACCESS_KEY=
SESSION_TOKEN=
REGION=
AWS_ACCOUNT=

# Yellowbrick database settings and target table
YB_HOST=
YB_USER=
YB_PASSWORD=
YB_DATABASE=
YB_TABLE=
YBLOAD_EXTRA_ARGS="--disable-trust"

# No need to change these unless there is a conflict
YBLOAD_FUNCTION="YBLoadFunction"
ROLE_NAME="lambda-s3-execution-role"
LAMBDA_JAVA_HOME="/var/lang"
JAR_NAME="ybload-lambda-1.0-SNAPSHOT.jar"
ZIP_NAME="ybload-lambda.zip"

On the first line, set or confirm the name of the ybtools tarball. This should match the name of the file you downloaded as part of the prerequisites.

You must create the following two buckets in AWS S3:

  • One that the Lambda function will load source data from.
  • One that will store the Java classes for the Lambda function you will create.

Create these buckets by using the AWS CLI or through the AWS Console, and set their names (e.g., ybload-source and ybload-zip) in deploy.sh.
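
For example, assuming the bucket names used in this tutorial, you can create them with the AWS CLI:

bash
aws s3 mb s3://ybload-source
aws s3 mb s3://ybload-zip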

Next, you must configure the AWS IDs and secrets that the Lambda function will use to read from S3. Create an AWS IAM user named ybload and obtain AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN values for that user. Apply these values in deploy.sh, and also set the AWS region and AWS account in which you will deploy the Lambda function.
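
If you prefer the CLI, a minimal sketch of creating the user and its credentials might look like the following (the managed policy shown is an assumption for brevity; your account may require a narrower policy, and session tokens, if needed, come from aws sts get-session-token):

bash
# Create the user and give it read access to S3
aws iam create-user --user-name ybload
aws iam attach-user-policy --user-name ybload \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Generate an access key pair; note the AccessKeyId and SecretAccessKey in the output
aws iam create-access-key --user-name ybload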

Set the Yellowbrick credentials. YB_HOST can be obtained from Yellowbrick Manager -> Instance Management -> Host/Port. Provide the username and password that ybload will use to connect to the Yellowbrick instance. You might need to create a new user, which can be done through Yellowbrick Manager -> Instance Management -> Access Control -> User.

Specify the database and table where data will be loaded. Use learn_ybload for the database name and public.target for the table name.
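
With the example names used in this tutorial, the Yellowbrick section of deploy.sh would then look something like this (the host, user, and password values are placeholders):

bash
YB_HOST=yb.example.com      # placeholder; copy from Yellowbrick Manager
YB_USER=ybload_user         # placeholder
YB_PASSWORD=change-me       # placeholder
YB_DATABASE=learn_ybload
YB_TABLE=public.target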

Part 3: Run deploy.sh

Staying in the same folder, create your Lambda function:

bash
./deploy.sh

This script will perform the following:

  • Extract ybload and its dependencies from the ybtools tarball
  • Compile the Lambda function in the file YBLoadLambda.java
  • Package the function and ybload into a zip file and upload it to the ybload-zip bucket
  • Create an IAM role and permissions for the Lambda function
  • Create and configure the function
  • Set up a trigger to call the function when a file lands in the ybload-source bucket (a sketch of this step follows the list)
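
For reference, a minimal sketch of how such a trigger can be wired up manually with the AWS CLI (deploy.sh may do this differently; the statement ID and the REGION/ACCOUNT placeholders are assumptions):

bash
# Allow S3 to invoke the Lambda function
aws lambda add-permission \
    --function-name YBLoadFunction \
    --statement-id s3-invoke \
    --action lambda:InvokeFunction \
    --principal s3.amazonaws.com \
    --source-arn arn:aws:s3:::ybload-source

# Notify the function whenever an object is created in the source bucket
aws s3api put-bucket-notification-configuration \
    --bucket ybload-source \
    --notification-configuration '{
      "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT:function:YBLoadFunction",
        "Events": ["s3:ObjectCreated:*"]
      }]
    }'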

Check the output of the script for errors and correct any before proceeding.

Part 4: Configure the Yellowbrick Target Database and Table

After setting up the Lambda function, create a target database and table to receive the data loaded by the Lambda function. For this, use the Query Editor in Yellowbrick Manager, ybsql, or your preferred SQL editor.

sql
CREATE DATABASE learn_ybload;

In the new database, create the target table:

sql
CREATE TABLE public.target(id INT PRIMARY KEY, value VARCHAR(128));

Part 5: Load Some Data

Create a file named test.csv containing the following data:

1, "row 1"
2, "row 2"
3, "row 3"

Copy the file to your S3 source bucket to trigger a data load:

bash
aws s3 cp test.csv s3://ybload-source/

Part 6: Examine the Output of Lambda Function in CloudWatch

The Lambda function logs data loads in AWS CloudWatch. You can access the logs from the command line by using the following command:

bash
aws logs tail /aws/lambda/YBLoadFunction --follow

You should see a log output that culminates in the following:

[ INFO] SUCCESSFUL BULK LOAD: Loaded 3 good rows from 1 source(s)

This indicates that the data has been successfully loaded into Yellowbrick.
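
You can also confirm the load from the database side by querying the target table in Yellowbrick Manager or ybsql:

sql
SELECT * FROM target ORDER BY id;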

Part 7: Upserting Data

A powerful feature of ybload is its ability to perform upserts (that is, insert new records or update existing ones while avoiding duplicates). To demonstrate this, create a new CSV file test2.csv with the following entries:

1, "row 1"
2, "row 2 altered"
4, "row 4"

Loading this file should result in the following contents in the target table:

1, "row 1"
2, "row 2 altered"
3, "row 3"
4, "row 4"

Loading this file leaves the duplicated first record unchanged, updates the second record, and adds the fourth record. Before you can enable this upsert behavior, you must reconfigure the Lambda function.

Edit deploy.sh and set the following:

bash
YBLOAD_EXTRA_ARGS="--disable-trust --write-op upsert --key-field-names id"

This tells ybload to perform an upsert write operation, using the id field as the key so that duplicate records are not loaded. Run ./deploy.sh once more to update the Lambda function, then copy test2.csv into the ybload-source S3 bucket and observe the CloudWatch logs. Check that the load completes without error.
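
Putting those steps together on the command line:

bash
./deploy.sh
aws s3 cp test2.csv s3://ybload-source/
aws logs tail /aws/lambda/YBLoadFunction --follow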

In Yellowbrick Manager or ybsql, run:

sql
SELECT * FROM target ORDER BY id;

The contents of the table should be:

1, "row 1"
2, "row 2 altered"
3, "row 3"
4, "row 4"

You have successfully upserted data into the table.

Part 8: Understanding the Strengths and Limitations

Using AWS Lambda with ybload provides a convenient, scalable, and robust approach to loading data into Yellowbrick while avoiding duplicate records. There are limitations to this approach, however. A Lambda function has a maximum runtime of 15 minutes, so the Lambda function (and your data load) will be stopped if the loading process takes longer than 15 minutes, resulting in no data being loaded. As a rule of thumb, assume that data will load from S3 at a maximum rate of about 70 MB/s on a compute cluster with one small-v2 node. At that rate, the largest file the function can process within the Lambda runtime limit is around 60 GB (70 MB/s × 900 s ≈ 63 GB).

Note: In practice, limit the maximum size of a single file to around 50 GB for this configuration.

Even given this limitation, it pays to exploit the horizontal scalability of AWS Lambda by landing many smaller files in the S3 source bucket more often. A separate Lambda function is then invoked to process each file, greatly increasing load throughput and reducing the time it takes for data to land in Yellowbrick.
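
As a hypothetical illustration, a large CSV can be split into chunks before upload so that each chunk triggers its own invocation (GNU split and the file names here are assumptions):

bash
# Split big.csv into ~1M-row chunks named chunk_aa.csv, chunk_ab.csv, ...
split -l 1000000 --additional-suffix=.csv big.csv chunk_

# Upload the chunks; S3 invokes a separate Lambda function for each object
aws s3 cp . s3://ybload-source/ --recursive --exclude "*" --include "chunk_*.csv"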

Note: The number of load tasks that can execute at the same time is dictated by Yellowbrick's WLM rules and the network bandwidth available to the compute cluster that performs the load.

If Lambda load tasks get queued, there is a chance that the Lambda function will be halted before ybload can run or complete its load. Consider creating a dedicated cluster with a simple WLM profile to process loads from your Lambda function. This WLM profile should have a single pool with equal minimum and maximum concurrency. As a guide, for a cluster with a single small-v1 compute node, a minimum and maximum WLM concurrency of 8 will saturate the I/O to and from S3. Increase the size of the cluster to increase throughput.

The Lambda function you created can load not only CSV data but also compressed CSV and Parquet files. Drop compressed CSV files with a .csv.gz suffix, or Parquet files with a .parquet suffix, into the S3 source bucket to try it out.
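
For example, to try a compressed load with the test file from earlier (gzip's -k flag keeps the original file):

bash
gzip -k test.csv
aws s3 cp test.csv.gz s3://ybload-source/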