
Check an S3 Bucket with an AWS Lambda Function and Load Data into Yellowbrick

Description

This tutorial demonstrates how to automate loading data into an existing table. You will use an AWS Lambda function to check an S3 bucket and load the required data file into a Yellowbrick database. The destination is derived from the S3 object key: the folder name supplies the database schema, and the file name without its extension supplies the table name, as illustrated in the sketch below.
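
For example, under this convention an object uploaded as premdb/match/match.csv is loaded into the table premdb.match. The snippet below is a minimal sketch of how such a mapping could be derived in Python; the function name and exact key layout are illustrative assumptions, not the contents of the tutorial's actual Lambda code.

python
# Illustrative sketch only: map an S3 object key to a destination schema and table.
# Assumes keys shaped like "<schema-folder>/.../<table-name>.csv", per the convention above.
import os
import urllib.parse

def destination_from_key(key):
    """Return (schema, table) for a key such as 'premdb/match/match.csv'."""
    key = urllib.parse.unquote_plus(key)                      # keys in S3 events arrive URL-encoded
    schema = key.split("/")[0]                                 # top-level folder -> schema
    table = os.path.splitext(os.path.basename(key))[0]         # file name minus extension -> table
    return schema, table

print(destination_from_key("premdb/match/match.csv"))          # ('premdb', 'match')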

Learning Objectives

By completing this tutorial, you will understand how to do the following:

  • Create an AWS Lambda function in Python with a file-based trigger.
  • Write Python code in an AWS Lambda function to authenticate and connect to a Yellowbrick database.
  • Run SQL statements that load data from an external S3 location into a Yellowbrick table.

Note: This automation pattern adapts readily to other use cases, such as automatically scaling database resources.

Prerequisites

To complete this tutorial, ensure you have the following prerequisites:

  • Access to a Yellowbrick database instance and an active cluster.
  • Permissions to create and manage files in an AWS S3 bucket.
  • Permissions to create an AWS Lambda function.
  • Network connectivity between your AWS Lambda function and your Yellowbrick Data Warehouse.

Step-by-Step Guide

1. Prepare Your Database

  1. Log in to Yellowbrick Manager.

  2. In the Query Editor, run the following SQL statements to create a schema and table:

    sql
    CREATE SCHEMA IF NOT EXISTS premdb;
    DROP TABLE IF EXISTS premdb.match;
    CREATE TABLE premdb.match (
        seasonid SMALLINT,
        matchday TIMESTAMP WITHOUT TIME ZONE,
        htid SMALLINT,
        atid SMALLINT,
        ftscore CHAR(3),
        htscore CHAR(3)
    )
    DISTRIBUTE ON (seasonid);

2. Set Up S3 Storage and Access Keys

a. Create an AWS Access Key

  1. Log in to your AWS account.
  2. In the search bar, type "IAM" and select the service.
  3. Click your username under Users.
  4. Under Security credentials, click Create access key.
  5. Select Third-party service.
  6. Optionally, add a description (e.g., "Access Key for Yellowbrick Lambda Loader").
  7. Click Create access key.

Note: AWS allows only two active access keys per IAM user. If you already have two, reuse an existing key instead of creating a new one.

b. Create an S3 Bucket

  1. In the AWS console, type "S3" in the search bar and select the service.
  2. Click Create bucket and name it "yb-lambda-loader-v1".
  3. Allow public access and acknowledge the prompt.

c. Create Folders in the S3 Bucket

  1. Inside the yb-lambda-loader-v1 bucket:
    • Click Create folder and name it "premdb".
    • Inside premdb, create another folder named "match".

Your bucket structure should be as follows:

yb-lambda-loader-v1/
└── premdb/
    └── match/

3. Configure External Storage in Yellowbrick

Run the following SQL statements, replacing placeholders with your details:

sql
CREATE EXTERNAL FORMAT IF NOT EXISTS premdbs3format_lambda_<username>
TYPE csv WITH (
    delimiter ',',
    skip_blank_lines 'true'
);

CREATE EXTERNAL STORAGE IF NOT EXISTS premdbs3storage_lambda_<username>
TYPE s3
ENDPOINT 'https://s3.<your-region>.amazonaws.com'
REGION '<your-region>'
IDENTITY '<access_key>'
CREDENTIAL '<secret_key>';

CREATE EXTERNAL LOCATION IF NOT EXISTS premdbs3location_lambda_<username>
PATH 'yb-lambda-loader-v1'
EXTERNAL STORAGE premdbs3storage_lambda_<username>;

Note: Replace <your-region> with your AWS region and ensure the PATH matches your bucket name.

4. Create and Configure Your AWS Lambda Function

a. Create a Lambda Function

  1. In the AWS console, type "Lambda" in the search bar.
  2. Click Create function.
  3. Enter a name (e.g., "yb-lambda-loader-v1").
  4. Select Python 3.8 as the runtime.
  5. Click Create function.

b. Upload Function Code

  1. Download the lambda_load_table.py file (an illustrative sketch of what such a handler contains follows this list).
  2. Zip it into lambda_load_table.zip.
  3. In the Lambda console, select Upload from -> .zip file.
  4. Upload your zipped file.
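
For orientation, the sketch below outlines what a handler of this kind typically does: read the object key from the S3 event, derive the target schema and table, connect to Yellowbrick using the environment variables you will add in step 4d, and issue a load statement. It is not the downloaded lambda_load_table.py; psycopg2 (which would have to be bundled into the .zip) and the LOAD TABLE clause order are assumptions to verify against your driver of choice and your Yellowbrick version.

python
# Minimal illustrative handler, NOT the tutorial's actual lambda_load_table.py.
# Assumptions: psycopg2 is packaged in the deployment .zip, and the LOAD TABLE
# syntax below matches your Yellowbrick release (check the SQL reference).
import os
import urllib.parse
import psycopg2  # Yellowbrick is PostgreSQL wire-compatible

def lambda_handler(event, context):
    # Bucket/key of the uploaded object, as delivered by the S3 PUT trigger.
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    schema = key.split("/")[0]                               # folder -> schema (here: premdb)
    table = os.path.splitext(os.path.basename(key))[0]       # file name -> table (here: match)

    conn = psycopg2.connect(
        host=os.environ["YBHOST"],
        dbname=os.environ["YBDATABASE"],
        user=os.environ["YBUSER"],
        password=os.environ["YBPASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Assumed bulk-load statement; confirm clause names and order in the
            # Yellowbrick documentation for your version.
            cur.execute(
                f"LOAD TABLE {schema}.{table} "
                f"FROM ('{key}') "
                f"EXTERNAL LOCATION {os.environ['YBEXTERNALLOCATION']} "
                f"EXTERNAL FORMAT {os.environ['YBEXTERNALFORMAT']}"
            )
        conn.commit()
        print(f"Loaded {key} into {schema}.{table}")
    finally:
        conn.close()
    return {"status": "ok", "object": key}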

c. Set Function Configuration

  1. Go to Configuration -> General configuration.
  2. Click Edit and set the timeout to 30 seconds.
  3. Click Save.

d. Add Environment Variables

In Configuration -> Environment variables, add the following keys and values:

Key                   Value
YBDATABASE            <your Yellowbrick database name>
YBUSER                <your Yellowbrick username>
YBPASSWORD            <your Yellowbrick password>
YBSCHEMA              premdb
YBHOST                <host URL from Yellowbrick Manager>
YBEXTERNALFORMAT      premdbs3format_lambda_<username>
YBEXTERNALLOCATION    premdbs3location_lambda_<username>
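
The handler reads these variables with os.environ at run time. A small sanity check like the sketch below (illustrative, not part of the downloaded code) makes a missing variable fail fast instead of surfacing later as a connection error:

python
# Illustrative sanity check: fail fast if a required environment variable is unset.
import os

REQUIRED = ["YBDATABASE", "YBUSER", "YBPASSWORD", "YBSCHEMA",
            "YBHOST", "YBEXTERNALFORMAT", "YBEXTERNALLOCATION"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise RuntimeError("Missing environment variables: " + ", ".join(missing))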

e. Set Permissions

  1. In Configuration -> Permissions, select the role created for your Lambda function.
  2. Click Add permissions -> Create inline policy.
  3. Select the JSON tab and paste the following policy document:
json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
  4. Click Next, name the policy (e.g., "s3-lambda-access"), and create it.

f. Add an S3 Trigger

  1. Navigate to Triggers and click Add trigger.
  2. Select S3 as the source.
  3. Choose your bucket (yb-lambda-loader-v1).
  4. Set Event type to PUT.
  5. In Prefix (optional), enter premdb/ (the trailing slash is part of the prefix).
  6. Check the acknowledgment box and click Add.
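
Once the trigger is in place, every object created under the premdb/ prefix invokes the function with an S3 event. The abridged payload below (shown as a Python dict with illustrative values) highlights the two fields a loader needs, the bucket name and the object key; note that real events URL-encode the key, so characters such as spaces must be decoded.

python
# Abridged S3 PUT event as received by the Lambda handler (illustrative values only;
# real events contain additional fields and URL-encode the object key).
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "yb-lambda-loader-v1"},
                "object": {"key": "premdb/match/match.csv"},
            },
        }
    ]
}

bucket = sample_event["Records"][0]["s3"]["bucket"]["name"]
key = sample_event["Records"][0]["s3"]["object"]["key"]
print(bucket, key)   # yb-lambda-loader-v1 premdb/match/match.csv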

5. Test the Lambda Function

a. Upload a Test File

  1. Download the sample match CSV file.
  2. In the S3 console, navigate to premdb/match/.
  3. Click Upload, select the test file, and complete the upload.

b. Verify Lambda Execution

  1. In the Lambda console, go to Monitor -> View CloudWatch logs.
  2. Confirm that the Lambda function triggers and loads the rows successfully.

c. Verify Data in Yellowbrick

  1. In Yellowbrick Manager, open the Query Editor.
  2. Run the following SQL statement:
sql
SELECT COUNT(*) FROM premdb.match;
  3. Ensure the row count matches that of the uploaded CSV file.

d. Load Additional Files

Upload new CSV files with the same structure to premdb/match/ to trigger automatic data loading.
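
You can also drive this step from code. The boto3 sketch below performs the same upload programmatically; the local file name is a placeholder, and under the naming convention described earlier the object's file name (minus extension) should remain match so the data lands in premdb.match.

python
# Upload another CSV into the watched prefix; the resulting S3 PUT event triggers the loader.
# The local Filename is a placeholder; keep the object's file name as match.csv so the
# folder/file naming convention still resolves to the premdb.match table.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="match.csv",                    # local CSV with the premdb.match column layout
    Bucket="yb-lambda-loader-v1",
    Key="premdb/match/match.csv",            # lands under the premdb/ trigger prefix
)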

Conclusion

You have successfully automated data loading in Yellowbrick by using an AWS Lambda function that monitors an S3 bucket. This setup can streamline data pipelines and improve operational efficiency.