
Check an S3 Bucket with an AWS Lambda Function and Load Data into Yellowbrick

Description

This tutorial demonstrates how to automate loading data into an existing table. You will use an AWS Lambda function to check an S3 bucket and load the required data file into a Yellowbrick database. The destination is derived from the S3 object key: the folder name supplies the database schema, and the file name without its extension supplies the table name, as illustrated in the sketch below.
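
For example, under this convention an object uploaded as premdb/match/match.csv is loaded into the table premdb.match. The snippet below is a minimal sketch of how such a mapping could be derived in Python; the function name and exact key layout are illustrative assumptions, not the contents of the tutorial's actual Lambda code.

python
# Illustrative sketch only: map an S3 object key to a destination schema and table.
# Assumes keys shaped like "<schema-folder>/.../<table-name>.csv", per the convention above.
import os
import urllib.parse

def destination_from_key(key):
    """Return (schema, table) for a key such as 'premdb/match/match.csv'."""
    key = urllib.parse.unquote_plus(key)                      # keys in S3 events arrive URL-encoded
    schema = key.split("/")[0]                                 # top-level folder -> schema
    table = os.path.splitext(os.path.basename(key))[0]         # file name minus extension -> table
    return schema, table

print(destination_from_key("premdb/match/match.csv"))          # ('premdb', 'match')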

Learning Objectives

By completing this tutorial, you will understand how to do the following:

  • Create an AWS Lambda function in Python with a file-based trigger.
  • Write Python code in an AWS Lambda function to authenticate and connect to a Yellowbrick database.
  • Run SQL statements that load data from an external S3 location into a Yellowbrick table.

Note: This automation pattern adapts readily to other use cases, such as automatically scaling database resources.

Prerequisites

To complete this tutorial, ensure you have the following prerequisites:

  • Access to a Yellowbrick database instance and an active cluster.
  • Permissions to create and manage files in an AWS S3 bucket.
  • Permissions to create an AWS Lambda function.
  • Network connectivity between your AWS Lambda function and your Yellowbrick Data Warehouse.

Step-by-Step Guide

1. Prepare Your Database

  1. Log in to Yellowbrick Manager.

  2. In the Query Editor, run the following SQL statements to create a schema and table:

    sql
    CREATE SCHEMA IF NOT EXISTS premdb;
    DROP TABLE IF EXISTS premdb.match;
    CREATE TABLE premdb.match (
        seasonid SMALLINT,
        matchday TIMESTAMP WITHOUT TIME ZONE,
        htid SMALLINT,
        atid SMALLINT,
        ftscore CHAR(3),
        htscore CHAR(3)
    )
    DISTRIBUTE ON (seasonid);

2. Set Up S3 Storage and Access Keys

a. Create an AWS Access Key

  1. Log in to your AWS account.
  2. In the search bar, type "IAM" and select the service.
  3. Click your username under Users.
  4. Under Security credentials, click Create access key.
  5. Select Third-party service.
  6. Optionally, add a description (e.g., "Access Key for Yellowbrick Lambda Loader").
  7. Click Create access key.

Note: AWS allows only two active access keys per IAM user. If you already have two, reuse an existing key instead of creating a new one.

b. Create an S3 Bucket

  1. In the AWS console, type "S3" in the search bar and select the service.
  2. Click Create bucket and name it "yb-lambda-loader-v1".
  3. Allow public access and acknowledge the prompt.

c. Create Folders in the S3 Bucket

  1. Inside the yb-lambda-loader-v1 bucket:
    • Click Create folder and name it "premdb".
    • Inside premdb, create another folder named "match".

Your bucket structure should be as follows:

yb-lambda-loader-v1/
└── premdb/
    └── match/

3. Configure External Storage in Yellowbrick

Run the following SQL statements, replacing placeholders with your details:

sql
CREATE EXTERNAL FORMAT IF NOT EXISTS premdbs3format_lambda_<username>
TYPE csv WITH (
    delimiter ',',
    skip_blank_lines 'true'
);

CREATE EXTERNAL STORAGE IF NOT EXISTS premdbs3storage_lambda_<username>
TYPE s3
ENDPOINT 'https://s3.<your-region>.amazonaws.com'
REGION '<your-region>'
IDENTITY '<access_key>'
CREDENTIAL '<secret_key>';

CREATE EXTERNAL LOCATION IF NOT EXISTS premdbs3location_lambda_<username>
PATH 'yb-lambda-loader-v1'
EXTERNAL STORAGE premdbs3storage_lambda_<username>;

Note: Replace <your-region> with your AWS region and ensure the PATH matches your bucket name.

4. Create and Configure Your AWS Lambda Function

a. Create a Lambda Function

  1. In the AWS console, type "Lambda" in the search bar.
  2. Click Create function.
  3. Enter a name (e.g., "yb-lambda-loader-v1").
  4. Select Python 3.8 as the runtime.
  5. Click Create function.

b. Upload Function Code

  1. Download the lambda_load_table.py file (an illustrative sketch of what such a handler contains follows this list).
  2. Zip it into lambda_load_table.zip.
  3. In the Lambda console, select Upload from -> .zip file.
  4. Upload your zipped file.
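
For orientation, the sketch below outlines what a handler of this kind typically does: read the object key from the S3 event, derive the target schema and table, connect to Yellowbrick using the environment variables you will add in step 4d, and issue a load statement. It is not the downloaded lambda_load_table.py; psycopg2 (which would have to be bundled into the .zip) and the LOAD TABLE clause order are assumptions to verify against your driver of choice and your Yellowbrick version.

python
# Minimal illustrative handler, NOT the tutorial's actual lambda_load_table.py.
# Assumptions: psycopg2 is packaged in the deployment .zip, and the LOAD TABLE
# syntax below matches your Yellowbrick release (check the SQL reference).
import os
import urllib.parse
import psycopg2  # Yellowbrick is PostgreSQL wire-compatible

def lambda_handler(event, context):
    # Bucket/key of the uploaded object, as delivered by the S3 PUT trigger.
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    schema = key.split("/")[0]                               # folder -> schema (here: premdb)
    table = os.path.splitext(os.path.basename(key))[0]       # file name -> table (here: match)

    conn = psycopg2.connect(
        host=os.environ["YBHOST"],
        dbname=os.environ["YBDATABASE"],
        user=os.environ["YBUSER"],
        password=os.environ["YBPASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Assumed bulk-load statement; confirm clause names and order in the
            # Yellowbrick documentation for your version.
            cur.execute(
                f"LOAD TABLE {schema}.{table} "
                f"FROM ('{key}') "
                f"EXTERNAL LOCATION {os.environ['YBEXTERNALLOCATION']} "
                f"EXTERNAL FORMAT {os.environ['YBEXTERNALFORMAT']}"
            )
        conn.commit()
        print(f"Loaded {key} into {schema}.{table}")
    finally:
        conn.close()
    return {"status": "ok", "object": key}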

c. Set Function Configuration

  1. Go to Configuration -> General configuration.
  2. Click Edit and set the timeout to 30 seconds.
  3. Click Save.

d. Add Environment Variables

In Configuration -> Environment variables, add the following keys and values:

Key                   Value
YBDATABASE            <your Yellowbrick database name>
YBUSER                <your Yellowbrick username>
YBPASSWORD            <your Yellowbrick password>
YBSCHEMA              premdb
YBHOST                <host URL from Yellowbrick Manager>
YBEXTERNALFORMAT      premdbs3format_lambda_<username>
YBEXTERNALLOCATION    premdbs3location_lambda_<username>
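
The handler reads these variables with os.environ at run time. A small sanity check like the sketch below (illustrative, not part of the downloaded code) makes a missing variable fail fast instead of surfacing later as a connection error:

python
# Illustrative sanity check: fail fast if a required environment variable is unset.
import os

REQUIRED = ["YBDATABASE", "YBUSER", "YBPASSWORD", "YBSCHEMA",
            "YBHOST", "YBEXTERNALFORMAT", "YBEXTERNALLOCATION"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise RuntimeError("Missing environment variables: " + ", ".join(missing))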

e. Set Permissions

  1. In Configuration -> Permissions, select the role created for your Lambda function.
  2. Click Add permissions -> Create inline policy.
  3. Select the JSON tab and paste the following policy document:
json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
  4. Click Next, name the policy (e.g., "s3-lambda-access"), and create it.

f. Add an S3 Trigger

  1. Navigate to Triggers and click Add trigger.
  2. Select S3 as the source.
  3. Choose your bucket (yb-lambda-loader-v1).
  4. Set Event type to PUT.
  5. In Prefix (optional), enter premdb/ (the trailing slash is part of the prefix).
  6. Check the acknowledgment box and click Add.
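
Once the trigger is in place, every object created under the premdb/ prefix invokes the function with an S3 event. The abridged payload below (shown as a Python dict with illustrative values) highlights the two fields a loader needs, the bucket name and the object key; note that real events URL-encode the key, so characters such as spaces must be decoded.

python
# Abridged S3 PUT event as received by the Lambda handler (illustrative values only;
# real events contain additional fields and URL-encode the object key).
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "yb-lambda-loader-v1"},
                "object": {"key": "premdb/match/match.csv"},
            },
        }
    ]
}

bucket = sample_event["Records"][0]["s3"]["bucket"]["name"]
key = sample_event["Records"][0]["s3"]["object"]["key"]
print(bucket, key)   # yb-lambda-loader-v1 premdb/match/match.csv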

5. Test the Lambda Function

a. Upload a Test File

  1. Download the sample match CSV file.
  2. In the S3 console, navigate to premdb/match/.
  3. Click Upload, select the test file, and complete the upload.

b. Verify Lambda Execution

  1. In the Lambda console, go to Monitor -> View CloudWatch logs.
  2. Confirm that the Lambda function triggers and loads the rows successfully.

c. Verify Data in Yellowbrick

  1. In Yellowbrick Manager, open the Query Editor.
  2. Run the following SQL statement:
sql
SELECT COUNT(*) FROM premdb.match;
  3. Ensure the row count matches that of the uploaded CSV file.

d. Load Additional Files

Upload new CSV files with the same structure to premdb/match/ to trigger automatic data loading.
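
You can also drive this step from code. The boto3 sketch below performs the same upload programmatically; the local file name is a placeholder, and under the naming convention described earlier the object's file name (minus extension) should remain match so the data lands in premdb.match.

python
# Upload another CSV into the watched prefix; the resulting S3 PUT event triggers the loader.
# The local Filename is a placeholder; keep the object's file name as match.csv so the
# folder/file naming convention still resolves to the premdb.match table.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="match.csv",                    # local CSV with the premdb.match column layout
    Bucket="yb-lambda-loader-v1",
    Key="premdb/match/match.csv",            # lands under the premdb/ trigger prefix
)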

Conclusion

You have successfully automated data loading in Yellowbrick by using an AWS Lambda function that monitors an S3 bucket. This setup can streamline data pipelines and improve operational efficiency.