Check an S3 Bucket with an AWS Lambda Function and Load Data into Yellowbrick
Description
This tutorial demonstrates how to automate loading data into an existing table. You will use an AWS Lambda function to check an S3 bucket for newly uploaded files and load them into a Yellowbrick database. The destination table is determined by the folder name (the database schema) and the file name without its extension (the table name).
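For example, an upload of premdb/match/match.csv targets the table premdb.match. A minimal Python sketch of that mapping (the helper name is hypothetical; the actual Lambda code is provided later in this tutorial):

```python
# Hypothetical helper: derive the destination table from an S3 object key.
# An upload of premdb/match/match.csv loads the table premdb.match.
def destination_from_key(key: str) -> str:
    schema = key.split("/")[0]               # folder name -> schema
    filename = key.split("/")[-1]
    table = filename.rsplit(".", 1)[0]       # file name minus extension -> table
    return f"{schema}.{table}"

print(destination_from_key("premdb/match/match.csv"))  # premdb.match
```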
Learning Objectives
By completing this tutorial, you will understand how to do the following:
- Create an AWS Lambda function in Python with a file-based trigger.
- Write Python code in an AWS Lambda function to authenticate and connect to a Yellowbrick database.
- Run SQL statements to load data from external S3 storage into a database table.
Note: This automation pattern adapts easily to other use cases, such as scaling database resources automatically.
Prerequisites
To complete this tutorial, ensure you have the following prerequisites:
- Access to a Yellowbrick database instance and an active cluster.
- Permissions to create and manage files in an AWS S3 bucket.
- Permissions to create an AWS Lambda function.
- Network connectivity between your AWS Lambda function and your Yellowbrick Data Warehouse.
Step-by-Step Guide
1. Prepare Your Database
Log in to Yellowbrick Manager.
In the Query Editor, run the following SQL statements to create a schema and table:
```sql
CREATE SCHEMA IF NOT EXISTS premdb;

DROP TABLE IF EXISTS premdb.match;

CREATE TABLE premdb.match (
    seasonid SMALLINT,
    matchday TIMESTAMP WITHOUT TIME ZONE,
    htid SMALLINT,
    atid SMALLINT,
    ftscore CHAR(3),
    htscore CHAR(3)
)
DISTRIBUTE ON (seasonid);
```
2. Set Up S3 Storage and Access Keys
a. Create an AWS Access Key
- Log in to your AWS account.
- In the search bar, type "IAM" and select the service.
- Click your username under Users.
- Under Security credentials, click Create access key.
- Select Third-party service.
- Optionally, add a description (e.g., "Access Key for Yellowbrick Lambda Loader").
- Click Create access key.
Note: AWS allows only two access keys per IAM user. Reuse an existing key if necessary.
b. Create an S3 Bucket
- In the AWS console, type "S3" in the search bar and select the service.
- Click Create bucket and name it yb-lambda-loader-v1.
- Allow public access and acknowledge the prompt.
c. Create Folders in the S3 Bucket
- Inside the yb-lambda-loader-v1 bucket:
- Click Create folder and name it premdb.
- Inside premdb, create another folder named match.
Your bucket structure should be as follows:
yb-lambda-loader-v1/
└── premdb/
└── match/
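If you prefer to script this setup, a minimal boto3 sketch might look like the following. This assumes your AWS credentials are already configured locally (e.g., via `aws configure`), and remember that S3 bucket names must be globally unique:

```python
# Hypothetical sketch: create the bucket and folder structure with boto3.
# Outside us-east-1, create_bucket also needs
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="yb-lambda-loader-v1")
# S3 "folders" are just zero-byte objects whose keys end in "/".
s3.put_object(Bucket="yb-lambda-loader-v1", Key="premdb/match/")
```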
3. Configure External Storage in Yellowbrick
Run the following SQL statements, replacing placeholders with your details:
```sql
CREATE EXTERNAL FORMAT IF NOT EXISTS premdbs3format_lambda_<username>
TYPE csv WITH (
    delimiter ',',
    skip_blank_lines 'true'
);

CREATE EXTERNAL STORAGE IF NOT EXISTS premdbs3storage_lambda_<username>
TYPE s3
ENDPOINT 'https://s3.<your-region>.amazonaws.com'
REGION '<your-region>'
IDENTITY '<access_key>'
CREDENTIAL '<secret_key>';

CREATE EXTERNAL LOCATION IF NOT EXISTS premdbs3location_lambda_<username>
PATH 'yb-lambda-loader-v1'
EXTERNAL STORAGE premdbs3storage_lambda_<username>;
```
Note: Replace <your-region> with your AWS region, replace <access_key> and <secret_key> with the access key you created in step 2, and ensure the PATH matches your bucket name.
4. Create and Configure Your AWS Lambda Function
a. Create a Lambda Function
- In the AWS console, type "Lambda" in the search bar.
- Click Create function.
- Enter a name (e.g., "yb-lambda-loader-v1").
- Select Python 3.8 as the runtime.
- Click Create function.
b. Upload Function Code
- Download the lambda_load_table.py file.
- Zip it into lambda_load_table.zip.
- In the Lambda console, select Upload from -> .zip file.
- Upload your zipped file.
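For reference, a minimal sketch of the pattern lambda_load_table.py implements might look like the following. This assumes a PostgreSQL driver such as psycopg2 is packaged inside the .zip (Yellowbrick is PostgreSQL-compatible), and the LOAD statement shown is illustrative; the downloaded file is the authoritative version, and the exact LOAD TABLE syntax can vary by Yellowbrick version:

```python
# Hypothetical sketch of the handler pattern; lambda_load_table.py is the
# authoritative implementation.
import os
import urllib.parse

import psycopg2  # must be packaged into the deployment .zip


def lambda_handler(event, context):
    # The S3 PUT event names the bucket and object key that fired the trigger.
    record = event["Records"][0]["s3"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Folder name -> schema, file name without its extension -> table.
    schema = key.split("/")[0]
    table = key.split("/")[-1].rsplit(".", 1)[0]

    conn = psycopg2.connect(
        host=os.environ["YBHOST"],
        dbname=os.environ["YBDATABASE"],
        user=os.environ["YBUSER"],
        password=os.environ["YBPASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Illustrative LOAD statement; consult your Yellowbrick version's
            # documentation for the exact LOAD TABLE syntax.
            cur.execute(
                f"LOAD TABLE {schema}.{table} "
                f"FROM ('{key}') "
                f"EXTERNAL LOCATION {os.environ['YBEXTERNALLOCATION']} "
                f"EXTERNAL FORMAT {os.environ['YBEXTERNALFORMAT']}"
            )
        conn.commit()
    finally:
        conn.close()
    return {"status": "loaded", "table": f"{schema}.{table}"}
```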
c. Set Function Configuration
- Go to Configuration -> General configuration.
- Click Edit and set the timeout to 30 seconds.
- Click Save.
d. Add Environment Variables
In Configuration -> Environment variables, add the following key-value pairs:
| Key | Value |
|---|---|
| YBDATABASE | <your Yellowbrick database name> |
| YBUSER | <your Yellowbrick username> |
| YBPASSWORD | <your Yellowbrick password> |
| YBSCHEMA | premdb |
| YBHOST | <host URL from Yellowbrick Manager> |
| YBEXTERNALFORMAT | premdbs3format_lambda_<username> |
| YBEXTERNALLOCATION | premdbs3location_lambda_<username> |
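These values reach the function as ordinary environment variables, which the handler reads with os.environ, for example:

```python
import os

# Values set under Configuration -> Environment variables are exposed
# to the running function as ordinary environment variables.
host = os.environ["YBHOST"]
fmt = os.environ["YBEXTERNALFORMAT"]
location = os.environ["YBEXTERNALLOCATION"]
```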
e. Set Permissions
- In Configuration -> Permissions, select the role created for your Lambda function.
- Click Add permissions -> Create inline policy.
- Select the JSON tab and paste the following JSON statements:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": "arn:aws:s3:::*"
    }
  ]
}
```
- Click Next, name the policy (e.g., "s3-lambda-access"), and create it.
f. Add an S3 Trigger
- Navigate to Triggers and click Add trigger.
- Select S3 as the source.
- Choose your bucket (yb-lambda-loader-v1).
- Set Event type to PUT.
- In Prefix (optional), enter "premdb/".
- Check the acknowledgment box and click Add.
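Before testing through S3, you can sanity-check the handler logic locally with a synthetic event. This hypothetical smoke test assumes the sketch above is saved as lambda_load_table.py, that the environment variables are set, and that your machine can reach the Yellowbrick host; the field names mirror the shape of a real S3 PUT event:

```python
# Hypothetical local smoke test for the handler sketch above.
from lambda_load_table import lambda_handler

fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "yb-lambda-loader-v1"},
                "object": {"key": "premdb/match/match.csv"}}}
    ]
}
lambda_handler(fake_event, None)  # should issue a LOAD for premdb.match
```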
5. Test the Lambda Function
a. Upload a Test File
- Download the sample match CSV file.
- In the S3 console, navigate to premdb/match/.
- Click Upload, select the test file, and complete the upload.
b. Verify Lambda Execution
- In the Lambda console, go to Monitor -> View CloudWatch logs.
- Confirm that the Lambda function triggers and loads the rows successfully.
c. Verify Data in Yellowbrick
- In Yellowbrick Manager, open the Query Editor.
- Run the following SQL statement:
```sql
SELECT COUNT(*) FROM premdb.match;
```
- Ensure the row count matches that of the uploaded CSV file.
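If you want to script this check, a small sketch might look like the following. It assumes psycopg2, the same connection settings as the Lambda function, and a local copy of the uploaded file; adjust the count if your CSV includes a header row:

```python
# Hypothetical sketch: compare the uploaded CSV's row count with the table's.
import csv

import psycopg2

with open("match.csv", newline="") as f:
    csv_rows = sum(1 for _ in csv.reader(f))

conn = psycopg2.connect(
    host="<host>", dbname="<database>", user="<user>", password="<password>"
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM premdb.match")
    table_rows = cur.fetchone()[0]
conn.close()

print(f"CSV rows: {csv_rows}, table rows: {table_rows}")
```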
d. Load Additional Files
Upload new CSV files with the same structure to premdb/match/ to trigger automatic data loading, as in the sketch below.
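A scripted upload fires the trigger exactly like a console upload. A hypothetical boto3 example (the file name is illustrative):

```python
import boto3

s3 = boto3.client("s3")
# Uploading under premdb/match/ fires the Lambda and loads premdb.match.
s3.upload_file("match_2024.csv", "yb-lambda-loader-v1", "premdb/match/match_2024.csv")
```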
Conclusion
You have successfully automated data loading in Yellowbrick by using an AWS Lambda function that monitors an S3 bucket. This setup can streamline data pipelines and improve operational efficiency.