Installing Airbyte
This guide details the steps required to install Airbyte in the same AWS EKS cluster that hosts Yellowbrick. While this guide is aimed at AWS, the same pattern can be followed with Azure and GCP by replacing the cloud provider-specific commands with suitable alternatives.
You will need the following packages installed locally:
- Helm command line client — helm
- AWS command line client — aws
- AWS EKS command line client — eksctl
- Kubernetes command line client — kubectl
- Command for substituting environment variables — envsubst. This isn't strictly necessary, but it helps for configuring the install for different environments.
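If you want to confirm that these tools are available before you start, a quick check like the following (assuming a Bash-compatible shell) will report anything that is missing:
```sh
# Report any required CLI tool that is not on the PATH
for tool in helm aws eksctl kubectl envsubst; do
  command -v "$tool" >/dev/null 2>&1 || echo "Missing: $tool"
done
```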
You will need to be logged into the AWS account hosting Yellowbrick to be able to complete this guide. You will also need access to the Yellowbrick EKS cluster and its associated Kube config file. You can obtain this by running:
```sh
aws eks update-kubeconfig --region REGION --name CLUSTER
```
Where REGION is the AWS region the EKS cluster was created in, and CLUSTER is the name of the EKS cluster running Yellowbrick.
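To confirm that kubectl can reach the cluster with the updated kubeconfig, you can list its nodes:
```sh
# A successful response confirms access to the EKS cluster
kubectl get nodes
```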
The process of installing Airbyte into the Yellowbrick EKS cluster involves:
- Creating a new EKS node group and scaling up a single m5.2xlarge EC2 node in the group
- Creating an RDS instance to store Airbyte state
- Creating an IAM role for Airbyte and assigning an S3 access policy to it. Airbyte will use the S3 bucket to store log files
- Deploying the Airbyte application on EKS using Helm
Step 1: Configuring the Environment
Create a file airbyte-env.sh and paste in the following:
```sh
#!/bin/bash
export CLUSTER_NAME="<CLUSTER NAME>"
export AIRBYTE_BUCKET_NAME="<AIRBYTE S3 BUCKET>"
export RDS_USER="airbyte"
export RDS_PASSWORD="<RDS PASSWORD>"
export NODE_TYPE="m5.2xlarge"
export NODEGROUP_NAME="yb-airbyte-nodegroup"
export REGION="<AWS REGION>"
export NAMESPACE="airbyte"
export RELEASE_NAME="airbyte"
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export VPC_ID=$(aws eks describe-cluster \
--name "$CLUSTER_NAME" \
--region "$REGION" \
--query "cluster.resourcesVpcConfig.vpcId" \
--output text)
export SUBNET_IDS=$(aws eks describe-cluster \
--name "$CLUSTER_NAME" \
--region "$REGION" \
--query "cluster.resourcesVpcConfig.subnetIds" \
--output text)
export SUBNET1=$(echo "$SUBNET_IDS" | awk '{print $1}')
export SUBNET2=$(echo "$SUBNET_IDS" | awk '{print $2}')
export AZ1=$(aws ec2 describe-subnets \
--subnet-ids "$SUBNET1" \
--query "Subnets[0].AvailabilityZone" \
--output text)
export AZ2=$(aws ec2 describe-subnets \
--subnet-ids "$SUBNET2" \
--query "Subnets[0].AvailabilityZone" \
--output text)
export SECURITY_GROUP=$(aws ec2 describe-security-groups \
--region "$REGION" \
--filters "Name=vpc-id,Values=$VPC_ID" "Name=group-name,Values=eks-cluster-sg-$CLUSTER_NAME*" \
--query "SecurityGroups[0].GroupId" \
--output text)
export OIDC_PROVIDER=$(aws eks describe-cluster \
--name "$CLUSTER_NAME" \
--region "$REGION" \
--query "cluster.identity.oidc.issuer" \
--output text | sed 's/^https:\/\///')
```
Replace the <> placeholders with your values and save the file. CLUSTER_NAME should be the name of the Kubernetes cluster that Yellowbrick is installed on. AIRBYTE_BUCKET_NAME is the name of the S3 bucket Airbyte will use to store logs. The installation steps below expect this bucket to exist, so create it now if it doesn't. Also set RDS_PASSWORD to the password for the RDS instance that Airbyte will use to store state. These environment variables are used in the following steps.
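If the log bucket does not exist yet, one way to create it (run this after sourcing airbyte-env.sh below, so the variables are set) is:
```sh
# Create the S3 bucket Airbyte will use for logs, if it doesn't already exist
aws s3 mb "s3://$AIRBYTE_BUCKET_NAME" --region "$REGION"
```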
Set up the environment by executing:
```sh
source ./airbyte-env.sh
```
Step 2: Create a New Node Group
Airbyte should not run on any of the EC2 nodes in the EKS cluster that run Yellowbrick pods. Instead, create a new node group in the EKS cluster for Airbyte's use. Create a YAML file nodegroup.yaml.template with the contents:
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${REGION}
vpc:
  id: ${VPC_ID}
  subnets:
    private:
      ${AZ1}:
        id: ${SUBNET1}
      ${AZ2}:
        id: ${SUBNET2}
  securityGroup: ${SECURITY_GROUP}
managedNodeGroups:
  - name: ${NODEGROUP_NAME}
    instanceType: ${NODE_TYPE}
    desiredCapacity: 0
    minSize: 0
    maxSize: 1
    labels:
      workload: "airbyte"
      cluster.yellowbrick.io/owned: "true"
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/${CLUSTER_NAME}: "owned"
      # Important node-template tags for scale-from-zero
      k8s.io/cluster-autoscaler/node-template/label/workload: "airbyte"
    amiFamily: AmazonLinux2
    privateNetworking: true
```
The node group will be set up to launch a maximum of one m5.2xlarge EC2 node to host Airbyte. Adjust the node type and number of nodes to suit your requirements.
Create and scale the node group by executing the commands:
```sh
envsubst < nodegroup.yaml.template > nodegroup.yaml
eksctl create nodegroup -f nodegroup.yaml
eksctl scale nodegroup --cluster $CLUSTER_NAME --region $REGION --name $NODEGROUP_NAME --nodes 1
```
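Once the new node has been provisioned and has joined the cluster (this can take a few minutes), it should show up with the workload=airbyte label defined in the node group:
```sh
# List only the nodes created for Airbyte
kubectl get nodes -l workload=airbyte
```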
Step 3: Create an RDS Instance
To create the AWS RDS instance needed to store Airbyte state, first create an RDS subnet group. You will use the two subnets that already exist in the AWS EKS cluster that hosts Yellowbrick to form the group. Run the following:
```sh
aws rds create-db-subnet-group \
--db-subnet-group-name airbyte-subnet-group \
--db-subnet-group-description "Subnet group for Airbyte RDS" \
--subnet-ids "$SUBNET1" "$SUBNET2" \
--tags Key=Project,Value=airbyte \
--region "$REGION"
```
Next, create the RDS instance and wait for it to be created:
```sh
aws rds create-db-instance \
--db-instance-identifier airbyte-db \
--db-instance-class db.t3.micro \
--engine postgres \
--allocated-storage 20 \
--db-name airbyte \
--master-username "$RDS_USER" \
--master-user-password "$RDS_PASSWORD" \
--vpc-security-group-ids "$SECURITY_GROUP" \
--db-subnet-group-name airbyte-subnet-group \
--backup-retention-period 7 \
--no-publicly-accessible \
--region "$REGION" \
--no-cli-pager
aws rds wait db-instance-available \
--db-instance-identifier airbyte-db \
--region "$REGION"
```
Capture the RDS host name in an environment variable for use in later steps:
```sh
export RDS_HOST=$(aws rds describe-db-instances \
--db-instance-identifier airbyte-db \
--query "DBInstances[0].Endpoint.Address" \
--output text)
```
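Before continuing, you may want to check that the database is reachable from inside the cluster. One way to do this (assuming the postgres image can be pulled by the cluster nodes) is to run a short-lived pod with pg_isready:
```sh
# Launch a temporary pod that checks connectivity to the RDS endpoint, then is removed
kubectl run pg-check --rm -i --restart=Never --image=postgres:16 -- \
  pg_isready -h "$RDS_HOST" -p 5432
```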
Step 4: Configure S3 Access for Airbyte
Create an IAM role and access policy that allow Airbyte to read from and write to the S3 bucket specified earlier. Create the AWS policy file airbyte-s3-policy.json.template:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAirbyteS3",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::$AIRBYTE_BUCKET_NAME",
        "arn:aws:s3:::$AIRBYTE_BUCKET_NAME/*"
      ]
    }
  ]
}
```
Create and apply the policy:
```sh
envsubst < ./airbyte-s3-policy.json.template > /tmp/airbyte-s3-policy.json
aws iam create-policy \
--policy-name AirbyteS3Access \
--policy-document file:///tmp/airbyte-s3-policy.json \
--region "$REGION"
export AIRBYTE_POLICY_ARN=$(aws iam list-policies --query "Policies[?PolicyName=='AirbyteS3Access'].Arn" --output text)
```
Next, create an IAM role for Airbyte that the S3 policy can be attached to. Save the following trust policy into a file trust-policy.json.template:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/$OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${NAMESPACE}:airbyte-admin"
        }
      }
    }
  ]
}
```
This will allow the Airbyte pods (via an Airbyte service account) to assume the associated IAM role by verifying their identity through the EKS cluster’s OIDC identity provider.
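This requires that the EKS cluster has an IAM OIDC identity provider associated with it. If one is not already associated, it can be created with eksctl, for example:
```sh
# Associate an IAM OIDC provider with the cluster if one is not already associated
eksctl utils associate-iam-oidc-provider --cluster "$CLUSTER_NAME" --region "$REGION" --approve
```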
Create the role by executing:
```sh
envsubst < ./trust-policy.json.template > /tmp/trust-policy.json
aws iam create-role \
--role-name AirbyteAdminS3AccessRole \
--assume-role-policy-document file:///tmp/trust-policy.json
```
Attach the S3 access policy to the role:
```sh
aws iam attach-role-policy \
--role-name AirbyteAdminS3AccessRole \
--policy-arn "$AIRBYTE_POLICY_ARN"
```
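To confirm the policy is now attached to the role, you can list its attached policies:
```sh
# The output should include the AirbyteS3Access policy ARN
aws iam list-attached-role-policies --role-name AirbyteAdminS3AccessRole
```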
Step 5: Install Airbyte using Helm
Create the Kubernetes namespace in which Airbyte will be deployed and a secret holding the RDS password. First, save the following into a file called secrets.yaml.template:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: airbyte-config-secrets
type: Opaque
stringData:
  database-password: ${RDS_PASSWORD}
```
Create the namespace and save the RDS password secret using:
```sh
kubectl create namespace "$NAMESPACE"
envsubst < secrets.yaml.template > secrets.yaml
kubectl apply -f secrets.yaml -n "$NAMESPACE"
```
Create the Airbyte service account and associate it with the IAM role created earlier:
```sh
kubectl create serviceaccount airbyte-admin -n "$NAMESPACE"
kubectl annotate serviceaccount airbyte-admin \
-n "$NAMESPACE" \
eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/AirbyteAdminS3AccessRole \
--overwrite
```
Create the Kubernetes role airbyte-admin-role and a role binding that binds the role to the service account airbyte-admin within the Airbyte namespace. Create a file airbyte-role.yaml.template and insert the following YAML:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-admin-role
  namespace: $NAMESPACE
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-rolebinding
  namespace: $NAMESPACE
subjects:
  - kind: ServiceAccount
    name: airbyte-admin
    namespace: $NAMESPACE
roleRef:
  kind: Role
  name: airbyte-admin-role
  apiGroup: rbac.authorization.k8s.io
```
Run the following to create the role:
```sh
envsubst < "./airbyte-role.yaml.template" | kubectl apply -f -
```
Finally, install Airbyte by first creating a values.yaml.template file with the content:
```yaml
serviceAccount:
  create: false
  name: airbyte-admin
postgresql:
  enabled: false
global:
  database:
    type: external
    secretName: "airbyte-config-secrets"
    host: ${RDS_HOST}
    port: 5432
    database: "airbyte"
    user: ${RDS_USER}
    passwordSecretKey: "database-password"
  storage:
    type: "S3"
    secretName: "airbyte-config-secrets"
    bucket:
      log: ${AIRBYTE_BUCKET_NAME}
      state: ${AIRBYTE_BUCKET_NAME}
      workloadOutput: ${AIRBYTE_BUCKET_NAME}
    s3:
      region: ${REGION}
      authenticationType: instanceProfile
```
Then execute the following commands:
```sh
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
envsubst < ./values.yaml.template > ./values.yaml
helm install airbyte airbyte/airbyte --namespace "$NAMESPACE" --values ./values.yaml
```
After several minutes, the Airbyte service will be available. To access the Airbyte application, set up port forwarding to the service:
```sh
kubectl -n "$NAMESPACE" port-forward deployment/airbyte-webapp 8080:8080
```
Access the Airbyte console using the URL:
http://localhost:8080/
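If the console is not reachable straight away, check that the Airbyte pods have started successfully:
```sh
# All Airbyte pods should eventually reach the Running or Completed state
kubectl get pods -n "$NAMESPACE"
```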
Step 6: Uninstalling Airbyte and Supporting Resources
Source the environment file:
```sh
source ./airbyte-env.sh
```
Uninstall Airbyte, then delete the Kubernetes role binding, role, service account, and namespace:
```sh
helm uninstall "$RELEASE_NAME" -n "$NAMESPACE"
kubectl delete rolebinding airbyte-admin-rolebinding -n "$NAMESPACE" --ignore-not-found
kubectl delete role airbyte-admin-role -n "$NAMESPACE" --ignore-not-found
kubectl delete serviceaccount airbyte-admin -n "$NAMESPACE" --ignore-not-found
kubectl delete namespace "$NAMESPACE"
```
Detach the attached policies, then delete the IAM role and the S3 access policy:
```sh
if aws iam get-role --role-name AirbyteAdminS3AccessRole >/dev/null 2>&1; then
ATTACHED_POLICIES=$(aws iam list-attached-role-policies \
--role-name AirbyteAdminS3AccessRole \
--query "AttachedPolicies[].PolicyArn" \
--output text)
for POLICY_ARN in $ATTACHED_POLICIES; do
aws iam detach-role-policy --role-name AirbyteAdminS3AccessRole --policy-arn "$POLICY_ARN"
done
fi
aws iam delete-role \
--role-name AirbyteAdminS3AccessRole
aws iam delete-policy \
--policy-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/AirbyteS3Access"
```
Delete the RDS instance and subnet group:
```sh
aws rds delete-db-instance \
--db-instance-identifier airbyte-db \
--skip-final-snapshot \
--region "$REGION" \
--no-cli-pager
aws rds wait db-instance-deleted \
--db-instance-identifier airbyte-db \
--region "$REGION"
aws rds delete-db-subnet-group \
--db-subnet-group-name airbyte-subnet-group \
--region "$REGION"
```
Finally, delete the EKS node group:
```sh
eksctl delete nodegroup \
--cluster "$CLUSTER_NAME" \
--name "$NODEGROUP_NAME" \
--region "$REGION"
```
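The S3 log bucket is left in place by these steps. If it is no longer needed, it can be removed as well, for example:
```sh
# Deletes the bucket and all objects in it; only do this if the logs are no longer needed
aws s3 rb "s3://$AIRBYTE_BUCKET_NAME" --force
```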