Installing Airbyte

This guide details the steps required to install Airbyte in the same AWS EKS cluster that hosts Yellowbrick. While this guide is aimed at AWS, the same pattern can be followed with Azure and GCP, substituting the cloud provider-specific commands with suitable alternatives.

You will need the following packages installed locally:

  • Helm command line client — helm
  • AWS command line client — aws
  • AWS EKS command line client — eksctl
  • Kubernetes command line client — kubectl
  • Command for substituting environment variables — envsubst. This isn't strictly necessary, but it makes it easier to configure the installation for different environments.

You must be logged into the AWS account hosting Yellowbrick to complete this guide. You will also need access to the Yellowbrick EKS cluster and its kubeconfig file, which you can obtain by running:

sh
aws eks update-kubeconfig --region REGION --name CLUSTER

Where REGION is the AWS region the EKS cluster was created in, and CLUSTER is the name of the EKS cluster running Yellowbrick.
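
To confirm that kubectl is now pointing at the right cluster, you can run a quick check (this assumes your IAM identity has been granted access to the cluster):

sh
kubectl config current-context

kubectl get nodes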

The process of installing Airbyte into the Yellowbrick EKS cluster involves:

  • Creating a new EKS node group and scaling up a single m5.2xlarge EC2 node in the group
  • Creating an RDS instance to store Airbyte state
  • Creating an IAM role for Airbyte and assigning an S3 access policy to it. Airbyte will use the S3 bucket to store log files
  • Deploying the Airbyte application on EKS using Helm

Step 1: Configuring the Environment

Create a file airbyte-env.sh and paste in the following:

sh
#!/bin/bash

export CLUSTER_NAME="<CLUSTER NAME>"
export AIRBYTE_BUCKET_NAME="<AIRBYTE S3 BUCKET>"
export RDS_USER="airbyte"
export RDS_PASSWORD="<RDS PASSWORD>"
export NODE_TYPE="m5.2xlarge"
export NODEGROUP_NAME="yb-airbyte-nodegroup"
export REGION="<AWS REGION>"
export NAMESPACE="airbyte"
export RELEASE_NAME="airbyte"

export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

export VPC_ID=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --region "$REGION" \
  --query "cluster.resourcesVpcConfig.vpcId" \
  --output text)

export SUBNET_IDS=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --region "$REGION" \
  --query "cluster.resourcesVpcConfig.subnetIds" \
  --output text)

export SUBNET1=$(echo "$SUBNET_IDS" | awk '{print $1}')
export SUBNET2=$(echo "$SUBNET_IDS" | awk '{print $2}')

export AZ1=$(aws ec2 describe-subnets \
  --subnet-ids "$SUBNET1" \
  --region "$REGION" \
  --query "Subnets[0].AvailabilityZone" \
  --output text)

export AZ2=$(aws ec2 describe-subnets \
  --subnet-ids "$SUBNET2" \
  --region "$REGION" \
  --query "Subnets[0].AvailabilityZone" \
  --output text)

export SECURITY_GROUP=$(aws ec2 describe-security-groups \
  --region "$REGION" \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=group-name,Values=eks-cluster-sg-$CLUSTER_NAME*" \
  --query "SecurityGroups[0].GroupId" \
  --output text)

export OIDC_PROVIDER=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --region "$REGION" \
  --query "cluster.identity.oidc.issuer" \
  --output text | sed 's/^https:\/\///')

Replace the <> placeholders with values for your environment and save the file. CLUSTER_NAME is the name of the Kubernetes cluster that Yellowbrick is installed on. AIRBYTE_BUCKET_NAME is the name of the S3 bucket Airbyte will use to store logs; the installation steps below expect this bucket to exist, so create it now if it doesn't (a sketch for creating it follows at the end of this step). Also set RDS_PASSWORD, the password for the RDS instance that Airbyte will use to store state. These environment variables are used throughout the remaining steps.

Set up the environment by executing:

sh
source ./airbyte-env.sh
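
If the S3 bucket named in AIRBYTE_BUCKET_NAME does not exist yet, you can create it now that the environment variables are loaded; a minimal sketch:

sh
aws s3 mb "s3://$AIRBYTE_BUCKET_NAME" --region "$REGION"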

Step 2: Create a New Node Group

Airbyte should not run on any of the EC2 nodes in the EKS cluster that run Yellowbrick pods. Instead, create a new node group in the EKS cluster for Airbyte's use. Create a YAML file nodegroup.yaml.template with the contents:

yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ${REGION}

vpc:
  id: ${VPC_ID}
  subnets:
    private:
      ${AZ1}:
        id: ${SUBNET1}
      ${AZ2}:
        id: ${SUBNET2}
  securityGroup: ${SECURITY_GROUP}

managedNodeGroups:
  - name: ${NODEGROUP_NAME}
    instanceType: ${NODE_TYPE}
    desiredCapacity: 0
    minSize: 0
    maxSize: 1
    labels:
      workload: "airbyte"
      cluster.yellowbrick.io/owned: "true"
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/${CLUSTER_NAME}: "owned"

      # Important node-template tags for scale-from-zero
      k8s.io/cluster-autoscaler/node-template/label/workload: "airbyte"
    amiFamily: AmazonLinux2
    privateNetworking: true

The node group is configured to launch at most one m5.2xlarge EC2 node to host Airbyte. Adjust the node type and the maximum number of nodes to suit your requirements.

Create and scale the node group by executing the commands:

sh
envsubst < nodegroup.yaml.template > nodegroup.yaml

eksctl create nodegroup -f nodegroup.yaml

eksctl scale nodegroup --cluster $CLUSTER_NAME --region $REGION --name $NODEGROUP_NAME --nodes 1
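
Once the new node has joined the cluster, it should carry the workload label applied above; an optional check:

sh
kubectl get nodes -l workload=airbyte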

Step 3: Create an RDS Instance

To create the AWS RDS instance needed to store Airbyte state, first create an RDS subnet group. The group uses the two subnets that already exist in the VPC of the EKS cluster hosting Yellowbrick. Run the following:

sh
aws rds create-db-subnet-group \
  --db-subnet-group-name airbyte-subnet-group \
  --db-subnet-group-description "Subnet group for Airbyte RDS" \
  --subnet-ids "$SUBNET1" "$SUBNET2" \
  --tags Key=Project,Value=airbyte \
  --region "$REGION"

Next, create the RDS instance and wait for it to be created:

sh
aws rds create-db-instance \
  --db-instance-identifier airbyte-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --allocated-storage 20 \
  --db-name airbyte \
  --master-username "$RDS_USER" \
  --master-user-password "$RDS_PASSWORD" \
  --vpc-security-group-ids "$SECURITY_GROUP" \
  --db-subnet-group-name airbyte-subnet-group \
  --backup-retention-period 7 \
  --no-publicly-accessible \
  --region "$REGION" \
  --no-cli-pager

aws rds wait db-instance-available \
  --db-instance-identifier airbyte-db \
  --region "$REGION"

Capture the RDS host name in an environment variable for use in later steps:

sh
export RDS_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier airbyte-db \
  --region "$REGION" \
  --query "DBInstances[0].Endpoint.Address" \
  --output text)
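
Optionally, verify that workloads in the cluster can reach the new database. The sketch below assumes the shared cluster security group allows PostgreSQL traffic (port 5432) between its members, and it uses a throwaway client pod (the pod name pg-check and the postgres:16 image are arbitrary choices):

sh
kubectl run pg-check --rm -it --restart=Never --image=postgres:16 \
  --env=PGPASSWORD="$RDS_PASSWORD" -- \
  psql -h "$RDS_HOST" -U "$RDS_USER" -d airbyte -c 'select 1'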

Step 4: Configure S3 Access for Airbyte

Create an IAM role and an access policy that allow Airbyte to read from and write to the S3 bucket specified earlier. Create the policy template file airbyte-s3-policy.json.template:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAirbyteS3",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::$AIRBYTE_BUCKET_NAME",
        "arn:aws:s3:::$AIRBYTE_BUCKET_NAME/*"
      ]
    }
  ]
}

Create and apply the policy:

sh
envsubst < ./airbyte-s3-policy.json.template > /tmp/airbyte-s3-policy.json

aws iam create-policy \
  --policy-name AirbyteS3Access \
  --policy-document file:///tmp/airbyte-s3-policy.json \
  --region "$REGION"

export AIRBYTE_POLICY_ARN=$(aws iam list-policies --query "Policies[?PolicyName=='AirbyteS3Access'].Arn" --output text)

Next, create an IAM role for Airbyte that the S3 access policy can be attached to. Save the following trust policy into a file trust-policy.json.template:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/$OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${NAMESPACE}:airbyte-admin"
        }
      }
    }
  ]
}

This trust policy allows the Airbyte pods (via the airbyte-admin service account, created later) to assume the IAM role, with their identity verified through the EKS cluster's OIDC identity provider.
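
This only works if the cluster's OIDC issuer is registered as an IAM identity provider. If you are unsure whether that is the case, the following sketch checks for the provider and associates it with eksctl if it is missing:

sh
aws iam list-open-id-connect-providers --output text | grep -q "${OIDC_PROVIDER##*/}" || \
  eksctl utils associate-iam-oidc-provider --cluster "$CLUSTER_NAME" --region "$REGION" --approve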

Create the role by executing:

sh
envsubst < ./trust-policy.json.template > /tmp/trust-policy.json

aws iam create-role \
  --role-name AirbyteAdminS3AccessRole \
  --assume-role-policy-document file:///tmp/trust-policy.json

Attach the S3 access policy to the role:

sh
aws iam attach-role-policy \
  --role-name AirbyteAdminS3AccessRole \
  --policy-arn "$AIRBYTE_POLICY_ARN"
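
As an optional check, confirm that the policy is now attached to the role:

sh
aws iam list-attached-role-policies --role-name AirbyteAdminS3AccessRole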

Step 5: Install Airbyte using Helm

Create the Kubernetes namespace in which Airbyte will be deployed, along with a secret that holds the RDS password. First, save the following into a file called secrets.yaml.template:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: airbyte-config-secrets
type: Opaque
stringData:
  database-password: ${RDS_PASSWORD}

Create the namespace and save the RDS password secret using:

sh
kubectl create namespace "$NAMESPACE"

envsubst < secrets.yaml.template > secrets.yaml

kubectl apply -f secrets.yaml -n "$NAMESPACE"

Create the Airbyte service account and associate it with the IAM role created earlier:

sh
kubectl create serviceaccount airbyte-admin -n "$NAMESPACE"

kubectl annotate serviceaccount airbyte-admin \
  -n "$NAMESPACE" \
  eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/AirbyteAdminS3AccessRole \
  --overwrite

Create the Kubernetes role airbyte-admin-role and a role binding that binds the role to the service account airbyte-admin within the Airbyte namespace. Create a file airbyte-role.yaml.template and insert the following YAML:

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-admin-role
  namespace: $NAMESPACE
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-rolebinding
  namespace: $NAMESPACE
subjects:
  - kind: ServiceAccount
    name: airbyte-admin
    namespace: $NAMESPACE
roleRef:
  kind: Role
  name: airbyte-admin-role
  apiGroup: rbac.authorization.k8s.io

Run the following to create the role:

sh
envsubst < "./airbyte-role.yaml.template" | kubectl apply -f -
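
You can verify the binding by asking the API server what the service account is allowed to do, for example:

sh
kubectl auth can-i create pods -n "$NAMESPACE" \
  --as="system:serviceaccount:${NAMESPACE}:airbyte-admin"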

Finally, install Airbyte. Start by creating a values.yaml.template file with the following content:

yaml
serviceAccount:
  create: false
  name: airbyte-admin

postgresql:
  enabled: false

global:
  database:
    type: external
    secretName: "airbyte-config-secrets"
    host: ${RDS_HOST}
    port: 5432
    database: "airbyte"
    user: ${RDS_USER}
    passwordSecretKey: "database-password"

  storage:
    type: "S3"
    secretName: "airbyte-config-secrets"
    bucket:
      log: ${AIRBYTE_BUCKET_NAME}
      state: ${AIRBYTE_BUCKET_NAME}
      workloadOutput: ${AIRBYTE_BUCKET_NAME}
    s3:
      region: ${REGION}
      authenticationType: instanceProfile

Then execute the following commands:

sh
helm repo add airbyte https://airbytehq.github.io/helm-charts

helm repo update

envsubst < ./values.yaml.template > ./values.yaml

helm install airbyte airbyte/airbyte --namespace "$NAMESPACE" --values ./values.yaml
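
You can watch the pods come up while the chart installs:

sh
kubectl -n "$NAMESPACE" get pods --watch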

After several minutes, the Airbyte service will be available. To access the Airbyte application, set up port forwarding to the service:

sh
kubectl -n "$NAMESPACE" port-forward deployment/airbyte-webapp 8080:8080

Access the Airbyte console at the following URL:

http://localhost:8080/

Step 6: Uninstalling Airbyte and Supporting Resources

Source the environment file:

sh
source ./airbyte-env.sh

Uninstall the Airbyte release, then delete the Kubernetes role binding, role, service account, and namespace:

sh
helm uninstall "$RELEASE_NAME" -n "$NAMESPACE"

kubectl delete rolebinding airbyte-admin-rolebinding -n "$NAMESPACE" --ignore-not-found

kubectl delete role airbyte-admin-role -n "$NAMESPACE" --ignore-not-found

kubectl delete serviceaccount airbyte-admin -n "$NAMESPACE" --ignore-not-found

kubectl delete namespace "$NAMESPACE"

Detach any attached IAM policies from the role, then delete the role and the AirbyteS3Access policy:

sh
if aws iam get-role --role-name AirbyteAdminS3AccessRole >/dev/null 2>&1; then
  ATTACHED_POLICIES=$(aws iam list-attached-role-policies \
    --role-name AirbyteAdminS3AccessRole \
    --query "AttachedPolicies[].PolicyArn" \
    --output text)
  for POLICY_ARN in $ATTACHED_POLICIES; do
    aws iam detach-role-policy --role-name AirbyteAdminS3AccessRole --policy-arn "$POLICY_ARN"
  done
fi

aws iam delete-role \
  --role-name AirbyteAdminS3AccessRole

aws iam delete-policy \
  --policy-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/AirbyteS3Access"

Delete the RDS instance and subnet group:

sh
aws rds delete-db-instance \
  --db-instance-identifier airbyte-db \
  --skip-final-snapshot \
  --region "$REGION" \
  --no-cli-pager

aws rds wait db-instance-deleted \
  --db-instance-identifier airbyte-db \
  --region "$REGION"

aws rds delete-db-subnet-group \
  --db-subnet-group-name airbyte-subnet-group \
  --region "$REGION"

Finally, delete the EKS node group:

sh
eksctl delete nodegroup \
  --cluster "$CLUSTER_NAME" \
  --name "$NODEGROUP_NAME" \
  --region "$REGION"
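
If nothing else uses the S3 log bucket, you can optionally remove it as well; note that this permanently deletes all objects in the bucket:

sh
aws s3 rb "s3://$AIRBYTE_BUCKET_NAME" --force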