Self-Managed: yb-monitoring

Install yb-monitoring with Helm. This deploys the workloads that enable monitoring for Yellowbrick Data Warehouse; it typically contains Loki, Grafana, Prometheus, and Fluent Bit.

INFO

If you are using Yellowbrick-created storage classes, you must also install yb-storageclass. See Helm: yb-storageclass.

The Yellowbrick Operator also modifies the yb-monitoring Helm chart for certain features, such as changing the node group tier and enabling log retention on S3 buckets. If you manage this chart with automated deployments, take this behaviour into account by reusing the deployed values.

When using the commands or values outlined here, make the following substitutions:

| Value | Description |
| --- | --- |
| {image-repo} | The container image repository pushed by the Deployer |
| {namespace} | The Kubernetes namespace into which you want to install |
| {role-arn} | When on AWS, the IAM role ARN of the Fluent Bit service account |
| {version} | The chart version of loki-stack |
| {observability-storage} | The name of the storage location for observability: on AWS, an S3 bucket name; on Azure, a Storage Account name. Must be the same value used when deploying the yb-operator chart. |
| {oidc-provider-arn} | When on AWS, the OpenID Connect provider ARN |
| {oidc-provider} | When on AWS, the OpenID Connect provider |
| {partition} | When on AWS, the partition: aws or aws-gov |
| {storageclass} | The general-purpose storage class name, e.g. AWS: gp3, Azure: standard, GCP: pd-balanced |
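As a convenience, the substitutions can be kept in shell variables so the commands in this guide can be rendered without hand-editing. This is only a sketch; every value below is a hypothetical example, not a real endpoint:

```shell
# Hypothetical example values -- replace with your environment's details
IMAGE_REPO="123456789012.dkr.ecr.us-east-1.amazonaws.com/yellowbrick"
NAMESPACE="yb-monitoring"
VERSION="2.10.2"

# Render the install command with the placeholders filled in
CMD="helm install loki oci://${IMAGE_REPO}/loki-stack -n ${NAMESPACE} -f values.yaml --version ${VERSION}"
echo "${CMD}"
```

Rendering the command with echo first is a cheap way to confirm the substitutions before running it against the cluster.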

Helm Chart

Running the Yellowbrick Deployer will push the Helm charts and container images you need into your cloud environment. For instructions on pushing assets using the Deployer, see the documentation.

Chart name: loki-stack

The get-assets subcommand can be used to find the version of the loki-stack chart; see the CLI reference.

Install Command

See Authenticating with ECR

bash
helm install loki oci://{image-repo}/loki-stack \
  -n {namespace} \
  -f values.yaml \
  --version {version}

INFO

Note that the release name when installing yb-monitoring must be loki, and the namespace must match the monitoring namespace supplied when installing the yb-operator Helm chart. We recommend using a namespace different from the one yb-operator itself is installed in.

Values

Note that the node group for yb-monitoring workloads is managed by the Yellowbrick Operator; we recommend not changing the node selectors and tolerations in the values file below.

yaml
fluent-bit:
  enabled: true
  image:
    repository: {image-repo}/yellowbrickdata/fluent-bit-plugin-loki
    tag: 2.8.8-13
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: {role-arn}
  nodeSelector: &nodeSelector
    cluster.yellowbrick.io/node_type: yb-mon-standard
  tolerations: &tolerations
  - effect: NoSchedule
    key: cluster.yellowbrick.io/owned
    operator: Equal
    value: "true"
grafana:
  deploymentStrategy:
    type: Recreate
  downloadDashboardsImage:
    repository: {image-repo}/curlimages/curl
    tag: 8.11.1
  image:
    repository: {image-repo}/grafana/grafana
    tag: 12.0.0
  initChownData:
    image:
      repository: {image-repo}/library/busybox
      tag: 1.31.1
  persistence:
    storageClassName: yb-gp3
  sidecar:
    image:
      repository: {image-repo}/kiwigrid/k8s-sidecar
      tag: 1.28.0
  nodeSelector: *nodeSelector
  tolerations: *tolerations
ingress:
  enabled: false
loki:
  extraContainers:
  - command:
    - /bin/sh
    - -c
    - |-
      trap cleanup 15
      cleanup()
      {
          echo "Shutting down the loki pvc monitor"
          exit
      }

      while true; do
          /delete_files_if_low_memory.sh
          sleep 360 &
          PID=$!
          wait $PID
      done;
    env:
    - name: SPACEMONITORING_FOLDER
      value: /data/loki/chunks
    image: {image-repo}/yellowbrickdata/loki-log-trimmer:v5
    name: pvcleanup
    volumeMounts:
    - mountPath: /data
      name: storage
  image:
    repository: {image-repo}/grafana/loki
    tag: 3.5.0
  persistence:
    size: 200Gi
    storageClassName: {storageclass}
  nodeSelector: *nodeSelector
  tolerations: *tolerations
prometheus:
  alertmanager:
    enabled: false
    image:
      repository: {image-repo}/prometheus/alertmanager
      tag: v0.27.0
    nodeSelector: *nodeSelector
    tolerations: *tolerations
  alertmanagerFiles: {}
  configmapReload:
    alertmanager:
      enabled: true
      image:
        repository: {image-repo}/jimmidyson/configmap-reload
        tag: v0.8.0
    prometheus:
      image:
        repository: {image-repo}/jimmidyson/configmap-reload
        tag: v0.8.0
  kube-state-metrics:
    image:
      repository: {image-repo}/kube-state-metrics/kube-state-metrics
      tag: v2.13.0
    nodeSelector: *nodeSelector
    tolerations: *tolerations
  nodeExporter:
    image:
      repository: {image-repo}/prometheus/node-exporter
      tag: v1.8.0
    nodeSelector: *nodeSelector
    tolerations: *tolerations
  processExporter:
    image:
      repository: {image-repo}/ncabatoff/process-exporter
      tag: sha-7ef0b73
    nodeSelector: *nodeSelector
    tolerations: *tolerations
  pushgateway:
    image:
      repository: {image-repo}/prom/pushgateway
      tag: v1.9.0
    nodeSelector: *nodeSelector
    tolerations: *tolerations
  server:
    image:
      repository: {image-repo}/prometheus/prometheus
      tag: v2.49.1
    persistentVolume:
      enabled: true
      size: 100Gi
      storageClass: {storageclass}
    nodeSelector: *nodeSelector
    tolerations: *tolerations
    extraInitContainers:
    - image: {image-repo}/library/busybox:1.31.1
      name: prometheus-wal-cleanup
      command:
      - /bin/sh
      - -c
      - if [ $(du -sm /data/wal | cut -f1) -gt 1024 ]; then rm -rf /data/wal/*; fi
      volumeMounts:
      - mountPath: /data
        name: storage-volume
  serverFiles:
    alerting_rules.yml:
      groups:
      - name: Host alerts
        rules:
        - alert: PVCUtilizationHigh
          annotations:
            message: The persistentVolume used by {{ $labels.persistentvolumeclaim
              }} is {{ $value | humanize }}% utilized. Please check and take appropriate
              action.
            summary: PVC utilization on PVC {{ $labels.persistentvolumeclaim }} is
              high
          expr: 100 * sum(kubelet_volume_stats_used_bytes) by(persistentvolumeclaim)
            /sum(kubelet_volume_stats_capacity_bytes) by (persistentvolumeclaim) >
            90
          for: 5m
          labels:
            severity: warning
        - alert: HostOutOfDiskSpace
          annotations:
            description: |-
              Disk is almost full (< 20% left)
                VALUE = {{ $value | humanize }}
                LABELS: {{ $labels }}
            message: |-
              Disk is almost full (< 20% left)
                VALUE = {{ $value | humanize }}%
                Node Name: {{ $labels.node }}
            summary: Host out of disk space (instance {{ $labels.node }})
          expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
            > 80
          for: 1s
          labels:
            severity: warning
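
The prometheus-wal-cleanup init container above is a one-line du-based threshold check: it empties the WAL directory only when its size exceeds 1024 MB. A standalone sketch of the same check, scaled down to a hypothetical temporary directory and a 1 MB threshold so it can be run anywhere:

```shell
# Sketch of the WAL-cleanup check: delete directory contents only when
# usage exceeds a size threshold. Directory and threshold are illustrative.
WAL_DIR="$(mktemp -d)"
THRESHOLD_MB=1          # the init container uses 1024 (1 GiB)

# Create ~2 MB of dummy "WAL" data so the threshold is exceeded
dd if=/dev/zero of="${WAL_DIR}/segment-000" bs=1024 count=2048 2>/dev/null

# Same shape as the init container's command, with the sizes swapped in
if [ "$(du -sm "${WAL_DIR}" | cut -f1)" -gt "${THRESHOLD_MB}" ]; then
    rm -rf "${WAL_DIR:?}"/*
fi
```

The `${WAL_DIR:?}` expansion guards the rm against an unset variable, which is a sensible habit even though the init container hard-codes its path.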

Creating Cloud Infrastructure

AWS

When installing on AWS, an IAM Roles for Service Accounts (IRSA) service account is used. For details on IRSA, see the AWS documentation.

Create the IAM role:

bash
aws iam create-role \
  --role-name yb-eks-pod-fluent-bit-{instance-name}-{region} \
  --assume-role-policy-document file://trust-policy.json

The trust policy:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "{oidc-provider-arn}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "{oidc-provider}:sub": "system:serviceaccount:{namespace}:yb-{namespace}-worker-sa"
        }
      }
    }
  ]
}
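
One way to produce trust-policy.json without hand-editing is a heredoc that expands the placeholders from shell variables. This is a sketch; the OIDC provider, account ID, and namespace below are hypothetical:

```shell
# Hypothetical values -- substitute your cluster's OIDC provider details
OIDC_PROVIDER="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B7"
OIDC_PROVIDER_ARN="arn:aws:iam::123456789012:oidc-provider/${OIDC_PROVIDER}"
NAMESPACE="yb-monitoring"

# Unquoted EOF lets the shell expand ${...} inside the JSON body
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "${OIDC_PROVIDER_ARN}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${NAMESPACE}:yb-${NAMESPACE}-worker-sa"
        }
      }
    }
  ]
}
EOF
```

Running `python3 -m json.tool trust-policy.json` afterwards is a quick way to confirm the generated file is valid JSON before passing it to aws iam create-role.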

Create the IAM policy:

bash
aws iam put-role-policy \
  --role-name yb-eks-pod-fluent-bit-{instance-name}-{region} \
  --policy-name diags-upload \
  --policy-document file://iam-policy.json

The IAM policy:

json
{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Effect": "Allow",
          "Action": [
              "s3:*"
          ],
          "Resource": "arn:aws:s3:::{observability-storage}/*"
      }
  ]
}

To the values above, add the following to the fluent-bit block, substituting the ARN of the AWS IAM role for {role-arn}:

yaml
fluent-bit:
...
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: {role-arn}
...