Build Your EKS Fleet: A Guide to Kubernetes Cluster Management
Case Study: Zippy Rides is a fast-growing ride-sharing startup. As their services and user base expand globally, their Kubernetes infrastructure has become decentralized and complex: they run over 50 EKS clusters across multiple AWS regions and accounts for development, testing, and production.
This vast cluster fleet has become a challenge to monitor and scale efficiently. Configurations drift as teams modify clusters independently, and there is no centralized visibility into resource utilization and costs. Deploying updates is a manual process, and scaling clusters up or down means going into each cluster individually, which is time-consuming.
To address these pain points, Zippy Rides implemented a comprehensive EKS cluster fleet management solution.
Cluster Provisioning:
They leveraged Terraform modules to automate the provisioning of new EKS clusters, ensuring a standard configuration for control plane version, node sizes, and security groups:
# EKS Cluster Terraform Module
resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids = [aws_subnet.example1.id, aws_subnet.example2.id]
  }

  # Other config like version, logging, tags
}
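As a minimal sketch of how teams might consume such a module, each new cluster becomes a short module call. The module path, variable names, and values below are illustrative assumptions rather than Zippy Rides' actual inputs:

# Hypothetical per-cluster module call; inputs shown are examples only
module "prod_useast1" {
  source = "./modules/eks-cluster"   # assumed local module path

  cluster_name       = "prod-useast1"
  kubernetes_version = "1.27"        # illustrative version pin
  subnet_ids         = [aws_subnet.example1.id, aws_subnet.example2.id]
  node_instance_type = "m5.large"
}

Standard settings (logging, tags, security groups) live inside the module, so individual teams only choose the few inputs that legitimately differ per cluster.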
Access and Security:
IAM roles and security groups locked down control plane access, and admission controllers enforced pod security policies across clusters:
# Require pod security standards (PodSecurityPolicy is cluster-scoped)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  # Restrict privileged pods, host access, etc.
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
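On the network side of that lockdown, one common pattern is to keep the EKS API endpoint reachable privately and restrict public access to an allowlist through the cluster's vpc_config. The sketch below uses placeholder values; the security group reference and CIDR are assumptions:

# Assumed sketch: tighten control plane access in the module's vpc_config
resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids              = [aws_subnet.example1.id, aws_subnet.example2.id]
    security_group_ids      = [aws_security_group.eks_control_plane.id] # assumed SG
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = ["203.0.113.0/24"] # placeholder office CIDR
  }
}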
Observability:
They used the Prometheus Operator to collect metrics and Grafana for cluster monitoring dashboards:
# Install the Prometheus Operator Helm chart
helm install prometheus-operator stable/prometheus-operator \
  --set prometheusOperator.createCustomResource=false

# Run Grafana to view utilization dashboards
grafana-server --config=/etc/grafana/grafana.ini
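To keep the monitoring stack identical on every cluster, the same Helm install can also be expressed in Terraform. The sketch below is an assumption about how that could look, using the community kube-prometheus-stack chart (the successor to stable/prometheus-operator); the release name, namespace, and retention override are illustrative:

# Assumed sketch: install the monitoring stack via Terraform's Helm provider
provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"   # assumes a kubeconfig for the target cluster
  }
}

resource "helm_release" "monitoring" {
  name             = "monitoring"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  set {
    name  = "prometheus.prometheusSpec.retention"   # illustrative value override
    value = "15d"
  }
}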
Multi-Cluster Management:
To manage multiple EKS clusters spread across regions, Zippy Rides leveraged Rancher's multi-cluster management capabilities:
- Rancher provided a single control plane to manage all of their Kubernetes clusters (EKS and self-hosted) from one place.
- Role-based access control enabled granting users access to specific clusters and namespaces.
- Rancher's fleet tracking gave insight into their inventory of clusters across environments.
- Health checks centrally monitored cluster and node health and reliability.
# Import an existing EKS cluster into Rancher
# (exact subcommands and flags vary by Rancher CLI version)
rancher cluster create \
  --name prod-useast1 \
  --eksConfig \
  --region us-east-1
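The same registration can also be captured as code. Below is a hedged sketch using the rancher2 Terraform provider, where a cluster resource without a cloud-specific config block is registered as an imported cluster; the Rancher URL and token are placeholders:

# Assumed sketch: register prod-useast1 with Rancher via the rancher2 provider
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

provider "rancher2" {
  api_url   = "https://rancher.example.com"   # placeholder Rancher server URL
  token_key = "token-xxxxx:placeholder"       # placeholder API token
}

# Without a cloud-specific config block, Rancher treats this as an imported cluster
resource "rancher2_cluster" "prod_useast1" {
  name        = "prod-useast1"
  description = "Imported EKS cluster in us-east-1"
}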
Rancher also enabled deploying applications across their cluster fleet through GitOps pipelines.
Optimization:
To optimize costs and resources across their large fleet, they implemented autoscaling, right-sizing, and spot instances:
- The Cluster Autoscaler scaled node groups up or down based on utilization.
- Cost-monitoring tools like Kubecost helped right-size workloads and clusters.
- Rancher tuned resource quotas and limits by namespace and cluster.
- Underutilized clusters were resized to smaller instance types.
- Non-production clusters leveraged EC2 Spot Instances for cheaper compute (a Terraform sketch follows the autoscaler example below).
# Scale the EKS node group between 2 and 10 nodes, and scale down nodes
# whose utilization stays below 50%
cluster-autoscaler --nodes=2:10:k8s-worker-nodes-xyz \
  --scale-down-utilization-threshold=0.5
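One way to combine the Spot and autoscaling points is a dedicated managed node group. The sketch below is illustrative only; the node group name, IAM role, subnets, and instance types are assumptions, sized to the 2-10 node range the autoscaler example above manages:

# Assumed sketch: a non-prod managed node group running on Spot capacity
resource "aws_eks_node_group" "spot_workers" {
  cluster_name    = aws_eks_cluster.example.name
  node_group_name = "spot-workers"                  # hypothetical name
  node_role_arn   = aws_iam_role.node.arn           # assumes an existing node IAM role
  subnet_ids      = [aws_subnet.example1.id, aws_subnet.example2.id]

  capacity_type  = "SPOT"
  instance_types = ["m5.large", "m5a.large"]        # diversified for Spot availability

  scaling_config {
    min_size     = 2
    max_size     = 10
    desired_size = 2
  }
}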
By leveraging these multi-cluster management and optimization best practices, Zippy Rides was able to efficiently operate their large and growing Kubernetes footprint across regions and environments.