Build Your EKS Fleet: A Guide to Kubernetes Cluster Management
Case Study: Zippy Rides is a fast-growing ride-sharing startup. As their services and user base expand globally, their Kubernetes infrastructure has become decentralized and complex: they run over 50 EKS clusters across multiple AWS regions and accounts for development, testing, and production.
This vast cluster fleet has become a challenge to monitor and scale efficiently. Configurations drift as teams modify clusters independently, and there is no centralized visibility into resource utilization and costs. Deploying updates is a manual process, and scaling clusters up or down means going into each cluster individually, which is time-consuming.
To address these pain points, Zippy Rides implemented a comprehensive EKS cluster fleet management solution.
Cluster Provisioning:
They leveraged Terraform modules to automate the provisioning of new EKS clusters, ensuring a standard configuration for control plane version, node sizes, and security groups:
# EKS Cluster Terraform Module
resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids = [aws_subnet.example1.id, aws_subnet.example2.id]
  }

  # Other config like version, logging, tags
}
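As a minimal sketch of how teams might consume such a module, each new cluster becomes a short module call. The module path, variable names, and values below are illustrative assumptions rather than Zippy Rides' actual inputs:

# Hypothetical per-cluster module call; inputs shown are examples only
module "prod_useast1" {
  source = "./modules/eks-cluster"   # assumed local module path

  cluster_name       = "prod-useast1"
  kubernetes_version = "1.27"        # illustrative version pin
  subnet_ids         = [aws_subnet.example1.id, aws_subnet.example2.id]
  node_instance_type = "m5.large"
}

Standard settings (logging, tags, security groups) live inside the module, so individual teams only choose the few inputs that legitimately differ per cluster.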
Access and Security:
IAM roles and security groups locked down control plane access, and admission controllers enforced pod security policies across clusters:
# Require pod security standards (PodSecurityPolicy is cluster-scoped)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  # Restrict privileged pods, host access, etc.
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
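On the network side of that lockdown, one common pattern is to keep the EKS API endpoint reachable privately and restrict public access to an allowlist through the cluster's vpc_config. The sketch below uses placeholder values; the security group reference and CIDR are assumptions:

# Assumed sketch: tighten control plane access in the module's vpc_config
resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = aws_iam_role.example.arn

  vpc_config {
    subnet_ids              = [aws_subnet.example1.id, aws_subnet.example2.id]
    security_group_ids      = [aws_security_group.eks_control_plane.id] # assumed SG
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = ["203.0.113.0/24"] # placeholder office CIDR
  }
}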
Observability:
They used the Prometheus Operator to collect metrics and Grafana for cluster monitoring dashboards:
# Install the Prometheus Operator Helm chart
helm install prometheus-operator stable/prometheus-operator \
  --set prometheusOperator.createCustomResource=false

# Run Grafana to view utilization dashboards
grafana-server --config=/etc/grafana/grafana.ini
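To keep the monitoring stack identical on every cluster, the same Helm install can also be expressed in Terraform. The sketch below is an assumption about how that could look, using the community kube-prometheus-stack chart (the successor to stable/prometheus-operator); the release name, namespace, and retention override are illustrative:

# Assumed sketch: install the monitoring stack via Terraform's Helm provider
provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"   # assumes a kubeconfig for the target cluster
  }
}

resource "helm_release" "monitoring" {
  name             = "monitoring"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  set {
    name  = "prometheus.prometheusSpec.retention"   # illustrative value override
    value = "15d"
  }
}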
Multi-Cluster Management:
To manage multiple EKS clusters spread across regions, Zippy Rides leveraged Rancher's multi-cluster management capabilities:
- Rancher provided a single control plane to manage all of their Kubernetes clusters (EKS and self-hosted) from one place.
- Role-based access control enabled granting users access to specific clusters and namespaces.
- Rancher's fleet tracking gave insight into their inventory of clusters across environments.
- Health checks centrally monitored cluster and node health and reliability.
# Import an existing EKS cluster into Rancher
# (exact subcommands and flags vary by Rancher CLI version)
rancher cluster create \
  --name prod-useast1 \
  --eksConfig \
  --region us-east-1
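The same registration can also be captured as code. Below is a hedged sketch using the rancher2 Terraform provider, where a cluster resource without a cloud-specific config block is registered as an imported cluster; the Rancher URL and token are placeholders:

# Assumed sketch: register prod-useast1 with Rancher via the rancher2 provider
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

provider "rancher2" {
  api_url   = "https://rancher.example.com"   # placeholder Rancher server URL
  token_key = "token-xxxxx:placeholder"       # placeholder API token
}

# Without a cloud-specific config block, Rancher treats this as an imported cluster
resource "rancher2_cluster" "prod_useast1" {
  name        = "prod-useast1"
  description = "Imported EKS cluster in us-east-1"
}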
Rancher also enabled deploying applications across their cluster fleet through GitOps pipelines.
Optimization:
To optimize costs and resources across their large fleet, they implemented autoscaling, right-sizing, and spot instances:
- The Cluster Autoscaler scaled node groups up or down based on utilization.
- Cost-monitoring tools like Kubecost helped right-size workloads and clusters.
- Rancher tuned resource quotas and limits by namespace and cluster.
- Underutilized clusters were resized to smaller instance types.
- Non-production clusters leveraged EC2 Spot Instances for cheaper compute (a Terraform sketch follows the autoscaler example below).
# Scale the EKS node group between 2 and 10 nodes, and scale down nodes
# whose utilization stays below 50%
cluster-autoscaler --nodes=2:10:k8s-worker-nodes-xyz \
  --scale-down-utilization-threshold=0.5
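One way to combine the Spot and autoscaling points is a dedicated managed node group. The sketch below is illustrative only; the node group name, IAM role, subnets, and instance types are assumptions, sized to the 2-10 node range the autoscaler example above manages:

# Assumed sketch: a non-prod managed node group running on Spot capacity
resource "aws_eks_node_group" "spot_workers" {
  cluster_name    = aws_eks_cluster.example.name
  node_group_name = "spot-workers"                  # hypothetical name
  node_role_arn   = aws_iam_role.node.arn           # assumes an existing node IAM role
  subnet_ids      = [aws_subnet.example1.id, aws_subnet.example2.id]

  capacity_type  = "SPOT"
  instance_types = ["m5.large", "m5a.large"]        # diversified for Spot availability

  scaling_config {
    min_size     = 2
    max_size     = 10
    desired_size = 2
  }
}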
By leveraging these multi-cluster management and optimization best practices, Zippy Rides was able to efficiently operate their large and growing Kubernetes footprint across regions and environments.