Create a reasonably scalable k8s cluster with Prometheus, Istio, HPA, KEDA and Karpenter.

Chris Haessig
Sep 9, 2022

Since I was laid off from Gemini, I am bored and thought a post was in order explaining how to create a Kubernetes cluster that can scale itself (looking for my next role, BTW). We will start by using Terraform to launch an EKS cluster. Then we will launch an nginx webserver that can grow based on metrics we pull from Prometheus (the metrics will come from the Istio ingress controller, which we will also install).

Once this logic is implemented, you can handle millions of requests a second; basically it's throwing money at the issue (more instances == more cost). New resources will come up when needed and be destroyed when not. Of course the real world has more variables, but let's pretend that's true.

Launch an EKS Cluster

Using Terraform, we will create an EKS cluster in AWS. We also configure IAM and install Karpenter via Helm.

module "eks" {source          = "terraform-aws-modules/eks/aws"
version = "<18"
cluster_version = "1.28"
cluster_name = var.cluster_name
vpc_id = module.vpc.vpc_id
subnets = module.vpc.private_subnets
enable_irsa = true
# Only need one node to get Karpenter up and running
worker_groups = [
{
instance_type = "t3a.medium"
asg_max_size = 1
}
]
}
resource "helm_release" "karpenter" {
depends_on = [module.eks.kubeconfig]
namespace = "karpenter"
create_namespace = true
name = "karpenter"
repository = "https://charts.karpenter.sh"
chart = "karpenter"
set {
name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = module.iam_assumable_role_karpenter.iam_role_arn
}
set {
name = "clusterName"
value = var.cluster_name
}
set {
name = "clusterEndpoint"
value = module.eks.cluster_endpoint
}
set {
name = "replicas"
value = 1
}
set {
name = "aws.defaultInstanceProfile"
value = aws_iam_instance_profile.karpenter.name
}

}
data "aws_iam_policy" "ssm_managed_instance" {
arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
resource "aws_iam_role_policy_attachment" "karpenter_ssm_policy" {
role = module.eks.worker_iam_role_name
policy_arn = data.aws_iam_policy.ssm_managed_instance.arn
}
resource "aws_iam_instance_profile" "karpenter" {
name = "KarpenterNodeInstanceProfile-${var.cluster_name}"
role = module.eks.worker_iam_role_name
}
module "iam_assumable_role_karpenter" {
source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
version = "4.7.0"
create_role = true
role_name = "karpenter-controller-${var.cluster_name}"
provider_url = module.eks.cluster_oidc_issuer_url
oidc_fully_qualified_subjects = ["system:serviceaccount:karpenter:karpenter"]
}
resource "aws_iam_role_policy" "karpenter_controller" {
name = "karpenter-policy-${var.cluster_name}"
role = module.iam_assumable_role_karpenter.iam_role_name
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"ec2:CreateLaunchTemplate",
"ec2:CreateFleet",
"ec2:RunInstances",
"ec2:CreateTags",
"iam:PassRole",
"ec2:TerminateInstances",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeInstances",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeInstanceTypes",
"ec2:DescribeInstanceTypeOfferings",
"ec2:DescribeAvailabilityZones",
"ssm:GetParameter"
]
Effect = "Allow"
Resource = "*"
},
]
})
}
provider "aws" {
region = "us-east-1"
}
variable "cluster_name" {
description = "The name of the cluster"
type = string
}
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = var.cluster_name
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = true
one_nat_gateway_per_az = false
private_subnet_tags = {
"kubernetes.io/cluster/${var.cluster_name}" = "owned"
}
}

What's important here is the IAM policy; it allows Karpenter to read metadata and create the necessary resources.

terraform plan --var cluster_name="chris"
terraform apply --var cluster_name="chris"
aws eks update-kubeconfig --name chris

Make sure everything came up and that you can use kubectl to talk to the cluster.
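For example, a quick sanity check might look like this (your node and pod names will differ):

# The single t3a.medium worker should be Ready
kubectl get nodes

# The Karpenter controller should be running in its namespace
kubectl -n karpenter get pods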

Setting up Istio and Prometheus

So now we have an empty cluster ready to go. We first want to install Prometheus; I really like the Prometheus Operator, which can be found here. It allows us to use Kubernetes CRDs to configure targets, define scrape endpoints, control the Prometheus server, and do a lot of other cool things, all from the CLI. This is the way to go when running Prometheus on k8s.

Install with helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

After a couple of minutes you will see the stateful sets, deployments, and daemon sets that Helm created. We can check out the Prometheus configuration by accessing the Prometheus CRD.

kubectl get pods -n monitoring
kubectl get Prometheus -n monitoring

NOTE: (at least with my test cluster) Prometheus did not configure persistent volumes, so data will be lost on restart.
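If you want persistence, the kube-prometheus-stack chart can attach a volume through the Prometheus CRD. A minimal sketch (the 20Gi size and default storage class are assumptions; adjust for your cluster):

# values.yaml (sketch) -- give Prometheus a PVC so data survives restarts
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

Apply it with helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml.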

Prometheus is now getting metrics from the k8s API, nodes, and other components. You can also import metrics from other applications (like Istio!). We want to get metrics from the Istio sidecar as well, so we can scale on them.

We install Istio on the cluster and tell it to inject the sidecar into any pod that starts in the web namespace. (You may want to configure Istio with custom settings, but we will use the basics for now.)

istioctl manifest install
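Give it a minute, then confirm istiod and the ingress gateway came up (pod names will vary):

kubectl -n istio-system get pods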

After everything comes up, we will apply the same Gateway and VirtualService we used in past posts. If you are not familiar with these, we just tell Istio to send traffic to the nginx pod if the hostname is chris.somecompany.com.

---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: gateway
  namespace: istio-system
spec:
  selector:
    app: istio-ingressgateway
  servers:
  - port:
      number: 8080
      name: http
      protocol: HTTP
    hosts:
    - "chris.somecompany.com"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: nginx
  namespace: web
spec:
  gateways:
  - istio-system/gateway
  hosts:
  - "chris.somecompany.com"
  http:
  - route:
    - destination:
        host: nginx.web.svc.cluster.local

Create and label the web namespace

kubectl create ns web
kubectl label namespace web istio-injection=enabled --overwrite

Create nginx pod and service

kubectl -n web create deploy nginx --image=nginx --port 80
kubectl -n web expose deploy nginx --port 80

We can see the nginx pod started with two containers: nginx itself and the Istio sidecar.

kubectl -n web get pods

NAME                     READY   STATUS    RESTARTS   AGE
nginx-6c8b449b8f-wkvd6   2/2     Running   0          6s

We also want to define a ServiceMonitor and a PodMonitor to tell Prometheus to scrape metrics from Istio. Of course you can label these any way you want, but below is how I did it (well, I took it mostly from the Istio docs). Apply to the cluster.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: istio-system
  labels:
    monitoring: istio-proxies
    release: prometheus
spec:
  selector:
    matchExpressions:
    - {key: istio-prometheus-ignore, operator: DoesNotExist}
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 15s
    relabelings:
    - action: keep
      sourceLabels: [__meta_kubernetes_pod_container_name]
      regex: "istio-proxy"
    - action: keep
      sourceLabels: [__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape]
    - sourceLabels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      targetLabel: __address__
    - action: labeldrop
      regex: "__meta_kubernetes_pod_label_(.+)"
    - sourceLabels: [__meta_kubernetes_namespace]
      action: replace
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: pod_name
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-component-monitor
  namespace: istio-system
  labels:
    monitoring: istio-components
    release: prometheus
spec:
  jobLabel: istio
  targetLabels: [app]
  selector:
    matchExpressions:
    - {key: istio, operator: In, values: [pilot]}
  namespaceSelector:
    any: true
  endpoints:
  - port: http-monitoring
    interval: 15s

Send some traffic through the ingress gateway; we should see the nginx title header returned. This tells us the request hit the nginx container via the ingress gateway, incrementing the request count.

# Create foreground tunnel  
kubectl -n istio-system port-forward svc/istio-ingressgateway 8080

Send off an HTTP request.

curl -H 'Host: chris.somecompany.com' localhost:8080 | grep --color -i title

<title>Welcome to nginx!</title>

Connect to Prometheus and let's look at the targets.

kubectl -n monitoring port-forward svc/prometheus-operated 9090

On the targets page (http://localhost:9090/targets) we can see an envoy-stats-monitor job with valid targets; Prometheus is finding our nginx pod and its metric endpoints. These metrics come from the Istio sidecar that runs in each pod, nice!

Querying the istio_requests_total metric with the nginx label gives us a count.

sum(istio_requests_total{destination_app="nginx"})

We get back 8, which means the nginx container has been accessed 8 times.
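One thing to note: istio_requests_total is a cumulative counter, so this sum only ever goes up. That is fine for this demo, but if you would rather scale on requests per second, a rate()-based query is the usual alternative (a sketch, not what we use below):

sum(rate(istio_requests_total{destination_app="nginx"}[2m]))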

We now have Istio and Prometheus set up and the metrics are being collected; we can scale on those!

Install KEDA and define the HPA

We will install KEDA, an open source tool we can add to Kubernetes to respond to events (in this context, triggers based on Prometheus metrics).

Install KEDA via Helm

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
kubectl create namespace keda
helm install keda kedacore/keda --namespace keda
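Confirm the KEDA operator pods came up (names will vary by chart version):

kubectl -n keda get pods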

We then define a ScaledObject CRD which will monitor the istio_requests_total metric from Prometheus. If the count goes over 10, our pod replica count will be increased.

Create the CRD

---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx
  namespace: web
spec:
  scaleTargetRef:
    kind: Deployment
    name: nginx
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 30
  pollingInterval: 1
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      metricName: istio_requests_total_keda
      query: |
        sum(istio_requests_total{destination_app="nginx"})
      threshold: "10"

An HPA was also created by KEDA; it is what actually changes the pod count. KEDA exposes the Prometheus query as an external metric, and the HPA targets an average value of 10 per pod.

kubectl -n web get hpa keda-hpa-nginx

Generate some traffic

curl -H 'Host: chris.somecompany.com' localhost:8080 | grep --color -i title

<title>Welcome to nginx!</title>

Once we generate enough traffic, we can see KEDA increase the replica count on the HPA, and now we have more pod replicas.

We just increased the pod replica count by monitoring metrics from Prometheus!

# New HPA values
kubectl get hpa -n web

NAME             REFERENCE          TARGETS          MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-nginx   Deployment/nginx   8667m/10 (avg)   1         10        3          11m

kubectl get pods -n web

NAME                     READY   STATUS            RESTARTS   AGE
nginx-6c8b449b8f-7mr4x   2/2     Running           0          3m29s
nginx-6c8b449b8f-lfmdv   0/2     PodInitializing   0
nginx-6c8b449b8f-wkvd6   2/2     Running           0          41m

Describing the HPA, you can see why it scaled the way it did. The TARGETS value above is the per-pod average in milli-units: the total query result divided by the current replica count (8667m ≈ 26 / 3).

kubectl describe hpa keda-hpa-nginx -n web

Normal  SuccessfulRescale  6m18s  horizontal-pod-autoscaler  New size: 3; reason: external metric s0-prometheus-istio_requests_total_keda(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}) above target

Karpenter

So we installed Karpenter when we launched the cluster; it has been running happily, but we have not configured it. If you look at the auto scaling group (ASG) that we defined with Terraform, it had a max of 1, so one EC2 instance should be running.

The HPA will add more pods based on load, but what if we run out of room on the node? This is where we configure Karpenter.

We define the instance sizes we want, tell Karpenter which cluster's subnets and security groups to use via tags, and apply. (Substitute ${var.cluster_name} with your actual cluster name.)

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  namespace: default
spec:
  limits:
    resources:
      cpu: 1k
      memory: 1000Gi
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    kind: AWS
    securityGroupSelector:
      kubernetes.io/cluster/${var.cluster_name}: owned
    subnetSelector:
      kubernetes.io/cluster/${var.cluster_name}: owned
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.2xlarge
    - c5.large
    - c5.xlarge
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 172800

So if too many pods get added, Karpenter will launch more nodes with the chris2022090904481313660000000c launch template. Of course you can add many more groups and settings, but this will work for now.

Once we apply, we can continue to generate traffic, which increases the count, which tells the HPA to add more replicas. Pods will go into a Pending state as they have nowhere to go.

kubectl get pods | grep Pend

nginx-6799fc88d8-2rplf   0/1   Pending   0   32s
nginx-6799fc88d8-8ln6v   0/1   Pending   0   32s
nginx-6799fc88d8-hhhgn   0/1   Pending   0   32s
nginx-6799fc88d8-mfh7v   0/1   Pending   0   32s
nginx-6799fc88d8-nmdtj   0/1   Pending   0   32s
nginx-6799fc88d8-rjnfx   0/1   Pending   0   32s
nginx-6799fc88d8-szgnd   0/1   Pending   0   32s
nginx-6799fc88d8-t9p6s   0/1   Pending   0   32s

But they will then all start running once the new nodes come up.

nginx-6799fc88d8-cc8x5   1/1   Running   0   24s
nginx-6799fc88d8-cpzx6   1/1   Running   0   46s
nginx-6799fc88d8-dlz4d   1/1   Running   0   24s
nginx-6799fc88d8-gwdrh   1/1   Running   0   24s
nginx-6799fc88d8-hg4s6   1/1   Running   0   24s

If it were to scale down, the now-empty nodes would also be destroyed (thanks to ttlSecondsAfterEmpty).
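You can watch both halves of this from the CLI. A sketch, assuming the chart's default Deployment and container names (karpenter / controller); adjust for your chart version:

# Watch nodes appear and disappear as Karpenter provisions and reclaims them
kubectl get nodes --watch

# Follow the controller logs to see provisioning and termination decisions
kubectl -n karpenter logs deploy/karpenter -c controller -f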

Summary

So we created a cluster that adds nodes dynamically based on resource demand with Karpenter, and we control the pod count by looking at Prometheus metrics from Istio. We just created a nice autoscaling cluster.

Profit ?
