Karpenter: Run Your Workloads at up to 80% Off Using Spot with AKS

Introduction

At last year's KubeCon North America, Microsoft announced Node Autoprovisioning (NAP), which adopts Karpenter in Azure Kubernetes Service (AKS) as an alternative to the Cluster Autoscaler (CA). Cluster Autoscaler has been the default node scaler in AKS and Kubernetes, but it comes with significant challenges that led to the adoption of Karpenter. This post delves into these challenges and explores how Karpenter addresses them.

Challenges with Cluster Autoscaler

The node autoscaling flow with Cluster Autoscaler has the following limitations:

    1. Limited to VMSS Groups: Cluster Autoscaler can only operate with Virtual Machine Scale Sets (VMSS) in AKS. Each VMSS consists of a group of VM instances that share one VM SKU, and therefore the same hardware and CPU:memory ratio (e.g., Standard_D4s_v5 with 4 vCPUs and 16 GiB RAM).

    2. Node Latency: CA triggers the node pool API, which calls the VMSS instance API. This scaling process has latency, taking over a minute for a node to be ready in AKS.

    3. Node Pool Constraints: When deploying new pods, if the existing node capacity is exhausted, CA attempts to spin up a new node of the same VMSS SKU type. If that instance type is unavailable, pods remain in a pending state.

    4. Scalability Limitations: CA can only scale up based on the availability of the node pool's own VMSS SKU. It cannot leverage the capacity of other VM SKUs even if they have available resources.

Introducing Karpenter (Node Autoprovisioning)

Karpenter is an efficient node autoscaler for Kubernetes clusters, designed to optimize performance and cost. It can scale up and down worker nodes faster than Cluster Autoscaler and can launch appropriate individual nodes without creating traditional node groups in AKS.

  • Key Features of Karpenter:

    • Efficiency: Faster scaling of Kubernetes nodes.

    • Flexibility: Launches nodes without needing VMSS.

    • Cost Optimization: Reduces overall costs and helps with patching of node images and Kubernetes versions.

    • Declarative Configuration: A YAML-based NodePool config defines which types of nodes Karpenter can provision.

Handling Disruptions

The disruption controller is responsible for terminating and replacing nodes in the Kubernetes cluster. It uses one of three automated methods to decide which nodes to disrupt:

  • Expiration: Karpenter marks nodes as expired and disrupts them after they have lived for a set duration. The expireAfter parameter acts as a TTL for Kubernetes nodes.
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 300s
  • ConsolidateAfter: Configures the disruption interval, that is, how long Karpenter waits before considering a disruption cycle again. In the v1beta1 API this value can only be set together with the WhenEmpty policy.

  • Consolidation: Karpenter actively reduces cluster cost by analyzing nodes.
    The consolidation policy has two modes:
    a) WhenEmpty: Karpenter only disrupts nodes that run no workload pods.
    b) WhenUnderutilized: Karpenter attempts to remove or replace nodes that are underutilised.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s

Enable NAP (Karpenter) on AKS

There are a few prerequisites for enabling NAP on AKS:

  • Install the Azure CLI with the aks-preview extension, version greater than 0.5.17

  • Register the NAP preview feature flag, "NodeAutoProvisioningPreview"

  • An AKS cluster with the network configuration Cilium + Overlay
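
The first two prerequisites can be scripted roughly as follows. This is a sketch rather than an official procedure: the `Microsoft.ContainerService` namespace is an assumption that is not stated above, `version_gt` and `register_nap_prereqs` are hypothetical helper names, and the version comparison relies on `sort -V` being available.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when version $1 is strictly greater than $2.
# Relies on `sort -V` (version sort) to order the two version strings.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical helper: installs or upgrades aks-preview, checks its version,
# then registers the NAP preview feature.
register_nap_prereqs() {
  local installed
  az extension add --name aks-preview --upgrade || return 1
  installed="$(az extension show --name aks-preview --query version --output tsv)" || return 1
  if ! version_gt "$installed" "0.5.17"; then
    echo "aks-preview $installed found; a version greater than 0.5.17 is required" >&2
    return 1
  fi
  # Assumption: the feature is registered under Microsoft.ContainerService.
  az feature register --namespace Microsoft.ContainerService \
    --name NodeAutoProvisioningPreview || return 1
  # Re-register the provider so the feature registration propagates.
  az provider register --namespace Microsoft.ContainerService
}
```

Calling `register_nap_prereqs` after `az login` runs the steps in order. Feature registration can take a few minutes; `az feature show` with the same `--namespace` and `--name` reports its current state.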

Enable NAP on an existing AKS cluster

Make sure the existing AKS cluster uses the 'azure' network plugin with Cilium as the network policy. The key part of this command is the flag '--node-provisioning-mode Auto', which sets NAP as the default node autoscaler.

az aks update --name aksclustername --resource-group rgname --node-provisioning-mode Auto

Deploy NAP with new AKS cluster

az aks create --name aksclustername --resource-group rgname --node-provisioning-mode Auto --network-plugin azure --network-plugin-mode overlay --network-dataplane cilium

Verify Karpenter Enablement:

kubectl api-resources | grep -e aksnodeclasses -e nodeclaims -e nodepools

aksnodeclasses                     aksnc,aksncs                        karpenter.azure.com/v1alpha2           false        AKSNodeClass
nodeclaims                                                             karpenter.sh/v1beta1                   false        NodeClaim
nodepools                                                              karpenter.sh/v1beta1                   false        NodePool
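
To script the check above, the `kubectl api-resources` output can be piped through a small helper that fails when any of the three resource types is missing. The function name `has_karpenter_crds` is a hypothetical one chosen for this sketch.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `kubectl api-resources` output on stdin and
# succeeds only if aksnodeclasses, nodeclaims and nodepools are all listed.
has_karpenter_crds() {
  local listing resource
  listing="$(cat)"
  for resource in aksnodeclasses nodeclaims nodepools; do
    # The resource name is the first column of each row.
    if ! printf '%s\n' "$listing" | grep -q "^${resource}[[:space:]]"; then
      echo "missing resource type: ${resource}" >&2
      return 1
    fi
  done
}

# Usage against a live cluster:
#   kubectl api-resources | has_karpenter_crds && echo "NAP resources are registered"
```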

Disabling Cluster-Autoscaler

To switch from Cluster-Autoscaler to Karpenter, disable Cluster-Autoscaler on your AKS cluster:

az aks update --name aksclustername --resource-group aksrg --disable-cluster-autoscaler

Deploying a Sample Application

To see Node-Autoprovisioning in action, deploy a sample application:

osama [ ~ ]$ kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-default-h2jxh                   Ready    agent   35m     v1.27.9
aks-nodepool1-41633911-vmss000000   Ready    agent   3d19h   v1.27.9

Scale up the replicas of the Vote application to trigger scale-out events:

osama [ ~ ]$ kubectl scale deployment azure-vote-front --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
osama [ ~ ]$ kubectl scale deployment azure-vote-back --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled

Verify that the nodes scaled out by reading the events emitted by Karpenter with the kubectl command below:

kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
NAMESPACE           LAST SEEN   TYPE     REASON                  OBJECT                                  MESSAGE
default             50m         Normal   Unconsolidatable        nodeclaim/default-95f54                 SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default             50m         Normal   Unconsolidatable        node/aks-default-95f54                  SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default             38m         Normal   DisruptionBlocked       nodepool/default                        No allowed disruptions due to blocking budget
default             5m33s       Normal   Unconsolidatable        nodeclaim/default-h2jxh                 Can't remove without creating 2 candidates
default             5m33s       Normal   Unconsolidatable        node/aks-default-h2jxh                  Can't remove without creating 2 candidates
default             2m12s       Normal   DisruptionBlocked       nodepool/system-surge                   No allowed disruptions due to blocking budget
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-bnq7p   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-gbwk6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-l2bgj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-nvc56   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-22glj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-sxdl6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-t69w4   Pod should schedule on: nodeclaim/default-mrh7w
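
When many pods are nominated at once, a per-NodeClaim tally is easier to read than the raw event list. The sketch below assumes the column layout shown above (output of `kubectl get events -A`, where the fourth whitespace-separated field is REASON and the last field of a Nominated message is the NodeClaim). `count_nominations` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `kubectl get events -A` output on stdin and
# prints "<nodeclaim> <count>" for every NodeClaim that had pods nominated.
count_nominations() {
  awk '$4 == "Nominated" { tally[$NF]++ }
       END { for (claim in tally) print claim, tally[claim] }' | sort
}

# Usage against a live cluster:
#   kubectl get events -A --field-selector source=karpenter | count_nominations
```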

Customise Karpenter Config

Karpenter introduces a new Kubernetes resource kind, the NodePool, which can be customised in several ways:

  • Customise NodePools: Select a specific VM series or VM family, or even a specific CPU:memory ratio.

  • Select nodes based on feature sets such as GPU support or network acceleration.

  • Define the CPU architecture, either ARM or AMD, based on what a specific workload supports.

  • Architect your nodes for resiliency by configuring zone topology.

  • Limit the total CPU and memory that a NodePool may provision.

Here is the default NodePool YAML for Karpenter (NAP). It configures the node SKU types and capacity type, sets limits on CPU and memory, and assigns a weight for use when there are multiple NodePools.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 10s
  template:
    spec:
      nodeClassRef:
        name: default

      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - E
        - D
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_E2s_v5
        - Standard_D4s_v3
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 100

Using Spot Nodes with Karpenter

  • Add a toleration for the taint "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" to the sample AKS-Vote application. Spot nodes provisioned in an AKS cluster carry this taint by default.

  • Please refer to my GitHub repo for the application YAML and a sample NodePool config.

      spec:
            nodeSelector:
              "kubernetes.io/os": linux
            tolerations:
            - key: "kubernetes.azure.com/scalesetpriority"
              operator: "Equal"
              value: "spot"
              effect: "NoSchedule"
            containers:
            - name: azure-vote-front
              image: mcr.microsoft.com/azuredocs/azure-vote-front:v1
    
  • Scale down your application replicas to allow Karpenter to evict existing on-demand nodes and replace them with Spot nodes:

      osama [ ~/karpenter ]$ kubectl get nodes
      NAME                                STATUS   ROLES   AGE     VERSION
      aks-nodepool1-41633911-vmss000000   Ready    agent   3d21h   v1.27.9
      aks-nodepool1-41633911-vmss00000b   Ready    agent   24m     v1.27.9
    
      osama [ ~/karpenter ]$ kubectl get pods -n karpenter-demo-ns -o wide
      No resources found in karpenter-demo-ns namespace.
    
      osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-back --replicas=10 -n karpenter-demo-ns
      deployment.apps/azure-vote-back scaled
      osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-front --replicas=10 -n karpenter-demo-ns
      deployment.apps/azure-vote-front scaled
    
  • Deploy and scale the Vote application replicas so that Karpenter spins up Spot nodes based on the NodePool configuration and schedules the pods on them once their tolerations match the Spot taint.

  • Karpenter spins up new Spot nodes and nominates them for scheduling the sample Vote app:

    
      osama [ ~/karpenter ]$ kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
      NAMESPACE           LAST SEEN   TYPE      REASON                       OBJECT                                    MESSAGE
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-pz8sp      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-ckdcq      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-v9nqj      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-vswvs      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-lnxmp      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-jc2jz      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-hwnbh      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-r7msb      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-96lm9      Pod should schedule on: nodeclaim/default-52gbg
      karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-5qcvk      Pod should schedule on: nodeclaim/default-52gbg
      default             1s          Normal    DisruptionLaunching          nodeclaim/default-bkz6c                   Launching NodeClaim: Expiration/Replace
      default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-bkz6c                   Waiting on readiness to continue disruption
      default             1s          Normal    DisruptionBlocked            nodepool/system-surge                     No allowed disruptions due to blocking budget
      default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-5vp7x                   Waiting on readiness to continue disruption
      default             1s          Normal    DisruptionLaunching          nodeclaim/default-5vp7x                   Launching NodeClaim: Expiration/Replace
    

Configuring Multiple NodePools

  • To configure separate NodePools for Spot and On-Demand capacity:

    Spot nodes are configured with the E-series VM "Standard_E2s_v5" and On-Demand nodes with the D-series VM "Standard_D4s_v5".

  • In a multi-NodePool scenario, each NodePool needs to be configured with a 'weight' attribute. The NodePool with the highest weight is prioritised over the others. Here the Spot NodePool has weight: 100 and the On-Demand NodePool has weight: 60.

osama [ ~ ]$ kubectl get nodepool default -o yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - B
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_B2s_v2
  weight: 100
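
To see how the weights order the NodePools, their names and weights can be listed highest weight first, which is the priority order described above. This is a sketch: `by_weight_desc` is a hypothetical helper, and the `custom-columns` paths in the usage comment assume the `spec.weight` field shown in the YAML.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "<name> <weight>" rows on stdin and prints them
# sorted by weight, highest first.
by_weight_desc() {
  sort -k2,2nr
}

# Usage against a live cluster:
#   kubectl get nodepools --no-headers \
#     -o custom-columns=NAME:.metadata.name,WEIGHT:.spec.weight | by_weight_desc
```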

  • If we do not specify an explicit SKU name, Karpenter will consider the entire VM series.

  • To validate that the sample VoteApp is running on Spot nodes, use the following commands. The output should indicate that the nodes are of capacity type "spot":

      osama [ ~ ]$ kubectl get pods -n karpenter-demo-ns -o wide
      NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
      azure-vote-back-687ddb67bd-w7ghm    1/1     Running   0          63m   10.244.3.11    aks-default-5cr5f   <none>           <none>
      azure-vote-front-6855444955-64558   1/1     Running   0          63m   10.244.3.168   aks-default-5cr5f   <none>           <none>
      osama [ ~ ]$ kubectl describe node aks-default-5cr5f | grep karpenter.sh
                          karpenter.sh/capacity-type=spot
                          karpenter.sh/initialized=true
                          karpenter.sh/nodepool=default
                          karpenter.sh/registered=true
                          karpenter.sh/nodepool-hash: 12393960163388511505
                          karpenter.sh/nodepool-hash-version: v2
    

Simulating Spot Node Eviction

  • To test the Spot eviction scenario, simulate a Spot eviction using the Azure CLI:

      osama [ ~ ]$ az vm simulate-eviction --resource-group MC_aks-lab_aks-karpenter_eastus --name aks-default-5cr5f
      osama [ ~ ]$ date
      Tue May 21 06:20:02 PM IST 2024
    
  • Monitor the availability of your VoteApp using a simple curl command:

      while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done
    
  • After running the Spot simulation, the existing node will be marked for termination, and a new Spot node will be created to schedule the VoteApp pods. In the run below, the VoteApp fails one probe and is responding with HTTP 200 status codes again about a minute later.

        root@MININT-8C81HDE:/home/osamaex while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done
        Tue May 21 18:20:04 IST 2024 > GET / HTTP/1.1
        < HTTP/1.1 200 OK
        HTTP 200
        Tue May 21 18:20:07 IST 2024 > GET / HTTP/1.1
        < HTTP/1.1 200 OK
        HTTP 200
        Tue May 21 18:20:09 IST 2024 > GET / HTTP/1.1
        < HTTP/1.1 200 OK
        HTTP 200
        Tue May 21 18:20:12 IST 2024 HTTP 000  $Failure-Alert
        Tue May 21 18:21:14 IST 2024 > GET / HTTP/1.1
        < HTTP/1.1 200 OK                      $Successful-Response
        HTTP 200
        Tue May 21 18:22:58 IST 2024 > GET / HTTP/1.1
        < HTTP/1.1 200 OK
        HTTP 200
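
Because the loop above passes `-v`, curl prints request and response header lines next to the status code, so a successful probe spans three lines. A quieter variant, sketched below, prints one timestamped line per probe. The helper name `probe` and the 2-second timeout are assumptions made for this sketch; `http://voteapp.com` is the endpoint used above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: probes the URL in $1 once and prints
# "<timestamp> HTTP <status code>" on a single line.
probe() {
  local code
  # curl writes 000 for %{http_code} when it gets no HTTP response;
  # `|| true` keeps a failed probe from aborting a script run with `set -e`.
  code="$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" || true)"
  printf '%s HTTP %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$code"
}

# Usage:
#   while true; do probe http://voteapp.com; sleep 2; done
```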
    
  • Check the events logged by Karpenter as it replaces the evicted Spot node:

        osama [ ~ ]$ kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
        NAMESPACE           LAST SEEN   TYPE      REASON           OBJECT                                  MESSAGE
        default             23s         Warning   FailedDraining   node/aks-default-5cr5f                  Failed to drain node, 10 pods are waiting to be evicted
        karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-back-687ddb67bd-w7ghm    Evicted pod
        karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-front-6855444955-64558   Evicted pod
        karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-back-687ddb67bd-tb2pv    Pod should schedule on: nodeclaim/default-6zkkl
        karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-front-6855444955-7wzss   Pod should schedule on: nodeclaim/default-6zkkl
    
  • Verify that the pods are running on the new Spot node:

      kubectl get pods -n karpenter-demo-ns -o wide
      NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
      azure-vote-back-687ddb67bd-tb2pv    1/1     Running   0          18m   10.244.2.103   aks-default-6zkkl   <none>           <none>
      azure-vote-front-6855444955-7wzss   1/1     Running   0          18m   10.244.2.47    aks-default-6zkkl   <none>           <none>
    

Save Cost by Utilizing Reserved Instance VMs

  • A NodePool configuration can specify several VM series along with multiple VM SKUs. Create a separate NodePool with the highest weight value and list all Reserved Instance VM SKU families or explicit SKU names using the karpenter.azure.com/sku-family or karpenter.azure.com/sku-name requirement.

         spec:
           template:
             spec:
               nodeClassRef:
                 name: default
               requirements:
               - key: kubernetes.io/arch
                 operator: In
                 values:
                 - amd64
               - key: kubernetes.io/os
                 operator: In
                 values:
                 - linux
               - key: karpenter.sh/capacity-type
                 operator: In
                 values:
                 - on-demand
               - key: karpenter.azure.com/sku-family
                 operator: In
                 values:
                 - D
               - key: karpenter.azure.com/sku-name
                 operator: In
                 values: [Standard_D2s_v3, Standard_D4s_v3, Standard_D8s_v3, Standard_D16s_v3, Standard_D32s_v3, Standard_D64s_v3, Standard_D96s_v3]
           weight: 90
    

Conclusion

The adoption of Karpenter in AKS is a major step forward in node scaling efficiency, flexibility, and cost optimization. By addressing the limitations of the Cluster Autoscaler with dynamic, rapid node provisioning across different VM types, Karpenter provides a robust way to manage Kubernetes clusters and helps organizations achieve more responsive and cost-effective deployments.