Deprovisioning

Understand different ways Karpenter deprovisions nodes

Karpenter sets a Kubernetes finalizer on each node it provisions. The finalizer specifies additional actions the Karpenter controller will take in response to a node deletion request. These include:

  • Marking the node as unschedulable, so no further pods can be scheduled there.
  • Evicting all pods other than daemonsets from the node.
  • Terminating the instance from the cloud provider.
  • Deleting the node from the Kubernetes cluster.

Methods

There are both automated and manual ways of deprovisioning nodes provisioned by Karpenter:

  • Provisioner Deletion: Nodes are considered to be “owned” by the Provisioner that launched them. Karpenter will gracefully terminate nodes when a provisioner is deleted.
  • Emptiness: Karpenter notes when the last workload (non-daemonset) pod stops running on a node. From that point, Karpenter waits the number of seconds set by ttlSecondsAfterEmpty in the provisioner, then Karpenter requests to delete the node. This feature can keep costs down by removing nodes that are no longer being used for workloads.
  • Expiration: Karpenter requests to delete the node after a set number of seconds, based on the provisioner ttlSecondsUntilExpired value, from the time the node was provisioned. One use case for node expiry is to handle node upgrades. Old nodes (with a potentially outdated Kubernetes version or operating system) are deleted, and replaced with nodes on the current version (assuming that you requested the latest version, rather than a specific version).
  • Consolidation: Karpenter works to actively reduce cluster cost by identifying when nodes can be removed because their workloads will run on other nodes in the cluster, and when nodes can be replaced with cheaper variants because the workloads have changed.
  • Interruption: If enabled, Karpenter will watch for upcoming involuntary interruption events that could affect your nodes (health events, spot interruption, etc.) and will cordon, drain, and terminate the node(s) ahead of the event to reduce workload disruption.
  • Drift: Karpenter will deprovision nodes that have drifted from their desired specification. Once a node is annotated as drifted, Karpenter will deprovision it and, when needed, provision a replacement node with the correct provisioning requirements. Currently, Karpenter will only automatically mark nodes as drifted in the case of a drifted AMI.
  • Node Deletion: You can use kubectl to manually remove a single Karpenter node or a group of nodes:

    # Delete a specific node
    kubectl delete node $NODE_NAME
    
    # Delete all nodes owned by any provisioner
    kubectl delete nodes -l karpenter.sh/provisioner-name
    
    # Delete all nodes owned by a specific provisioner
    kubectl delete nodes -l karpenter.sh/provisioner-name=$PROVISIONER_NAME
    

Whether a node is removed through expiry or manual deletion, Karpenter seeks to follow the graceful termination procedures described in the Kubernetes Graceful node shutdown documentation. If the Karpenter controller is removed or fails, the finalizers on the nodes are orphaned and must be removed manually.
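
The emptiness and expiration methods listed above are both driven by TTL fields on the provisioner. The following is a minimal sketch, assuming a v1alpha5 Provisioner named default; the TTL values are illustrative assumptions rather than recommendations.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default   # assumed provisioner name
spec:
  # Emptiness: request node deletion 30 seconds after the last
  # non-daemonset pod stops running on the node (illustrative value).
  ttlSecondsAfterEmpty: 30
  # Expiration: request node deletion 30 days after the node was
  # provisioned (30 * 24 * 60 * 60 seconds, illustrative value).
  ttlSecondsUntilExpired: 2592000
```

Leaving either field out disables the corresponding method for nodes launched by this provisioner.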

Consolidation

Karpenter has two mechanisms for cluster consolidation:

  • Deletion - A node is eligible for deletion if all of its pods can run on free capacity of other nodes in the cluster.
  • Replace - A node can be replaced if all of its pods can run on a combination of free capacity of other nodes in the cluster and a single cheaper replacement node.

To identify a consolidation action, Karpenter tries three strategies in order:

  1. Empty Node Consolidation - Delete any entirely empty nodes in parallel
  2. Multi-Node Consolidation - Try to delete two or more nodes in parallel, possibly launching a single replacement that is cheaper than the price of all nodes being removed
  3. Single-Node Consolidation - Try to delete any single node, possibly launching a single replacement that is cheaper than the price of that node

It’s impractical to examine all possible consolidation options for multi-node consolidation, so Karpenter uses a heuristic to identify a likely set of nodes that can be consolidated. For single-node consolidation, Karpenter considers each node in the cluster individually.

When multiple nodes could be deleted or replaced, Karpenter chooses to consolidate the node whose removal disrupts your workloads the least, preferring to terminate:

  • nodes running fewer pods
  • nodes that will expire soon
  • nodes with lower priority pods
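
Consolidation is opt-in and is configured per provisioner. The following is a minimal sketch, assuming a v1alpha5 Provisioner named default and its spec.consolidation.enabled field. Note that this setting cannot be combined with ttlSecondsAfterEmpty on the same provisioner.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default   # assumed provisioner name
spec:
  consolidation:
    enabled: true   # nodes from this provisioner become consolidation candidates
```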

If consolidation is enabled, Karpenter periodically reports events against nodes that indicate why the node can’t be consolidated. These events can be used to investigate nodes that you expect to have been consolidated, but still remain in your cluster.

Events:
  Type     Reason                   Age                From             Message
  ----     ------                   ----               ----             -------
  Normal   Unconsolidatable         66s                karpenter        pdb default/inflate-pdb prevents pod evictions
  Normal   Unconsolidatable         33s (x3 over 30m)  karpenter        can't replace with a cheaper node
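
The first event above names a PodDisruptionBudget. As an illustration only, a PDB like the following blocks every voluntary eviction of the pods it selects, which would keep their nodes from being consolidated. The name and namespace come from the event message; the app: inflate label and the maxUnavailable value are assumptions, since the event does not show the actual PDB spec.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inflate-pdb
  namespace: default
spec:
  # Assumed value: with zero pods allowed to be unavailable,
  # no eviction request against the selected pods can succeed.
  maxUnavailable: 0
  selector:
    matchLabels:
      app: inflate   # hypothetical label
```

Relaxing the budget (for example, allowing one unavailable pod) lets Karpenter drain the node one pod at a time.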

Interruption

If interruption-handling is enabled, Karpenter will watch for upcoming involuntary interruption events that would cause disruption to your workloads. These interruption events include:

  • Spot Interruption Warnings
  • Scheduled Change Health Events (Maintenance Events)
  • Instance Terminating Events
  • Instance Stopping Events

When Karpenter detects that one of these events will affect your nodes, it automatically cordons, drains, and terminates the node(s) ahead of the interruption event to give the maximum amount of time for workload cleanup prior to compute disruption. This helps in scenarios where the terminationGracePeriod for your workloads is long or cleanup for your workloads is critical, and you want enough time to gracefully clean up your pods.

Karpenter enables this feature by watching an SQS queue which receives critical events from AWS services which may affect your nodes. Karpenter requires that an SQS queue be provisioned and EventBridge rules and targets be added that forward interruption events from AWS services to the SQS queue. Karpenter provides details for provisioning this infrastructure in the CloudFormation template in the Getting Started Guide.

To enable interruption handling, set the following key in the karpenter-global-settings ConfigMap to the name of the SQS queue that receives interruption events.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  ...
  aws.interruptionQueueName: karpenter-cluster
  ...

Drift

If drift is enabled, Karpenter will deprovision nodes that have been marked as drifted with the annotation karpenter.sh/voluntary-disruption: "drifted". Karpenter will automatically cordon, drain, and terminate nodes, while respecting any PDBs or do-not-evict pods that are configured. Karpenter will automatically mark nodes as drifted if the AMI that is used on the instance does not match the AMI set by the AWSNodeTemplate. Check the AWSNodeTemplate Docs settings for more.

If users annotate their own nodes with karpenter.sh/voluntary-disruption: "drifted", Karpenter will respect the annotation and deprovision the nodes.
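
For illustration, a manually annotated node would carry metadata like the following. This is a sketch only: the node name is hypothetical, and in practice the annotation is added to an existing Node object rather than created from a manifest.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.example.internal   # hypothetical node name
  annotations:
    karpenter.sh/voluntary-disruption: "drifted"
```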

To enable the drift feature flag, refer to the Settings Feature Gates.
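
As a rough sketch, and assuming the gate is exposed as a featureGates.driftEnabled key in the same karpenter-global-settings ConfigMap used for interruption handling, enabling drift might look like this. Confirm the exact key name against the Settings page for your Karpenter version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  # Assumed key name; check the Settings Feature Gates documentation.
  featureGates.driftEnabled: "true"
```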

Disabling Deprovisioning

Pod Eviction

Pods can be opted out of eviction by setting the annotation karpenter.sh/do-not-evict: "true" on the pod. This is useful for pods that you want to run from start to finish without interruption. Examples might include a real-time, interactive game that you don’t want to interrupt, or a long batch job (such as you might have with machine learning) that would need to start over if it were interrupted.

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"

By opting pods out of eviction, you are telling Karpenter that it should not voluntarily remove nodes containing this pod.

However, if a do-not-evict pod is added to a node while the node is draining, the remaining pods will still evict, but that pod will block termination until it is removed. In either case, the node will be cordoned to prevent additional work from scheduling.

Examples of voluntary node removal that will be prevented by this annotation include:

  • Consolidation
  • Drift
  • Emptiness
  • Expiration

This annotation will have no effect for static pods, pods that tolerate NoSchedule, or pods terminating past their graceful termination period.

Node Consolidation

Nodes can be opted out of consolidation deprovisioning by setting the annotation karpenter.sh/do-not-consolidate: "true" on the node.

apiVersion: v1
kind: Node
metadata:
  annotations:
    karpenter.sh/do-not-consolidate: "true"

Example: Disable Consolidation on Provisioner

The Provisioner .spec.annotations field lets you set annotations that are applied to all nodes launched by that provisioner. By setting the annotation karpenter.sh/do-not-consolidate: "true" on the provisioner, you prevent all nodes launched by this Provisioner from being considered in consolidation calculations.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  annotations: # will be applied to all nodes
    karpenter.sh/do-not-consolidate: "true"
Last modified February 21, 2023 : chore: Release v0.25.0 (#3430) (5f881099)