Troubleshooting
Controller
Enable debug logging
This can be done by updating the env variable LOG_LEVEL
Karpenter deployment and then restarting the Karpenter deployment.
You can also enable debug logging during installation with Helm by setting the option logLevel
.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--set logLevel=debug \
...
Installation
Missing Service Linked Role
Unless your AWS account has already onboarded to EC2 Spot, you will need to create the service linked role to avoid ServiceLinkedRoleCreationNotPermitted
.
AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances
This can be resolved by creating the Service Linked Role.
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
Failed Resolving STS Credentials with I/O Timeout
Checking EC2 API connectivity, WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.us-east-1.amazonaws.com/\": dial tcp: lookup sts.us-east-1.amazonaws.com: i/o timeout
If you see the error above when you attempt to install Karpenter, this indicates that Karpenter is unable to reach out to the STS endpoint due to failed DNS resolution. This can happen when Karpenter is running with dnsPolicy: ClusterFirst
and your in-cluster DNS service is not yet running.
You have two mitigations to resolve this error:
- Let Karpenter manage your in-cluster DNS service - You can let Karpenter manage your DNS application pods’ capacity by changing Karpenter’s
dnsPolicy
to beDefault
(run--set dnsPolicy=Default
with a Helm installation). This ensures that Karpenter reaches out to the VPC DNS service when running its controllers, allowing Karpenter to start-up without the DNS application pods running, enabling Karpenter to manage the capacity for these pods. - Let MNG/Fargate manage your in-cluster DNS service - If running a cluster with MNG, ensure that your group has enough capacity to support the DNS application pods and ensure that the application has the correct tolerations to schedule against the capacity. If running a cluster with Fargate, ensure that you have a fargate profile that selects against your DNS application pods.
Karpenter Role names exceeding 64-character limit
If you use a tool such as AWS CDK to generate your Kubernetes cluster name, when you add Karpenter to your cluster you could end up with a cluster name that is too long to incorporate into your KarpenterNodeRole name (which is limited to 64 characters).
Node role names for Karpenter are created in the form KarpenterNodeRole-${Cluster_Name}
in the Create the KarpenterNode IAM Role section of the getting started guide.
If a long cluster name causes the Karpenter node role name to exceed 64 characters, creating that object will fail.
Keep in mind that KarpenterNodeRole-
is just a recommendation from the getting started guide.
Instead of using the eksctl role, you can shorten the name to anything you like, as long as it has the right permissions.
Unknown field in NodePool or EC2NodeClass spec
If you are upgrading from an older version of Karpenter, there may have been changes in the CRD between versions. Attempting to utilize newer functionality which is surfaced in newer versions of the CRD may result in the following error message:
Error from server (BadRequest): error when creating "STDIN": NodePool in version "v1" cannot be handled as a NodePool: strict decoding error: unknown field "spec.template.spec.nodeClassRef.foo"
If you see this error, you can solve the problem by following the Custom Resource Definition Upgrade Guidance.
Info on whether there has been a change to the CRD between versions of Karpenter can be found in the Release Notes
Unable to schedule pod due to insufficient node group instances
0.16.0
changed the default replicas from 1 to 2.
Karpenter won’t launch capacity to run itself (log related to the karpenter.sh/nodepool DoesNotExist requirement
)
so it can’t provision for the second Karpenter pod.
To solve this you can either reduce the replicas back from 2 to 1, or ensure there is enough capacity that isn’t being managed by Karpenter to run both pods.
To do so on AWS increase the minimum
and desired
parameters on the node group autoscaling group to launch at lease 2 instances.
Helm Error When Pulling the Chart
If Helm is showing an error when trying to install Karpenter Helm charts:
- Ensure you are using a newer Helm version, Helm started supporting OCI images since
3.8.0
. - Helm does not have an
helm repo add
concept in OCI, so to install Karpenter you no longer need this. - If you get an error like
Error: public.ecr.aws/karpenter/karpenter:0.34.0: not found
make sure you’re adding av
prefix for Karpenter versions between0.17.0
&0.34.x
. - Verify that the image you are trying to pull actually exists in gallery.ecr.aws/karpenter
- Sometimes Helm generates a generic error, you can add the –debug switch to any of the Helm commands in this doc for more verbose error messages
- If you are getting a 403 forbidden error, you can try
docker logout public.ecr.aws
as explained here.
Helm Error when installing the karpenter-crd
chart
Karpenter 0.26.1
introduced the karpenter-crd
Helm chart. When installing this chart on your cluster, if you have previously added the Karpenter CRDs to your cluster through the karpenter
controller chart or through kubectl replace
, Helm will reject the install of the chart due to invalid ownership metadata
.
- In the case of
invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"
run:
kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh app.kubernetes.io/managed-by=Helm --overwrite
- In the case of
annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "karpenter"
run:
KARPENTER_NAMESPACE=kube-system
kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-name=karpenter-crd --overwrite
kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-namespace="${KARPENTER_NAMESPACE}" --overwrite
Uninstallation
Unable to delete nodes after uninstalling Karpenter
Karpenter adds a finalizer to nodes that it provisions to support graceful node termination. If Karpenter is uninstalled, these finalizers will cause the API Server to block deletion until the finalizers are removed.
You can fix this by patching the node objects:
kubectl edit node <node_name>
and remove the line that sayskarpenter.sh/termination
in the finalizers field.- Run the following script that gets all nodes with the finalizer and removes all the finalizers from those nodes.
- NOTE: this will remove ALL finalizers from nodes with the karpenter finalizer.
kubectl get nodes -ojsonpath='{range .items[*].metadata}{@.name}:{@.finalizers}{"\n"}' | grep "karpenter.sh/termination" | cut -d ':' -f 1 | xargs kubectl patch node --type='json' -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
Provisioning
Instances with swap volumes fail to register with control plane
Some instance types (c1.medium and m1.small) are given limited amount of memory (see Instance Store swap volumes). They are subsequently configured to use a swap volume, which will cause the kubelet to fail on launch. The following error can be seen in the systemd logs:
"command failed" err="failed to run Kubelet: running with swap on is not supported, please disable swap!..."
Solutions
Disabling swap will allow kubelet to join the cluster successfully, however users should be mindful of performance, and consider adjusting the NodePool requirements to use larger instance types.
DaemonSets can result in deployment failures
For Karpenter versions 0.5.3
and earlier, DaemonSets were not properly considered when provisioning nodes.
This sometimes caused nodes to be deployed that could not meet the needs of the requested DaemonSets and workloads.
This issue no longer occurs after Karpenter version 0.5.3
(see PR #1155).
If you are using a pre 0.5.3
version of Karpenter, one workaround is to set your NodePool to only use larger instance types that you know will be big enough for the DaemonSet and the workload.
For more information, see Issue #1084.
Examples of this behavior are included in Issue #1180.
Unspecified resource requests cause scheduling/bin-pack failures
Not using the Kubernetes LimitRanges feature to enforce minimum resource request sizes will allow pods with very low or non-existent resource requests to be scheduled. This can cause issues as Karpenter bin-packs pods based on the resource requests.
If the resource requests do not reflect the actual resource usage of the pod, Karpenter will place too many of these pods onto the same node resulting in the pods getting CPU throttled or terminated due to the OOM killer.
This behavior is not unique to Karpenter and can also occur with the standard kube-scheduler
with pods that don’t have accurate resource requests.
To prevent this, you can set LimitRanges on pod deployments on a per-namespace basis. See the Karpenter Best Practices Guide for further information on the use of LimitRanges.
Pods using Security Groups for Pods stuck in “ContainerCreating” state for up to 30 minutes before transitioning to “Running”
When leveraging Security Groups for Pods, Karpenter will launch nodes as expected but pods will be stuck in “ContainerCreating” state for up to 30 minutes before transitioning to “Running”.
This is related to an interaction between Karpenter and the amazon-vpc-resource-controller when a pod requests vpc.amazonaws.com/pod-eni
resources.
More info can be found in issue #1252.
To workaround this problem, add the vpc.amazonaws.com/has-trunk-attached: "false"
label in your Karpenter NodePool spec and ensure instance-type requirements include instance-types which support ENI trunking.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template
metadata:
labels:
vpc.amazonaws.com/has-trunk-attached: "false"
Pods using PVCs can hit volume limits and fail to scale-up
When attempting to schedule a large number of pods with PersistentVolumes, it’s possible that these pods will co-locate on the same node. Pods will report the following errors in their events using a kubectl describe pod
call
Warning FailedAttachVolume pod/example-pod AttachVolume.Attach failed for volume "***" : rpc error: code = Internal desc = Could not attach volume "***" to node "***": attachment of disk "***" failed, expected device to be attached but was attaching
Warning FailedMount pod/example-pod Unable to attach or mount volumes: unmounted volumes=[***], unattached volumes=[***]: timed out waiting for the condition
In this case, Karpenter may fail to scale-up your nodes due to these pods due to one of the following reasons:
Pods were not scheduled but Karpenter couldn’t discover limits
Karpenter does not support in-tree storage plugins to provision PersistentVolumes, since nearly all of the in-tree plugins have been deprecated in upstream Kubernetes. This means that, if you are using a statically-provisioned PersistentVolume that references a volume source like AWSElasticBlockStore
or a dynamically-provisioned PersistentVolume that references a StorageClass with a in-tree storage plugin provisioner like kubernetes.io/aws-ebs
, Karpenter will fail to discover the maxiumum volume attachments for the node. Instead, Karpenter may think the node still has more schedulable space due to memory and cpu constraints when there is really no more schedulable space on the node due to volume limits. When Karpenter sees you are using an in-tree storage plugin on your pod volumes, it will print the following error message into the logs. If you see this message, upgrade your StorageClasses and statically-provisioned PersistentVolumes to use the latest CSI drivers for your cloud provider.
2023-04-05T23:56:53.363Z ERROR controller.node_state PersistentVolume source 'AWSElasticBlockStore' uses an in-tree storage plugin which is unsupported by Karpenter and is deprecated by Kubernetes. Scale-ups may fail because Karpenter will not discover driver limits. Use a PersistentVolume that references the 'CSI' volume source for Karpenter auto-scaling support. {"commit": "b2af562", "node": "ip-192-168-36-137.us-west-2.compute.internal", "pod": "inflate0-6c4bdb8b75-7qmfd", "volume": "mypd", "persistent-volume": "pvc-11db7489-3c6e-46f3-a958-91f9d5009d41"}
2023-04-05T23:56:53.464Z ERROR controller.node_state StorageClass .spec.provisioner uses an in-tree storage plugin which is unsupported by Karpenter and is deprecated by Kubernetes. Scale-ups may fail because Karpenter will not discover driver limits. Create a new StorageClass with a .spec.provisioner referencing the CSI driver plugin name 'ebs.csi.aws.com'. {"commit": "b2af562", "node": "ip-192-168-36-137.us-west-2.compute.internal", "pod": "inflate0-6c4bdb8b75-7qmfd", "volume": "mypd", "storage-class": "gp2", "provisioner": "kubernetes.io/aws-ebs"}
Pods were scheduled due to a race condition in Kubernetes
Due to this race condition in Kubernetes, it’s possible that the scheduler and the CSINode can race during node registration such that the scheduler assumes that a node can mount more volumes than the node attachments support. There is currently no universal solve for this problem other than enforcing topologySpreadConstraints
and podAntiAffinity
on your workloads that use PVCs such that you attempt to reduce the number of PVCs that schedule to a given node.
The following is a list of known CSI drivers which support a startupTaint to eliminate this issue:
These taints should be configured via startupTaints
on your NodePool
. For example, to enable this for EBS, add the following to your NodePool
:
apiVersion: karpenter.sh/v1
kind: NodePool
spec:
template:
spec:
startupTaints:
- key: ebs.csi.aws.com/agent-not-ready
effect: NoExecute
CNI is unable to allocate IPs to pods
Note: This troubleshooting guidance is specific to the VPC CNI that is shipped by default with EKS clusters. If you are using a custom CNI, some of this guidance may not apply to your cluster.
Whenever a new pod is assigned to a node, the CNI will assign an IP address to that pod (assuming it isn’t using host networking), allowing it to communicate with other pods on the cluster. It’s possible for this IP allocation and assignment process to fail for a number of reasons. If this process fails, you may see an error similar to the one below.
time=2023-06-12T19:18:15Z type=Warning reason=FailedCreatePodSandBox from=kubelet message=Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0f46f3f1289eed7afab81b6945c49336ef556861fe5bb09a902a00772848b7cc": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
maxPods
is greater than the node’s supported pod density
By default, the number of pods on a node is limited by both the number of networking interfaces (ENIs) that may be attached to an instance type and the number of IP addresses that can be assigned to each ENI. See IP addresses per network interface per instance type for a more detailed information on these instance types’ limits.
If the max-pods (configured through your EC2NodeClass kubeletConfiguration
) is greater than the number of supported IPs for a given instance type, the CNI will fail to assign an IP to the pod and your pod will be left in a ContainerCreating
state.
If you’ve enabled Security Groups per Pod, one of the instance’s ENIs is reserved as the trunk interface and uses branch interfaces off of that trunk interface to assign different security groups.
If you do not have any SecurityGroupPolicies
configured for your pods, they will be unable to utilize branch interfaces attached to the trunk interface, and IPs will only be available from the non-trunk ENIs.
This effectively reduces the max-pods value by the number of IPs that would have been available from the trunk ENI.
Note that Karpenter is not aware if Security Groups per Pod is enabled, and will continue to compute max-pods assuming all ENIs on the instance can be utilized.
Solutions
To avoid this discrepancy between maxPods
and the supported pod density of the EC2 instance based on ENIs and allocatable IPs, you can perform one of the following actions on your cluster:
- Enable Prefix Delegation to increase the number of allocatable IPs for the ENIs on each instance type
- Reduce your
maxPods
value to be under the maximum pod density for the instance types assigned to your NodePods - Remove the
maxPods
value from yourkubeletConfiguration
if you no longer need it and instead rely on the defaulted values from Karpenter and EKS AMIs. - Set RESERVED_ENIS=1 in your Karpenter configuration to account for the reserved ENI when using Security Groups for Pods.
For more information on pod density, view the Pod Density Section in the NodePools doc.
IP exhaustion in a subnet
When a node is launched by Karpenter, it is assigned to a subnet within your VPC based on the subnetSelector
value in your EC2NodeClass
). When a subnet becomes IP address constrained, EC2 may think that it can successfully launch an instance in the subnet; however, when the CNI tries to assign IPs to the pods, there are none remaining. In this case, your pod will stay in a ContainerCreating
state until an IP address is freed in the subnet and the CNI can assign one to the pod.
Solutions
- Use
topologySpreadConstraints
ontopology.kubernetes.io/zone
to spread your pods and nodes more evenly across zones - Increase the IP address space (CIDR) for the subnets selected by your
EC2NodeClass
- Use custom networking to assign separate IP address spaces to your pods and your nodes
- Run your EKS cluster on IPv6 (Note: IPv6 clusters have some known limitations which should be well-understood before choosing to use one)
For more troubleshooting information on why your pod may have a FailedCreateSandbox
error, view the EKS CreatePodSandbox Knowledge Center Post.
Windows pods are failing with FailedCreatedPodSandbox
The following solution(s) may resolve your issue if you are seeing an error similar to the following when attempting to launch Windows pods. This error typically occurs if you have not enabled Windows support.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": plugin type="vpc-bridge" name="vpc" failed (add): failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
Solutions
- See Enabling Windows support for instructions on how to enable Windows support.
Windows pods fail to launch with image pull error
The following solution(s) may resolve your issue if you are seeing an error similar to the following when attempting to launch Windows pods.
Failed to pull image "mcr.microsoft.com/windows/servercore:xxx": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/windows/servercore:xxx": no match for platform in manifest: not found
This error typically occurs in a scenario whereby a pod with a given container OS version attempts to be scheduled on an incompatible Windows host OS version. Windows requires the host OS version to match the container OS version.
Solutions
- Define your pod’s
nodeSelector
to ensure that your containers are scheduled on a compatible OS host version. To learn more, see Windows container version compatibility.
Windows pods unable to resolve DNS
Causes for DNS resolution failure may vary, but in the case where DNS resolution is working for Linux pods but not for Windows pods, then the following solution(s) may resolve your issue.
Solution(s)
- Verify that the instance role of the Windows node includes the RBAC permission group
eks:kube-proxy-windows
as shown below. This group is required for Windows nodes because in Windows,kube-proxy
runs as a process on the node, and as such, the node requires the necessary RBAC cluster permissions to allow access to the resources required bykube-proxy
. For more information, see https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html.
...
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
- eks:kube-proxy-windows # This is required for Windows DNS resolution to work
...
Karpenter incorrectly computes available resources for a node
When creating nodes, the allocatable resources Karpenter computed (as seen in logs and nodeClaim.status.allocatable
) do not always match the allocatable resources on the created node (node.status.allocatable
) due to some amount of memory being reserved for the hypervisor and underlying OS.
Karpenter uses the results from ec2:DescribeInstanceTypes
along with a cache for tracking observed memory capacity to determine the resources available on a node launched with a given instance type.
The following computation is used to determine allocatable CPU, memory, and ephemeral storage based on the results returned from ec2:DescribeInstanceTypes
.
### cpu
nodeClaim.allocatable.cpu = instance.cpu - kubeReserved.cpu - systemReserved.cpu
### memory
# If first time launching this instance-type + AMI pair
nodeClaim.allocatable.memory = (instance.memory * (1.0 - VM_MEMORY_OVERHEAD_PERCENT)) - kubeReserved.memory - systemReserved.memory - max(evictionSoft.memory.available, evictionHard.memory.available)
# For subsequent nodes where cached instance-type capacity is available
nodeClaim.allocatable.memory = ( cached.instance.memory - kubeReserved.memory - systemReserved.memory - max(evictionSoft.memory.available, evictionHard.memory.available)
### ephemeral-storage
nodeClaim.allocatable.ephemeralStorage = instance.storage - kubeReserved.ephemeralStorage - systemReserved.ephemeralStorage - max(evictionSoft.nodefs.available, evictionHard.nodefs.available)
Most of these factors directly model user configuration (i.e. the KubeletConfiguration options).
On the other hand, VM_MEMORY_OVERHEAD_PERCENT
models an implicit reduction of available memory that varies by instance type and AMI.
However, once a node is created, the actual memory capacity on that node (node.status.capacity.memory) is checked by the controller. The controller caches the observed memory for any subsequent nodes launched with the same AMI and instance type pair, improving accuracy for future nodes.
For new combinations of AMI and instance type (i.e., when this pair is launched for the first time), Karpenter will still use the VM_MEMORY_OVERHEAD_PERCENT value as a fallback for estimating allocatable memory.
This fallback is necessary because Karpenter can’t compute the exact value being modeled ahead of time, so VM_MEMORY_OVERHEAD_PERCENT
is a global setting used across all instance type and AMI combinations.
The default value (7.5%
) has been tuned to closely match reality for the majority of instance types while not overestimating.
As a result, Karpenter will typically underestimate the memory available on a node for a given instance type.
If you know the real VM_MEMORY_OVERHEAD_PERCENT
for the specific instances you’re provisioning in your cluster, you can tune this value to tighten the bound.
However, this should be done with caution.
A VM_MEMORY_OVERHEAD_PERCENT
which results in Karpenter overestimating the memory available on a node can result in Karpenter launching nodes which are too small for your workload.
To detect instances of Karpenter overestimating resource availability, the following status condition can be monitored:
$ kg nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.conditions[?(@.type=="ConsistentStateFound")]}'
{
"type": "ConsistentStateFound",
"status": "False",
"reason": "ConsistencyCheckFailed",
"message": "Consistency Check Failed",
"lastTransitionTime": "2024-08-19T20:02:16Z"
}
This can be spot checked like shown above, or monitored via the following metric:
operator_status_condition_count{type="ConsistentStateFound",kind="NodeClaim",status="False"}
Karpenter Is Unable to Satisfy Topology Spread Constraint
When scheduling pods with TopologySpreadConstraints, Karpenter will attempt to spread the pods across all eligible domains. Eligible domains are determined based on the pod’s requirements, e.g. node affinity terms. However, pod’s do not inherit the requirements of compatible NodePools.
For example, consider the following NodePool and Deployment specs:
appVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: topology.kubernetes.io/zone
operator: Exists
---
appVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: np-zonal-constraint
labels:
project: zone-specific-project
spec:
template:
spec:
requirements:
- key: topology.kubernetes.io/zone
operator: In
values: ['us-east-1a', 'us-east-1b']
# ...
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: inflate
spec:
replicas: 3
selector:
matchLabels:
app: inflate
template:
metadata:
labels:
app: inflate
spec:
nodeSelector:
project: zone-specific-project
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: inflate
This cluster has subnets in three availability zones: us-east-1a
, us-east-1b
, and us-east-1c
.
NodePool default
can launch instance types in all three zones, but np-zonal-constraint
is constrained to two.
Since Karpenter uses the pod’s requirements to derive eligible domains, and the pod does not have any zonal constraints, all three availability zones are considered eligible domains.
However, the only NodePool compatible with the pod’s requirements is np-zonal-constraints
, which can only create instances in two of the three eligible domains.
Karpenter will succeed to launch the first two instances, for the first two replicas, but will fail to provision capacity for subsequent replicas since it can’t provision capacity in the third domain.
In order to prevent these scenarios, you should ensure that all eligible domains for a pod can be provisioned by compatible NodePools, or constrain the pod such that it’s eligble domains match those of the NodePools.
To resolve this specific issue, zonal constraints should be added to the pod spec to match the requirements of np-zonal-constraint
:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: ['us-east-1a', 'us-east-1b']
Deprovisioning
Nodes not deprovisioned
There are a few cases where requesting to deprovision a Karpenter node will fail or will never be attempted. These cases are outlined below in detail.
Initialization
Karpenter determines the nodes that it can begin to consider for deprovisioning by looking at the karpenter.sh/initialized
node label. If this node label is not set on a Node, Karpenter will not consider it for any automatic deprovisioning. For more details on what may be preventing nodes from being initialized, see Nodes not initialized.
Disruption budgets
Karpenter respects Pod Disruption Budgets (PDBs) by using a backoff retry eviction strategy. Pods will never be forcibly deleted, so pods that fail to shut down will prevent a node from deprovisioning. Kubernetes PDBs let you specify how much of a Deployment, ReplicationController, ReplicaSet, or StatefulSet must be protected from disruptions when pod eviction requests are made.
PDBs can be used to strike a balance by protecting the application’s availability while still allowing a cluster administrator to manage the cluster.
Here is an example where the pods matching the label myapp
will block node termination if evicting the pod would reduce the number of available pods below 4.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: myapp-pdb
spec:
minAvailable: 4
selector:
matchLabels:
app: myapp
You can set minAvailable
or maxUnavailable
as integers or as a percentage.
Review what disruptions are, and how to configure them.
karpenter.sh/do-not-disrupt
Annotation
If a pod exists with the annotation karpenter.sh/do-not-disrupt: true
on a node, and a request is made to delete the node, Karpenter will not drain any pods from that node or otherwise try to delete the node. Nodes that have pods with a do-not-disrupt
annotation are not considered for consolidation, though their unused capacity is considered for the purposes of running pods from other nodes which can be consolidated.
If you want to terminate a node with a do-not-disrupt
pod, you can simply remove the annotation and the deprovisioning process will continue.
Scheduling Constraints (Consolidation Only)
Consolidation will be unable to consolidate a node if, as a result of its scheduling simulation, it determines that the pods on a node cannot run on other nodes due to inter-pod affinity/anti-affinity, topology spread constraints, or some other scheduling restriction that couldn’t be fulfilled.
Node Launch/Readiness
Node not created
In some circumstances, Karpenter controller can fail to start up a node. For example, providing the wrong block storage device name in a custom launch template can result in a failure to start the node and an error similar to:
2022-01-19T18:22:23.366Z ERROR controller.provisioning Could not launch node, launching instances, with fleet error(s), InvalidBlockDeviceMapping: Invalid device name /dev/xvda; ...
You can see errors like this by viewing Karpenter controller logs:
kubectl get pods -A | grep karpenter
karpenter karpenter-XXXX 2/2 Running 2 21d
kubectl logs karpenter-XXXX -c controller -n karpenter | less
Nodes not initialized
Karpenter uses node initialization to understand when to begin using the real node capacity and allocatable details for scheduling. It also utilizes initialization to determine when it can being consolidating nodes managed by Karpenter.
Karpenter determines node initialization using three factors:
- Node readiness
- Expected resources are registered
- NodePool startup taints are removed
Node Readiness
Karpenter checks the Ready
condition type and expects it to be True
.
To see troubleshooting around what might be preventing nodes from becoming ready, see Node NotReady
Expected resources are registered
Karpenter pull instance type information, including all expected resources that should register to your node. It then expects all these resources to properly register to a non-zero quantity in node .status.allocatable
.
Common resources that don’t register and leave nodes in a non-initialized state:
nvidia.com/gpu
(or any gpu-based resource): A GPU instance type that supports thenvidia.com/gpu
resource is launched but the daemon/daemonset to register the resource on the node doesn’t existvpc.amazonaws.com/pod-eni
: An instance type is launched by theENABLE_POD_ENI
value is set tofalse
in thevpc-cni
plugin. Karpenter will expect that thevpc.amazonaws.com/pod-eni
will be registered, but it never will.
NodePool startup taints are removed
Karpenter expects all startup taints specified in .spec.template.spec.startupTaints
of the NodePool to be completely removed from node .spec.taints
before it will consider the node initialized.
Node NotReady
There are cases where the node starts, but fails to join the cluster and is marked “Node NotReady”. Reasons that a node can fail to join the cluster include:
- Permissions
- Security Groups
- Networking
The easiest way to start debugging is to connect to the instance and get the Kubelet logs. For an AL2 based node:
# List the nodes managed by Karpenter
kubectl get node -l karpenter.sh/nodepool
# Extract the instance ID (replace <node-name> with a node name from the above listing)
INSTANCE_ID=$(kubectl get node <node-name> -ojson | jq -r ".spec.providerID" | cut -d \/ -f5)
# Connect to the instance
aws ssm start-session --target $INSTANCE_ID
# Check Kubelet logs
sudo journalctl -u kubelet
For Bottlerocket, you’ll need to get access to the root filesystem:
# List the nodes managed by Karpenter
kubectl get node -l karpenter.sh/nodepool
# Extract the instance ID (replace <node-name> with a node name from the above listing)
INSTANCE_ID=$(kubectl get node <node-name> -ojson | jq -r ".spec.providerID" | cut -d \/ -f5)
# Connect to the instance
aws ssm start-session --target $INSTANCE_ID
# Enter the admin container
enter-admin-container
# Check Kubelet logs
journalctl -D /.bottlerocket/rootfs/var/log/journal -u kubelet.service
Here are examples of errors from Node NotReady issues that you might see from journalctl
:
The runtime network not being ready can reflect a problem with IAM role permissions:
KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
See Amazon EKS node IAM role for details. If you’re using
eksctl
, the VPC CNI pods may be given permissions through IRSA instead. Verify that this set up is working as intended. You can also look at the logs for your CNI plugin from theaws-node
pod:kubectl get pods -n kube-system | grep aws-node
aws-node-????? 1/1 Running 2 20d
kubectl logs aws-node-????? -n kube-system
Not being able to register the node with the Kubernetes API server indicates an error condition like the following:
Attempting to register node" node="ip-192-168-67-130.ec2.internal" Unable to register node with API server" err="Unauthorized" node="ip-192-168-67-130.ec2.internal" Error getting node" err="node \"ip-192-168-67-130.ec2.internal\" not found Failed to contact API server when waiting for CSINode publishing: Unauthorized
Check the ConfigMap to check whether or not the correct node role is there. For example:
kubectl get configmaps -n kube-system aws-auth -o yaml
apiVersion: v1 data: mapRoles: | - groups: - system:bootstrappers - system:nodes rolearn: arn:aws:iam::973227887653:role/eksctl-johnw-karpenter-demo-NodeInstanceRole-72CV61KQNOYS username: system:node:{{EC2PrivateDNSName}} - groups: - system:bootstrappers - system:nodes rolearn: arn:aws:iam::973227887653:role/KarpenterNodeRole-johnw-karpenter-demo username: system:node:{{EC2PrivateDNSName}} mapUsers: | [] kind: ConfigMap ...
If you are not able to resolve the Node NotReady issue on your own, run the EKS Logs Collector (if it’s an EKS optimized AMI) and look in the following places in the log:
- Your UserData (in
/var_log/cloud-init-output.log
and/var_log/cloud-init.log
) - Your kubelets (
/kubelet/kubelet.log
) - Your networking pod logs (
/var_log/aws-node
)
Reach out to the Karpenter team on Slack or GitHub if you are still stuck.
Nodes stuck in pending and not running the kubelet due to outdated CNI
If you have an EC2 instance get launched that is stuck in pending and ultimately not running the kubelet, you may see a message like this in your /var/log/user-data.log
:
No entry for c6i.xlarge in /etc/eks/eni-max-pods.txt
This means that your CNI plugin is out of date. You can find instructions on how to update your plugin here.
Node terminates before ready on failed encrypted EBS volume
If you are using a custom launch template and an encrypted EBS volume, the IAM principal launching the node may not have sufficient permissions to use the KMS customer managed key (CMK) for the EC2 EBS root volume. This issue also applies to Block Device Mappings specified in the EC2NodeClass. In either case, this results in the node terminating almost immediately upon creation.
Keep in mind that it is possible that EBS Encryption can be enabled without your knowledge. EBS encryption could have been enabled by an account administrator or by default on a per region basis. See Encryption by default for details.
To correct the problem if it occurs, you can use the approach that AWS EBS uses, which avoids adding particular roles to the KMS policy. Below is an example of a policy applied to the KMS key:
[
{
"Sid": "Allow access through EBS for all principals in the account that are authorized to use EBS",
"Effect": "Allow",
"Principal": {
"AWS": "*"
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:CreateGrant",
"kms:DescribeKey"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:ViaService": "ec2.${AWS_REGION}.amazonaws.com",
"kms:CallerAccount": "${AWS_ACCOUNT_ID}"
}
}
},
{
"Sid": "Allow direct access to key metadata to the account",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::${AWS_ACCOUNT_ID}:root"
},
"Action": [
"kms:Describe*",
"kms:Get*",
"kms:List*",
"kms:RevokeGrant"
],
"Resource": "*"
}
]
Node is not deleted, even though ttlSecondsUntilExpired
is set or the node is empty
This typically occurs when the node has not been considered fully initialized for some reason. If you look at the logs, you may see something related to an Inflight check failed for node...
that gives more information about why the node is not considered initialized.
Log message of inflight check failed for node, Expected resource "vpc.amazonaws.com/pod-eni" didn't register on the node
is reported
This error indicates that the vpc.amazonaws.com/pod-eni
resource was never reported on the node. You will need to make the corresponding change to the VPC CNI to enable security groups for pods which will cause the resource to be registered.
AWS Node Termination Handler (NTH) interactions
Karpenter doesn’t currently support draining and terminating on spot rebalance recommendations. Users who want support for both drain and terminate on spot interruption as well as drain and termination on spot rebalance recommendations may install Node Termination Handler (NTH) on their clusters to support this behavior.
These two components do not share information between each other, meaning if you have drain and terminate functionality enabled on NTH, NTH may remove a node for a spot rebalance recommendation. Karpenter will replace the node to fulfill the pod capacity that was being fulfilled by the old node; however, Karpenter won’t be aware of the reason that that node was terminated. This means that Karpenter may launch the same instance type that was just deprovisioned, causing a spot rebalance recommendation to be sent again. This can result in very short-lived instances where NTH continually removes nodes and Karpeneter re-launches the same instance type over and over again.
Karpenter doesn’t recommend reacting to spot rebalance recommendations when running Karpenter with spot nodes; however, if you absolutely require this functionality, note that the above scenario is possible. Spot instances are time limited and, therefore, interruptible. When a signal is sent by AWS, it triggers actions from NTH and Karpenter, where the former signals a shutdown and the later provisions, creating a recursive situation. This can be mitigated by either completely removing NTH or by setting the following values:
enableSpotInterruptionDraining: If false, do not drain nodes when the spot interruption termination notice is received. Only used in IMDS mode. enableSpotInterruptionDraining: false
enableRebalanceDrainin: If true, drain nodes when the rebalance recommendation notice is received. Only used in IMDS mode. enableRebalanceDraining: false
Pricing
Stale pricing data on isolated subnet
The following pricing-related error occurs if you are running Karpenter in an isolated private subnet (no Internet egress via IGW or NAT gateways):
ERROR controller.aws.pricing updating on-demand pricing, RequestError: send request failed
caused by: Post "https://api.pricing.us-east-1.amazonaws.com/": dial tcp 52.94.231.236:443: i/o timeout; RequestError: send request failed
caused by: Post "https://api.pricing.us-east-1.amazonaws.com/": dial tcp 52.94.231.236:443: i/o timeout, using existing pricing data from 2022-08-17T00:19:52Z {"commit": "4b5f953"}
This network timeout occurs because there is no VPC endpoint available for the Price List Query API..
To workaround this issue, Karpenter ships updated on-demand pricing data as part of the Karpenter binary; however, this means that pricing data will only be updated on Karpenter version upgrades.
To disable pricing lookups and avoid the error messages, set the AWS_ISOLATED_VPC
environment variable (or the --aws-isolated-vpc
option) to true.
See Environment Variables / CLI Flags for details.