Metrics

Inspect Karpenter Metrics

Karpenter exposes several metrics in Prometheus format for monitoring cluster provisioning status. By default, these metrics are available at `karpenter.karpenter.svc.cluster.local:8000/metrics`; the port is configurable via the `METRICS_PORT` environment variable.
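As a minimal sketch of inspecting these metrics programmatically, the following Python reads the Prometheus text format into `(name, labels, value)` tuples. The helper names `parse_metrics` and `fetch_metrics`, and the sample scrape in `EXAMPLE`, are illustrative assumptions rather than part of Karpenter; the parser is deliberately simple and may not cover every corner of the exposition format.

```python
import re
from urllib.request import urlopen

# Default endpoint quoted in the text; adjust host and port for your cluster.
METRICS_URL = "http://karpenter.karpenter.svc.cluster.local:8000/metrics"

_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Fetch and parse the endpoint (reachable only from inside the cluster network)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical scrape excerpt, for illustration only.
EXAMPLE = """
# TYPE karpenter_build_info gauge
karpenter_build_info{version="0.0.0-example"} 1
karpenter_cluster_state_node_count 3
"""

for sample in parse_metrics(EXAMPLE):
    print(sample)
```

In practice a Prometheus server would normally scrape this endpoint; a direct fetch like this is mainly useful for ad hoc inspection.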

`karpenter_build_info`

A metric with a constant ‘1’ value, labeled by the version from which Karpenter was built.

Nodepool Metrics

`karpenter_nodepool_usage`

The nodepool usage is the amount of resources that have been provisioned by a particular nodepool. Labeled by nodepool name and resource type.

`karpenter_nodepool_limit`

The nodepool limits are the limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type.
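Usage and limit are most informative when read together. The sketch below divides usage by limit per nodepool and resource type; the function name `nodepool_utilization`, the `(nodepool, resource_type)` key layout, and all values are assumptions made for illustration.

```python
def nodepool_utilization(usage, limit):
    """Return {(nodepool, resource_type): usage / limit} for each limited resource."""
    ratios = {}
    for key, limit_value in limit.items():
        if limit_value <= 0:
            continue  # skip zero limits to avoid dividing by zero
        ratios[key] = usage.get(key, 0.0) / limit_value
    return ratios


# Hypothetical samples of karpenter_nodepool_usage and karpenter_nodepool_limit,
# keyed by (nodepool name, resource type).
usage = {("default", "cpu"): 48.0, ("default", "memory"): 96 * 2**30}
limit = {("default", "cpu"): 64.0, ("default", "memory"): 256 * 2**30}

for key, ratio in nodepool_utilization(usage, limit).items():
    print(key, f"{ratio:.1%}")
```

A ratio approaching 1 suggests the nodepool is close to the point where its limits restrict further provisioning.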

Nodes Metrics

`karpenter_nodes_total_pod_requests`

Node total pod requests are the resources requested by non-DaemonSet pods bound to nodes.

`karpenter_nodes_total_pod_limits`

Node total pod limits are the resources specified by non-DaemonSet pod limits.

`karpenter_nodes_total_daemon_requests`

Node total daemon requests are the resources requested by DaemonSet pods bound to nodes.

`karpenter_nodes_total_daemon_limits`

Node total daemon limits are the resources specified by DaemonSet pod limits.

`karpenter_nodes_termination_time_seconds`

The time taken between a node’s deletion request and the removal of its finalizer.

`karpenter_nodes_terminated`

Number of nodes terminated in total by Karpenter. Labeled by owning nodepool.

`karpenter_nodes_system_overhead`

Node system overhead is the set of resources reserved for system overhead: the difference between the node’s capacity and allocatable values, as reported by the node’s status.
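The relationship described here is a per-resource subtraction. A small worked sketch, with a hypothetical helper `system_overhead` and hypothetical node values:

```python
def system_overhead(capacity, allocatable):
    """Per-resource overhead: capacity minus allocatable."""
    return {
        resource: capacity[resource] - allocatable.get(resource, 0)
        for resource in capacity
    }


# Hypothetical node status values (CPU in cores, memory in bytes).
capacity = {"cpu": 4.0, "memory": 16 * 2**30}
allocatable = {"cpu": 3.5, "memory": 15 * 2**30}

print(system_overhead(capacity, allocatable))
```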

`karpenter_nodes_leases_deleted`

Number of deleted leaked leases.

`karpenter_nodes_eviction_queue_depth`

The number of pods currently waiting for a successful eviction in the eviction queue.

`karpenter_nodes_created`

Number of nodes created in total by Karpenter. Labeled by owning nodepool.
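Because the created and terminated metrics are running totals, the useful quantity is usually the increase over a time window. The sketch below sums increases across successive scrapes and treats a drop in value as a counter reset (for example after a controller restart). It assumes both metrics behave as monotonically increasing counters; `counter_increase` and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across time-ordered counter samples, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero; count what it reached
    return total


# Hypothetical scrapes for one nodepool, oldest first.
created = [10, 12, 15, 2, 4]
terminated = [5, 5, 6, 0, 1]

net_change = counter_increase(created) - counter_increase(terminated)
print(net_change)
```

Unlike the `increase()` function in Prometheus, this sketch does not extrapolate to the window boundaries.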

`karpenter_nodes_allocatable`

Node allocatable are the resources allocatable by nodes.
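One possible way to combine the node metrics is to compare total requests (from DaemonSet and non-DaemonSet pods) against allocatable resources. The helper `request_ratio` and the values below are assumptions for illustration, not a calculation Karpenter itself reports.

```python
def request_ratio(pod_requests, daemon_requests, allocatable):
    """Fraction of a node's allocatable resource requested by all pods."""
    if allocatable <= 0:
        return None  # no allocatable capacity reported
    return (pod_requests + daemon_requests) / allocatable


# Hypothetical CPU values, in cores, for a single node.
ratio = request_ratio(pod_requests=2.0, daemon_requests=0.5, allocatable=4.0)
print(ratio)
```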

Pods Metrics

`karpenter_pods_state`

Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type, and pod phase.

`karpenter_pods_startup_time_seconds`

The time from pod creation until the pod is running.

Provisioner Metrics

`karpenter_provisioner_scheduling_simulation_duration_seconds`

Duration of scheduling simulations used for deprovisioning and provisioning in seconds.

`karpenter_provisioner_scheduling_queue_depth`

The number of pods currently waiting to be scheduled.

`karpenter_provisioner_scheduling_duration_seconds`

Duration of scheduling process in seconds.
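Duration metrics in Prometheus are commonly exposed as a histogram or summary, both of which publish `_sum` and `_count` series. Assuming that holds for the duration metrics here, an average over a window can be sketched from two scrapes. The function `average_duration` and the numbers are hypothetical.

```python
def average_duration(sum_before, count_before, sum_after, count_after):
    """Mean observed duration between two scrapes of the _sum and _count series."""
    observations = count_after - count_before
    if observations <= 0:
        return None  # nothing observed in the window, or the counter was reset
    return (sum_after - sum_before) / observations


# Hypothetical scrapes: 30 new scheduling runs taking 7.5 seconds in total.
print(average_duration(sum_before=12.5, count_before=50,
                       sum_after=20.0, count_after=80))
```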

Nodeclaims Metrics

`karpenter_nodeclaims_terminated`

Number of nodeclaims terminated in total by Karpenter. Labeled by reason the nodeclaim was terminated and the owning nodepool.

`karpenter_nodeclaims_registered`

Number of nodeclaims registered in total by Karpenter. Labeled by the owning nodepool.

`karpenter_nodeclaims_launched`

Number of nodeclaims launched in total by Karpenter. Labeled by the owning nodepool.

`karpenter_nodeclaims_initialized`

Number of nodeclaims initialized in total by Karpenter. Labeled by the owning nodepool.

`karpenter_nodeclaims_drifted`

Number of nodeclaim drift reasons detected in total by Karpenter. Labeled by drift type of the nodeclaim and the owning nodepool.

`karpenter_nodeclaims_disrupted`

Number of nodeclaims disrupted in total by Karpenter. Labeled by disruption type of the nodeclaim and the owning nodepool.

`karpenter_nodeclaims_created`

Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool.

Interruption Metrics

`karpenter_interruption_received_messages`

Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.

`karpenter_interruption_message_latency_time_seconds`

Length of time between message creation in queue and an action taken on the message by the controller.

`karpenter_interruption_deleted_messages`

Count of messages deleted from the SQS queue.

`karpenter_interruption_actions_performed`

Number of notification actions performed. Labeled by action.

Disruption Metrics

`karpenter_disruption_replacement_nodeclaim_initialized_seconds`

Amount of time required for a replacement nodeclaim to become initialized.

`karpenter_disruption_replacement_nodeclaim_failures_total`

The number of times that Karpenter failed to launch a replacement node for disruption. Labeled by disruption method.

`karpenter_disruption_queue_depth`

The number of commands currently being waited on in the disruption orchestration queue.

`karpenter_disruption_pods_disrupted_total`

Total number of reschedulable pods disrupted on nodes. Labeled by NodePool, disruption action, method, and consolidation type.

`karpenter_disruption_nodes_disrupted_total`

Total number of nodes disrupted. Labeled by NodePool, disruption action, method, and consolidation type.

`karpenter_disruption_evaluation_duration_seconds`

Duration of the disruption evaluation process in seconds. Labeled by method and consolidation type.

`karpenter_disruption_eligible_nodes`

Number of nodes eligible for disruption by Karpenter. Labeled by disruption method and consolidation type.

`karpenter_disruption_consolidation_timeouts_total`

Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type.

`karpenter_disruption_budgets_allowed_disruptions`

The number of nodes for a given NodePool that can be disrupted at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point.

`karpenter_disruption_actions_performed_total`

Number of disruption actions performed. Labeled by disruption action, method, and consolidation type.

Consistency Metrics

`karpenter_consistency_errors`

Number of consistency checks that have failed.

Cluster State Metrics

`karpenter_cluster_state_synced`

Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter’s cluster state.

`karpenter_cluster_state_node_count`

Current count of nodes in cluster state.

Cloudprovider Metrics

`karpenter_cloudprovider_instance_type_offering_price_estimate`

Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone.

`karpenter_cloudprovider_instance_type_offering_available`

Instance type offering availability, based on instance type, capacity type, and zone.
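Since the price estimate and availability metrics share the same dimensions (instance type, capacity type, and zone), they can be joined to find the cheapest offering that is currently available. The sketch assumes a positive availability value marks an available offering; `cheapest_available` and all prices shown are hypothetical.

```python
def cheapest_available(prices, available):
    """Return (offering, price) for the lowest-priced available offering."""
    candidates = {
        offering: price
        for offering, price in prices.items()
        if available.get(offering, 0.0) > 0  # assumption: > 0 means available
    }
    if not candidates:
        return None
    return min(candidates.items(), key=lambda item: item[1])


# Hypothetical samples keyed by (instance type, capacity type, zone).
prices = {
    ("m5.large", "on-demand", "us-west-2a"): 0.096,
    ("m5.large", "spot", "us-west-2a"): 0.035,
    ("m5.large", "spot", "us-west-2b"): 0.031,
}
available = {
    ("m5.large", "on-demand", "us-west-2a"): 1.0,
    ("m5.large", "spot", "us-west-2a"): 1.0,
    ("m5.large", "spot", "us-west-2b"): 0.0,
}

print(cheapest_available(prices, available))
```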

`karpenter_cloudprovider_instance_type_memory_bytes`

Memory, in bytes, for a given instance type.

`karpenter_cloudprovider_instance_type_cpu_cores`

Number of vCPU cores for a given instance type.

`karpenter_cloudprovider_errors_total`

Total number of errors returned from CloudProvider calls.

`karpenter_cloudprovider_duration_seconds`

Duration of cloud provider method calls. Labeled by the controller, method name and provider.

Cloudprovider Batcher Metrics

`karpenter_cloudprovider_batcher_batch_time_seconds`

Duration of the batching window per batcher.

`karpenter_cloudprovider_batcher_batch_size`

Size of the request batch per batcher.

Controller Runtime Metrics

`controller_runtime_reconcile_total`

Total number of reconciliations per controller.

`controller_runtime_reconcile_time_seconds`

Length of time per reconciliation per controller.

`controller_runtime_reconcile_errors_total`

Total number of reconciliation errors per controller.

`controller_runtime_max_concurrent_reconciles`

Maximum number of concurrent reconciles per controller.

`controller_runtime_active_workers`

Number of currently used workers per controller.
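The controller runtime metrics lend themselves to two simple per-controller ratios: errors over total reconciliations, and active workers over the concurrency maximum. The following sketch assumes each input has already been summed over any labels other than the controller name; `controller_health`, the controller name, and the values are hypothetical.

```python
def controller_health(reconciles, errors, active_workers, max_workers):
    """Per-controller error ratio and worker saturation."""
    report = {}
    for controller, total in reconciles.items():
        limit = max_workers.get(controller, 0.0)
        report[controller] = {
            "error_ratio": errors.get(controller, 0.0) / total if total else 0.0,
            "worker_saturation": (
                active_workers.get(controller, 0.0) / limit if limit else 0.0
            ),
        }
    return report


# Hypothetical values keyed by controller name.
report = controller_health(
    reconciles={"example.controller": 200.0},
    errors={"example.controller": 50.0},
    active_workers={"example.controller": 5.0},
    max_workers={"example.controller": 10.0},
)
print(report)
```

A saturation that stays near 1 suggests the controller is using all of its allowed workers and reconciliations may be queuing.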

Last modified June 7, 2024: chore: sync v1 staging branch with main (#6335) (331f1acb)