Horizontal Pod Autoscaler: Deep Dive
This document explains how the HPA scaling algorithm works, why resource requests are essential, and when to tune stabilization windows and scaling policies. It connects the demo manifests to the metrics pipeline and production autoscaling patterns.
The Scaling Algorithm
The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). On each iteration, it:
- Fetches current metric values for all pods in the target workload.
- Computes the desired replica count.
- Scales the target if the desired count differs from the current count.
The core formula:
```
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
```

Example from the Demo
The demo’s HPA targets 50% CPU utilization:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-burner
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
```

If the single pod is using 100% CPU against a target of 50%:
```
desiredReplicas = ceil(1 * (100 / 50)) = ceil(2.0) = 2
```

The HPA scales from 1 to 2 replicas. After scaling, if each of the 2 pods uses 50%, the ratio is 1.0 and no further scaling occurs.
If load increases and both pods hit 90%:
```
desiredReplicas = ceil(2 * (90 / 50)) = ceil(3.6) = 4
```

The HPA jumps to 4 replicas. The ceil() function always rounds up, biasing toward over-provisioning rather than under-provisioning.
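The formula can be sketched in a few lines of plain Python (illustrative only, not the controller's actual code):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA formula: scale the replica count by the metric ratio,
    rounding up so capacity errs toward over-provisioning."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# The two worked examples from the text:
print(desired_replicas(1, 100, 50))  # 2
print(desired_replicas(2, 90, 50))   # 4
```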
The Tolerance Band
The HPA does not scale for small deviations. There is a default tolerance of 10% (--horizontal-pod-autoscaler-tolerance=0.1). If the ratio of current to desired metric value is within 0.9 to 1.1, no scaling action occurs. This prevents flapping on minor fluctuations.
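A minimal sketch of the tolerance check layered on top of the formula (plain Python, an approximation of the controller's behavior):

```python
import math

def scale_decision(current_replicas: int,
                   current_metric: float,
                   target_metric: float,
                   tolerance: float = 0.1) -> int:
    """Skip scaling when the current/target ratio is inside the
    tolerance band (0.9 to 1.1 by default)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within the band: no action
    return math.ceil(current_replicas * ratio)

print(scale_decision(4, 52, 50))  # 4 (ratio 1.04, inside the band)
print(scale_decision(4, 60, 50))  # 5 (ratio 1.2 -> ceil(4.8))
```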
Multiple Metrics
When multiple metrics are configured, the HPA computes the desired replica count for each metric independently and takes the maximum. This ensures the workload has enough capacity for the most demanding metric.
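The max-over-metrics rule can be sketched like this (illustrative Python, with made-up metric values):

```python
import math

def desired_from_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    Each metric is evaluated independently; the HPA takes the max."""
    return max(math.ceil(current_replicas * (cur / tgt))
               for cur, tgt in metrics)

# CPU alone would want 3 replicas, request rate would want 5;
# the HPA picks 5 so the most demanding metric is satisfied.
print(desired_from_metrics(2, [(75, 50), (250, 100)]))  # 5
```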
Why Resource Requests Are Required
The demo’s app sets CPU requests:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-burner
  template:
    metadata:
      labels:
        app: cpu-burner
    spec:
      containers:
        - name: app
          image: registry.k8s.io/hpa-example
          resources:
            requests:
              cpu: 200m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi
```

The HPA calculates CPU utilization as a percentage of the pod’s requests, not its limits. If a pod requests 200m CPU and currently uses 100m, utilization is 50%.
Without CPU requests, the HPA cannot compute utilization. The TARGETS column shows
<unknown> and no scaling happens. This is a common source of confusion. You set up the HPA,
generate load, and nothing happens because the Deployment has no resource requests.
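The requests-based utilization calculation is simple enough to spell out (a trivial sketch, using the demo's 200m request):

```python
def cpu_utilization_pct(usage_millicores: float,
                        request_millicores: float) -> float:
    """HPA utilization is measured against requests, never limits."""
    return 100 * usage_millicores / request_millicores

# 100m used against a 200m request -> 50% utilization.
print(cpu_utilization_pct(100, 200))  # 50.0
```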
The Metrics Pipeline
Metrics Server
The metrics server is the default metrics source for the HPA. It collects CPU and memory usage from the kubelet on every node. The kubelet gets these numbers from cAdvisor, which reads cgroup stats from the container runtime.
The flow:
```
Container Runtime --> cAdvisor --> Kubelet --> Metrics Server --> HPA Controller
```

The metrics server exposes the metrics.k8s.io API. The HPA controller queries this API to get per-pod CPU and memory usage.
In minikube, you enable it with:
```
minikube addons enable metrics-server
```

Metrics Lag
Metrics are not instantaneous. cAdvisor samples every 10-15 seconds. The kubelet exposes metrics every 15 seconds. The metrics server scrapes kubelets every 60 seconds. The HPA controller checks every 15 seconds.
End-to-end, there can be 30-90 seconds of lag between a load spike and a scaling action. The HPA controller accounts for this by using the most recent metric value available, but it cannot react to spikes faster than the pipeline delivers data.
Custom Metrics
Beyond CPU and memory, the HPA can scale on custom application metrics. This requires a custom metrics adapter (like Prometheus Adapter or KEDA).
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
```

This scales based on the average HTTP requests per second across all pods. If the average exceeds 100, the HPA scales up.
External Metrics
External metrics come from outside the cluster, like a cloud queue length:
```yaml
metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: 50
```

This scales based on the number of messages in an SQS queue. If the queue grows beyond 50, more pods are added to process messages faster.
Metric Types and Targets
Resource Metrics
CPU and memory. These are the most common.
Utilization target: Percentage of requests.
```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```

AverageValue target: Absolute value per pod.
```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 200m
```

Scaling on CPU vs Memory
CPU scales well. When CPU usage rises, adding more pods distributes the load. When load drops, CPU usage falls and the HPA scales down.
Memory is trickier. Many applications allocate memory at startup and never release it. Adding more pods reduces per-pod load, but existing pods may not free memory. The HPA might scale up successfully but never scale down because memory stays high.
For this reason, most production HPAs scale on CPU or custom request-rate metrics. Memory-based scaling works best for workloads with proportional memory usage (like in-memory caches that grow with request volume).
Scale-Up and Scale-Down Behavior
The behavior field gives fine-grained control over scaling speed.
Scale-Up Policies
Section titled “Scale-Up Policies”behavior: scaleUp: stabilizationWindowSeconds: 0 policies: - type: Percent value: 100 periodSeconds: 60 - type: Pods value: 4 periodSeconds: 60 selectPolicy: MaxThis says: “In any 60-second window, scale up by at most 100% of current replicas OR 4 pods,
whichever is larger.” With 2 current replicas, 100% = 2 pods and 4 pods = 4 pods, so the
max is 4.
selectPolicy: Max takes the larger of the two policies. Use Min to take the smaller
(more conservative) value. Use Disabled to prevent scaling up entirely.
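The policy arithmetic can be sketched as follows (illustrative Python, not the controller's code):

```python
import math

def scale_up_limit(current_replicas, policies, select_policy="Max"):
    """policies: list of ("Percent" | "Pods", value) pairs.
    Returns how many pods may be added in one period, chosen
    per selectPolicy (Max = permissive, Min = conservative)."""
    limits = []
    for ptype, value in policies:
        if ptype == "Percent":
            limits.append(math.ceil(current_replicas * value / 100))
        else:  # "Pods"
            limits.append(value)
    return max(limits) if select_policy == "Max" else min(limits)

policies = [("Percent", 100), ("Pods", 4)]
print(scale_up_limit(2, policies))         # 4 (max of 2 and 4)
print(scale_up_limit(2, policies, "Min"))  # 2
```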
Scale-Down Policies
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 50
        periodSeconds: 120
```

In any 120-second window, scale down by at most 50% of current replicas. With 10 replicas, this means at most 5 pods are removed per 2-minute period.
The demo uses a simpler configuration:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
```

This keeps the default scale-down policy but sets a 60-second stabilization window.
Stabilization Windows
The stabilization window prevents rapid scale-up/scale-down oscillations (flapping). The HPA controller looks at all computed replica counts within the window and picks the most conservative one: the lowest for scale-up, the highest for scale-down.
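In sketch form (plain Python, illustrative only):

```python
def stabilized(recommendations, direction):
    """recommendations: replica counts computed within the window.
    Scale-up uses the lowest recent recommendation, scale-down the
    highest; both are the conservative choice for their direction."""
    return min(recommendations) if direction == "up" else max(recommendations)

history = [4, 6, 5]                 # computed over the last window
print(stabilized(history, "up"))    # 4
print(stabilized(history, "down"))  # 6
```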
Scale-Down Stabilization
The default scale-down stabilization window is 300 seconds (5 minutes). The demo overrides it to 60 seconds for faster demonstrations:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
```

In production, 5 minutes is often appropriate. Traffic spikes can be followed by brief lulls. Without stabilization, the HPA might scale down during the lull and then immediately scale back up when traffic returns.
Scale-Up Stabilization
The default scale-up stabilization window is 0 seconds. Scale-up happens as fast as possible. This is usually what you want. Scaling up quickly handles traffic spikes. Scaling down slowly prevents unnecessary pod churn.
If your metrics are noisy, you might increase the scale-up window:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
```

This waits for 60 seconds of sustained high metrics before scaling up.
The Load Generator
The demo uses a simple load generator:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: load-generator
  namespace: hpa-demo
spec:
  containers:
    - name: busybox
      image: busybox:1.36
      command:
        - /bin/sh
        - -c
        - |
          echo "Starting load generation against http://cpu-burner..."
          while true; do
            wget -q -O- http://cpu-burner > /dev/null 2>&1
          done
```

This tight loop sends HTTP requests as fast as possible to the cpu-burner Service. The hpa-example container performs a CPU-intensive computation on each request, driving utilization above the 50% target.
Deleting the load generator pod stops the load. CPU drops. After the stabilization window
(60 seconds in the demo), the HPA scales back to minReplicas: 1.
VPA vs HPA
The Vertical Pod Autoscaler (VPA) and Horizontal Pod Autoscaler (HPA) solve different problems.
HPA: Adds or removes pod replicas. Scales horizontally. Works well for stateless workloads that can distribute load across multiple instances.

VPA: Adjusts CPU and memory requests/limits on existing pods. Scales vertically. Works well for workloads that cannot scale horizontally (single-instance databases) or workloads where the right resource request is hard to predict.
Using Both Together
Running HPA and VPA on the same metric (CPU) causes conflicts. Both try to adjust the workload, and they can fight each other. The VPA increases requests, which changes the HPA’s utilization calculation, which causes the HPA to scale, which changes the VPA’s recommendation.
Safe combinations:
- HPA on CPU, VPA on memory (in recommendation mode). The VPA suggests memory adjustments but does not auto-apply them.
- HPA on custom metrics, VPA on CPU/memory. They target different metrics and do not interfere.
- Multidimensional Pod Autoscaler (MPA). Some platforms combine VPA and HPA into a single controller that coordinates both dimensions.
Production Tuning Tips
Set Realistic Min and Max Replicas
minReplicas should handle baseline traffic without scaling. If your app needs at least 2 pods for redundancy, set minReplicas: 2. maxReplicas should account for cluster capacity. Setting maxReplicas: 1000 on a 10-node cluster is misleading.
Use Conservative Scale-Down
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 25
        periodSeconds: 120
```

Scale down by at most 25% every 2 minutes. This prevents aggressive scale-down during brief traffic lulls.
Monitor HPA Events
```
kubectl describe hpa cpu-burner -n hpa-demo
```

The events section shows scaling decisions, including which metric triggered the scale and what the computed replica count was.
Container Resource Targets
The containerResource metric type (Kubernetes 1.27+) lets you target a specific container
in a multi-container pod:
```yaml
metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 70
```

This is useful when sidecar containers (like Istio envoy) consume CPU but should not influence scaling decisions.
Avoid Scaling to Zero
The HPA has a hard floor of minReplicas: 1. It cannot scale to zero. For scale-to-zero
workloads, use KEDA (Kubernetes Event-Driven Autoscaling), which wraps the HPA and adds
zero-replica support.
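As an illustration only, a KEDA ScaledObject for a workload like this demo might look like the sketch below. The Prometheus address, query, and threshold are assumptions for illustration, not part of the demo:

```yaml
# Hypothetical KEDA ScaledObject sketch -- trigger details are assumed.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  scaleTargetRef:
    name: cpu-burner
  minReplicaCount: 0        # KEDA permits zero, unlike a plain HPA
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{app="cpu-burner"}[1m]))
        threshold: "100"
```

KEDA manages an HPA under the hood for the 1-to-N range and handles the 0-to-1 transition itself.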
How the Control Loop Works End-to-End
- Metrics server scrapes CPU usage from kubelets.
- HPA controller queries the metrics.k8s.io API every 15 seconds.
- Controller computes desiredReplicas = ceil(currentReplicas * (currentCPU / targetCPU)).
- Controller checks if the ratio is within the tolerance band (0.9 to 1.1).
- Controller applies stabilization window logic (looks at the window's history and picks the most conservative count).
- Controller applies scaling policies (max pods/percent per period).
- Controller clamps the result between minReplicas and maxReplicas.
- Controller patches the Deployment’s spec.replicas field.
- Deployment controller creates or deletes pods to match.
The entire cycle takes 15-90 seconds depending on metrics lag and pipeline latency.
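The compute-and-clamp portion of one iteration can be condensed into a single sketch (illustrative Python; stabilization windows and rate policies are omitted for brevity):

```python
import math

def reconcile(current_replicas, current_cpu, target_cpu,
              min_replicas=1, max_replicas=10, tolerance=0.1):
    """One simplified HPA iteration: formula, tolerance band,
    then min/max clamp."""
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas           # inside the tolerance band
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(reconcile(1, 100, 50))   # 2 (the demo's first scale-up)
print(reconcile(2, 50, 50))    # 2 (ratio 1.0, no change)
print(reconcile(10, 200, 50))  # 10 (clamped at maxReplicas)
```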
Connection to the Demo
The demo walks through the full lifecycle:
- Deploy: App starts with 1 replica and 200m CPU request.
- Verify: HPA reads 0% utilization (no load).
- Load: The load generator drives CPU to 100%.
- Scale up: HPA computes ceil(1 * (100/50)) = 2, then continues scaling as load distributes.
- Observe: Replicas increase until CPU per pod drops below 50%.
- Stop load: Delete the load generator.
- Scale down: After 60 seconds (the stabilization window), replicas decrease back to 1.
Common Pitfalls
Missing Resource Requests
Without resources.requests.cpu, the HPA shows <unknown> for targets and never scales.
Always set CPU requests on pods managed by an HPA.
Metrics Server Not Running
Section titled “Metrics Server Not Running”If kubectl top nodes returns an error, the metrics server is not installed or not ready.
The HPA depends on it for resource metrics.
Overly Aggressive Scale-Down
The default 5-minute stabilization window exists for a reason. Shortening it too much causes flapping: scale down, traffic returns, scale up, traffic subsides, scale down again.
CPU Limits Causing Throttling
If a pod’s CPU limit is too low, the container gets throttled even when the node has spare CPU. The throttled pod appears to use 100% of its limit, triggering the HPA. But adding more pods does not help because each new pod is also throttled. Consider raising limits or removing them entirely (keeping requests).
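One way to apply that last suggestion, sketched against the demo's resource block (keep the CPU request and memory limit, drop only the CPU limit):

```yaml
resources:
  requests:
    cpu: 200m
    memory: 64Mi
  limits:
    memory: 128Mi   # keep a memory limit; omit the CPU limit to avoid throttling
```

The pod can then burst above 200m when the node has spare CPU, while the HPA still measures utilization against the 200m request.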