Horizontal Pod Autoscaler: Deep Dive

This document explains how the HPA scaling algorithm works, why resource requests are essential, and when to tune stabilization windows and scaling policies. It connects the demo manifests to the metrics pipeline and production autoscaling patterns.


The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). On each iteration, it:

  1. Fetches current metric values for all pods in the target workload.
  2. Computes the desired replica count.
  3. Scales the target if the desired count differs from the current count.

The core formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

The demo’s HPA targets 50% CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-burner
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60

If the single pod is using 100% CPU against a target of 50%:

desiredReplicas = ceil(1 * (100 / 50)) = ceil(2.0) = 2

The HPA scales from 1 to 2 replicas. After scaling, if each of the 2 pods uses 50%, the ratio is 1.0 and no further scaling occurs.

If load increases and both pods hit 90%:

desiredReplicas = ceil(2 * (90 / 50)) = ceil(3.6) = 4

The HPA jumps to 4 replicas. The ceil() function always rounds up, biasing toward over-provisioning rather than under-provisioning.
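These worked examples can be reproduced with a short Python sketch of the formula (a simplification; the real controller also handles tolerance, pod readiness, and missing metrics):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Core HPA formula: ceil() biases toward over-provisioning.
    return math.ceil(current_replicas * (current_metric / target_metric))

# One pod at 100% CPU against a 50% target scales to 2.
print(desired_replicas(1, 100, 50))  # 2
# Two pods at 90% against 50%: ceil(3.6) jumps to 4.
print(desired_replicas(2, 90, 50))   # 4
```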

The HPA does not scale for small deviations. There is a default tolerance of 10% (--horizontal-pod-autoscaler-tolerance=0.1). If the ratio of current to desired metric value is within 0.9 to 1.1, no scaling action occurs. This prevents flapping on minor fluctuations.

When multiple metrics are configured, the HPA computes the desired replica count for each metric independently and takes the maximum. This ensures the workload has enough capacity for the most demanding metric.
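Both behaviors, the tolerance band and the per-metric maximum, can be sketched together (simplified; pod readiness and missing metrics are ignored):

```python
import math

TOLERANCE = 0.1  # default --horizontal-pod-autoscaler-tolerance

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    ratio = current / target
    if abs(ratio - 1.0) <= TOLERANCE:  # within 0.9-1.1: no scaling action
        return current_replicas
    return math.ceil(current_replicas * ratio)

def desired_across_metrics(current_replicas: int, metrics: list) -> int:
    # metrics is a list of (current_value, target_value) pairs; take the max.
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# CPU is 5% over target (within tolerance, no change), but a second
# metric at 90 against 50 demands ceil(4 * 1.8) = 8 replicas.
print(desired_across_metrics(4, [(52.5, 50), (90, 50)]))  # 8
```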


The demo’s app sets CPU requests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-burner
  template:
    metadata:
      labels:
        app: cpu-burner
    spec:
      containers:
        - name: app
          image: registry.k8s.io/hpa-example
          resources:
            requests:
              cpu: 200m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi

The HPA calculates CPU utilization as a percentage of the pod’s requests, not its limits. If a pod requests 200m CPU and currently uses 100m, utilization is 50%.

Without CPU requests, the HPA cannot compute utilization. The TARGETS column shows <unknown> and no scaling happens. This is a common source of confusion. You set up the HPA, generate load, and nothing happens because the Deployment has no resource requests.
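The utilization arithmetic is simple enough to sketch (a hypothetical helper, not part of any Kubernetes API):

```python
def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    # Utilization is measured against requests, not limits.
    # With no request, the denominator is undefined -- the HPA shows <unknown>.
    if request_millicores <= 0:
        raise ValueError("cannot compute utilization without a CPU request")
    return 100.0 * usage_millicores / request_millicores

# A pod requesting 200m and using 100m sits at 50% utilization,
# regardless of its 500m limit.
print(cpu_utilization_percent(100, 200))  # 50.0
```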


The metrics server is the default metrics source for the HPA. It collects CPU and memory usage from the kubelet on every node. The kubelet gets these numbers from cAdvisor, which reads cgroup stats from the container runtime.

The flow:

Container Runtime --> cAdvisor --> Kubelet --> Metrics Server --> HPA Controller

The metrics server exposes the metrics.k8s.io API. The HPA controller queries this API to get per-pod CPU and memory usage.

In minikube, you enable it with:

minikube addons enable metrics-server

Metrics are not instantaneous. cAdvisor samples every 10-15 seconds. The kubelet exposes metrics every 15 seconds. The metrics server scrapes kubelets every 60 seconds. The HPA controller checks every 15 seconds.

End-to-end, there can be 30-90 seconds of lag between a load spike and a scaling action. The HPA controller accounts for this by using the most recent metric value available, but it cannot react to spikes faster than the pipeline delivers data.

Beyond CPU and memory, the HPA can scale on custom application metrics. This requires a custom metrics adapter (like Prometheus Adapter or KEDA).

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100

This scales based on the average HTTP requests per second across all pods. If the average exceeds 100, the HPA scales up.
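The same core formula applies, using the per-pod average (a sketch; the `per_pod_values` input shape is an assumption for illustration):

```python
import math

def desired_from_average(current_replicas: int, per_pod_values: list, target_average: float) -> int:
    # Average the metric across pods, then apply the core HPA formula.
    average = sum(per_pod_values) / len(per_pod_values)
    return math.ceil(current_replicas * (average / target_average))

# Three pods serving 150, 120, and 90 req/s against a 100 req/s target:
# the average is 120, so ceil(3 * 1.2) = 4 replicas.
print(desired_from_average(3, [150, 120, 90], 100))  # 4
```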

External metrics come from outside the cluster, like a cloud queue length:

metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: 50

This scales based on the number of messages in an SQS queue. If the queue grows beyond 50, more pods are added to process messages faster.


Resource metrics cover CPU and memory. These are the most common scaling signals.

Utilization target: Percentage of requests.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

AverageValue target: Absolute value per pod.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 200m

CPU scales well. When CPU usage rises, adding more pods distributes the load. When load drops, CPU usage falls and the HPA scales down.

Memory is trickier. Many applications allocate memory at startup and never release it. Adding more pods reduces per-pod load, but existing pods may not free memory. The HPA might scale up successfully but never scale down because memory stays high.

For this reason, most production HPAs scale on CPU or custom request-rate metrics. Memory-based scaling works best for workloads with proportional memory usage (like in-memory caches that grow with request volume).


The behavior field gives fine-grained control over scaling speed.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
    selectPolicy: Max

This says: “In any 60-second window, scale up by at most 100% of current replicas OR 4 pods, whichever is larger.” With 2 current replicas, 100% = 2 pods and 4 pods = 4 pods, so the max is 4.

selectPolicy: Max takes the larger of the two policies. Use Min to take the smaller (more conservative) value. Use Disabled to prevent scaling up entirely.

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 50
        periodSeconds: 120

In any 120-second window, scale down by at most 50% of current replicas. With 10 replicas, this means at most 5 pods are removed per 2-minute period.
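The policy arithmetic from both examples can be sketched as follows (rounding details are simplified relative to the real controller):

```python
import math

def max_scale_up(current: int, percent: int, pods: int, select: str = "Max") -> int:
    # Upper bound on replicas after one scale-up policy period.
    by_percent = math.floor(current * percent / 100)  # change allowed by the Percent policy
    by_pods = pods                                    # change allowed by the Pods policy
    change = max(by_percent, by_pods) if select == "Max" else min(by_percent, by_pods)
    return current + change

def max_scale_down(current: int, percent: int) -> int:
    # Lower bound on replicas after one scale-down period with a Percent policy.
    return current - math.floor(current * percent / 100)

# 2 replicas, 100% or 4 pods, selectPolicy Max: up to 4 pods may be added.
print(max_scale_up(2, 100, 4))   # 6
# 10 replicas, 50% per 120s: at most 5 pods removed, leaving at least 5.
print(max_scale_down(10, 50))    # 5
```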

The demo uses a simpler configuration:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60

This keeps the default scale-down policy but sets a 60-second stabilization window.


The stabilization window prevents rapid scale-up/scale-down oscillations (flapping). The HPA controller looks at all computed replica counts within the window and picks the lowest (for scale-up) or the highest (for scale-down), so a momentary spike or dip cannot move the replica count on its own.

The default scale-down stabilization window is 300 seconds (5 minutes). The demo overrides it to 60 seconds for faster demonstrations:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60

In production, 5 minutes is often appropriate. Traffic spikes can be followed by brief lulls. Without stabilization, the HPA might scale down during the lull and then immediately scale back up when traffic returns.
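The window logic amounts to picking from the recommendations recorded during the window (a simplification of the controller's bookkeeping):

```python
def stabilized_scale_down(recommendations: list) -> int:
    # Scale-down uses the highest recommendation within the window, so
    # replicas only drop to a level every recent computation agreed on.
    return max(recommendations)

def stabilized_scale_up(recommendations: list) -> int:
    # With a nonzero scale-up window, the lowest recommendation wins,
    # so a momentary spike does not trigger scale-up.
    return min(recommendations)

# A brief lull recommended 2 replicas, but most of the window said 4:
# the controller holds at 4 until the lull outlasts the window.
print(stabilized_scale_down([4, 4, 2, 4]))  # 4
```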

The default scale-up stabilization window is 0 seconds. Scale-up happens as fast as possible. This is usually what you want. Scaling up quickly handles traffic spikes. Scaling down slowly prevents unnecessary pod churn.

If your metrics are noisy, you might increase the scale-up window:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60

With this setting, the controller uses the lowest replica recommendation computed over the past 60 seconds, so only sustained high metrics trigger a scale-up.


The demo uses a simple load generator:

apiVersion: v1
kind: Pod
metadata:
  name: load-generator
  namespace: hpa-demo
spec:
  containers:
    - name: busybox
      image: busybox:1.36
      command:
        - /bin/sh
        - -c
        - |
          echo "Starting load generation against http://cpu-burner..."
          while true; do
            wget -q -O- http://cpu-burner > /dev/null 2>&1
          done

This tight loop sends HTTP requests as fast as possible to the cpu-burner Service. The hpa-example container performs a CPU-intensive computation on each request, driving utilization above the 50% target.

Deleting the load generator pod stops the load. CPU drops. After the stabilization window (60 seconds in the demo), the HPA scales back to minReplicas: 1.


The Vertical Pod Autoscaler (VPA) and Horizontal Pod Autoscaler (HPA) solve different problems.

The HPA adds or removes pod replicas. It scales horizontally, and works well for stateless workloads that can distribute load across multiple instances.

The VPA adjusts CPU and memory requests/limits on existing pods. It scales vertically, and works well for workloads that cannot scale horizontally (single-instance databases) or whose right resource request is hard to predict.

Running HPA and VPA on the same metric (CPU) causes conflicts. Both try to adjust the workload, and they can fight each other. The VPA increases requests, which changes the HPA’s utilization calculation, which causes the HPA to scale, which changes the VPA’s recommendation.

Safe combinations:

  • HPA on CPU, VPA on memory (in recommendation mode). The VPA suggests memory adjustments but does not auto-apply them.
  • HPA on custom metrics, VPA on CPU/memory. They target different metrics and do not interfere.
  • Multidimensional Pod Autoscaler (MPA). Some platforms combine VPA and HPA into a single controller that coordinates both dimensions.

  • minReplicas should handle baseline traffic without scaling. If your app needs at least 2 pods for redundancy, set minReplicas: 2.
  • maxReplicas should account for cluster capacity. Setting maxReplicas: 1000 on a 10-node cluster is misleading.
  • Configure scale-down conservatively:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 25
        periodSeconds: 120

Scale down by at most 25% every 2 minutes. This prevents aggressive scale-down during brief traffic lulls.

To inspect scaling decisions, describe the HPA:

kubectl describe hpa cpu-burner -n hpa-demo

The events section shows scaling decisions, including which metric triggered the scale and what the computed replica count was.

The containerResource metric type (Kubernetes 1.27+) lets you target a specific container in a multi-container pod:

metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 70

This is useful when sidecar containers (like Istio envoy) consume CPU but should not influence scaling decisions.

The HPA has a hard floor of minReplicas: 1. It cannot scale to zero. For scale-to-zero workloads, use KEDA (Kubernetes Event-Driven Autoscaling), which wraps the HPA and adds zero-replica support.


  1. Metrics server scrapes CPU usage from kubelets.
  2. HPA controller queries the metrics.k8s.io API every 15 seconds.
  3. Controller computes desiredReplicas = ceil(currentReplicas * (currentCPU / targetCPU)).
  4. Controller checks if the ratio is within the tolerance band (0.9 to 1.1).
  5. Controller applies stabilization window logic (look at history, pick max or min).
  6. Controller applies scaling policies (max pods/percent per period).
  7. Controller clamps the result between minReplicas and maxReplicas.
  8. Controller patches the Deployment’s spec.replicas field.
  9. Deployment controller creates or deletes pods to match.

The entire cycle takes 15-90 seconds depending on metrics lag and pipeline latency.
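Steps 3 through 7 of the loop can be condensed into one simplified function (scaling policies and readiness handling omitted):

```python
import math

def hpa_step(current: int, current_cpu: float, target_cpu: float,
             min_replicas: int, max_replicas: int,
             history: list, tolerance: float = 0.1) -> int:
    # One simplified iteration of the HPA control loop.
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        desired = current                     # within the tolerance band
    else:
        desired = math.ceil(current * ratio)  # the core formula
    if desired < current:
        # Scale-down stabilization: honor the highest recent recommendation.
        desired = max([desired] + history)
    # Clamp between minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))

# Load spike: 1 pod at 100% CPU against a 50% target scales to 2.
print(hpa_step(1, 100, 50, 1, 10, history=[]))  # 2
# Load gone, but the window still holds a recommendation of 2: stay at 2.
print(hpa_step(2, 10, 50, 1, 10, history=[2]))  # 2
```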


The demo walks through the full lifecycle:

  1. Deploy: App starts with 1 replica and 200m CPU request.
  2. Verify: HPA reads 0% utilization (no load).
  3. Load: The load generator drives CPU to 100%.
  4. Scale up: HPA computes ceil(1 * (100/50)) = 2, then continues scaling as load distributes.
  5. Observe: Replicas increase until CPU per pod drops below 50%.
  6. Stop load: Delete the load generator.
  7. Scale down: After 60 seconds (the stabilization window), replicas decrease back to 1.

Without resources.requests.cpu, the HPA shows <unknown> for targets and never scales. Always set CPU requests on pods managed by an HPA.

If kubectl top nodes returns an error, the metrics server is not installed or not ready. The HPA depends on it for resource metrics.

The default 5-minute stabilization window exists for a reason. Shortening it too much causes flapping: scale down, traffic returns, scale up, traffic subsides, scale down again.

If a pod’s CPU limit is too low, the container gets throttled even when the node has spare CPU. The throttled pod appears to use 100% of its limit, triggering the HPA. But adding more pods does not help because each new pod is also throttled. Consider raising limits or removing them entirely (keeping requests).