Horizontal Pod Autoscaler: Deep Dive

This document explains how the HPA scaling algorithm works, why resource requests are essential, and when to tune stabilization windows and scaling policies. It connects the demo manifests to the metrics pipeline and production autoscaling patterns.


The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). On each iteration, it:

  1. Fetches current metric values for all pods in the target workload.
  2. Computes the desired replica count.
  3. Scales the target if the desired count differs from the current count.

The core formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

The demo’s HPA targets 50% CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-burner
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60

If the single pod is using 100% CPU against a target of 50%:

desiredReplicas = ceil(1 * (100 / 50)) = ceil(2.0) = 2

The HPA scales from 1 to 2 replicas. After scaling, if each of the 2 pods uses 50%, the ratio is 1.0 and no further scaling occurs.

If load increases and both pods hit 90%:

desiredReplicas = ceil(2 * (90 / 50)) = ceil(3.6) = 4

The HPA jumps to 4 replicas. The ceil() function always rounds up, biasing toward over-provisioning rather than under-provisioning.
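These worked examples can be reproduced with a short Python sketch of the formula (a simplification; the real controller also handles tolerance, pod readiness, and missing metrics):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Core HPA formula: ceil() biases toward over-provisioning.
    return math.ceil(current_replicas * (current_metric / target_metric))

# One pod at 100% CPU against a 50% target scales to 2.
print(desired_replicas(1, 100, 50))  # 2
# Two pods at 90% against 50%: ceil(3.6) jumps to 4.
print(desired_replicas(2, 90, 50))   # 4
```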

The HPA does not scale for small deviations. There is a default tolerance of 10% (--horizontal-pod-autoscaler-tolerance=0.1). If the ratio of current to desired metric value is within 0.9 to 1.1, no scaling action occurs. This prevents flapping on minor fluctuations.

When multiple metrics are configured, the HPA computes the desired replica count for each metric independently and takes the maximum. This ensures the workload has enough capacity for the most demanding metric.
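Both behaviors, the tolerance band and the per-metric maximum, can be sketched together (simplified; pod readiness and missing metrics are ignored):

```python
import math

TOLERANCE = 0.1  # default --horizontal-pod-autoscaler-tolerance

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    ratio = current / target
    if abs(ratio - 1.0) <= TOLERANCE:  # within 0.9-1.1: no scaling action
        return current_replicas
    return math.ceil(current_replicas * ratio)

def desired_across_metrics(current_replicas: int, metrics: list) -> int:
    # metrics is a list of (current_value, target_value) pairs; take the max.
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# CPU is 5% over target (within tolerance, no change), but a second
# metric at 90 against 50 demands ceil(4 * 1.8) = 8 replicas.
print(desired_across_metrics(4, [(52.5, 50), (90, 50)]))  # 8
```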


The demo’s app sets CPU requests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-burner
  template:
    metadata:
      labels:
        app: cpu-burner
    spec:
      containers:
        - name: app
          image: registry.k8s.io/hpa-example
          resources:
            requests:
              cpu: 200m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi

The HPA calculates CPU utilization as a percentage of the pod’s requests, not its limits. If a pod requests 200m CPU and currently uses 100m, utilization is 50%.

Without CPU requests, the HPA cannot compute utilization. The TARGETS column shows <unknown> and no scaling happens. This is a common source of confusion. You set up the HPA, generate load, and nothing happens because the Deployment has no resource requests.
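The utilization arithmetic is simple enough to sketch (a hypothetical helper, not part of any Kubernetes API):

```python
def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    # Utilization is measured against requests, not limits.
    # With no request, the denominator is undefined -- the HPA shows <unknown>.
    if request_millicores <= 0:
        raise ValueError("cannot compute utilization without a CPU request")
    return 100.0 * usage_millicores / request_millicores

# A pod requesting 200m and using 100m sits at 50% utilization,
# regardless of its 500m limit.
print(cpu_utilization_percent(100, 200))  # 50.0
```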


The metrics server is the default metrics source for the HPA. It collects CPU and memory usage from the kubelet on every node. The kubelet gets these numbers from cAdvisor, which reads cgroup stats from the container runtime.

The flow:

Container Runtime --> cAdvisor --> Kubelet --> Metrics Server --> HPA Controller

The metrics server exposes the metrics.k8s.io API. The HPA controller queries this API to get per-pod CPU and memory usage.

In minikube, you enable it with:

minikube addons enable metrics-server

Metrics are not instantaneous. cAdvisor samples every 10-15 seconds. The kubelet exposes metrics every 15 seconds. The metrics server scrapes kubelets every 60 seconds. The HPA controller checks every 15 seconds.

End-to-end, there can be 30-90 seconds of lag between a load spike and a scaling action. The HPA controller accounts for this by using the most recent metric value available, but it cannot react to spikes faster than the pipeline delivers data.

Beyond CPU and memory, the HPA can scale on custom application metrics. This requires a custom metrics adapter (like Prometheus Adapter or KEDA).

metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100

This scales based on the average HTTP requests per second across all pods. If the average exceeds 100, the HPA scales up.
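The same core formula applies, using the per-pod average (a sketch; the `per_pod_values` input shape is an assumption for illustration):

```python
import math

def desired_from_average(current_replicas: int, per_pod_values: list, target_average: float) -> int:
    # Average the metric across pods, then apply the core HPA formula.
    average = sum(per_pod_values) / len(per_pod_values)
    return math.ceil(current_replicas * (average / target_average))

# Three pods serving 150, 120, and 90 req/s against a 100 req/s target:
# the average is 120, so ceil(3 * 1.2) = 4 replicas.
print(desired_from_average(3, [150, 120, 90], 100))  # 4
```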

External metrics come from outside the cluster, like a cloud queue length:

metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: 50

This scales based on the number of messages in an SQS queue. If the queue grows beyond 50, more pods are added to process messages faster.


Resource metrics cover CPU and memory. These are the most common scaling signals.

Utilization target: Percentage of requests.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

AverageValue target: Absolute value per pod.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 200m

CPU scales well. When CPU usage rises, adding more pods distributes the load. When load drops, CPU usage falls and the HPA scales down.

Memory is trickier. Many applications allocate memory at startup and never release it. Adding more pods reduces per-pod load, but existing pods may not free memory. The HPA might scale up successfully but never scale down because memory stays high.

For this reason, most production HPAs scale on CPU or custom request-rate metrics. Memory-based scaling works best for workloads with proportional memory usage (like in-memory caches that grow with request volume).


The behavior field gives fine-grained control over scaling speed.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
    selectPolicy: Max

This says: “In any 60-second window, scale up by at most 100% of current replicas OR 4 pods, whichever is larger.” With 2 current replicas, 100% = 2 pods and 4 pods = 4 pods, so the max is 4.

selectPolicy: Max takes the larger of the two policies. Use Min to take the smaller (more conservative) value. Use Disabled to prevent scaling up entirely.

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 50
        periodSeconds: 120

In any 120-second window, scale down by at most 50% of current replicas. With 10 replicas, this means at most 5 pods are removed per 2-minute period.
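The policy arithmetic from both examples can be sketched as follows (rounding details are simplified relative to the real controller):

```python
import math

def max_scale_up(current: int, percent: int, pods: int, select: str = "Max") -> int:
    # Upper bound on replicas after one scale-up policy period.
    by_percent = math.floor(current * percent / 100)  # change allowed by the Percent policy
    by_pods = pods                                    # change allowed by the Pods policy
    change = max(by_percent, by_pods) if select == "Max" else min(by_percent, by_pods)
    return current + change

def max_scale_down(current: int, percent: int) -> int:
    # Lower bound on replicas after one scale-down period with a Percent policy.
    return current - math.floor(current * percent / 100)

# 2 replicas, 100% or 4 pods, selectPolicy Max: up to 4 pods may be added.
print(max_scale_up(2, 100, 4))   # 6
# 10 replicas, 50% per 120s: at most 5 pods removed, leaving at least 5.
print(max_scale_down(10, 50))    # 5
```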

The demo uses a simpler configuration:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60

This keeps the default scale-down policy but sets a 60-second stabilization window.


The stabilization window prevents rapid scale-up/scale-down oscillations (flapping). The HPA controller looks at all computed replica counts within the window and picks the lowest (for scale-up) or the highest (for scale-down), so a momentary spike or dip cannot move the replica count on its own.

The default scale-down stabilization window is 300 seconds (5 minutes). The demo overrides it to 60 seconds for faster demonstrations:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60

In production, 5 minutes is often appropriate. Traffic spikes can be followed by brief lulls. Without stabilization, the HPA might scale down during the lull and then immediately scale back up when traffic returns.
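The window logic amounts to picking from the recommendations recorded during the window (a simplification of the controller's bookkeeping):

```python
def stabilized_scale_down(recommendations: list) -> int:
    # Scale-down uses the highest recommendation within the window, so
    # replicas only drop to a level every recent computation agreed on.
    return max(recommendations)

def stabilized_scale_up(recommendations: list) -> int:
    # With a nonzero scale-up window, the lowest recommendation wins,
    # so a momentary spike does not trigger scale-up.
    return min(recommendations)

# A brief lull recommended 2 replicas, but most of the window said 4:
# the controller holds at 4 until the lull outlasts the window.
print(stabilized_scale_down([4, 4, 2, 4]))  # 4
```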

The default scale-up stabilization window is 0 seconds. Scale-up happens as fast as possible. This is usually what you want. Scaling up quickly handles traffic spikes. Scaling down slowly prevents unnecessary pod churn.

If your metrics are noisy, you might increase the scale-up window:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60

With this setting, the controller uses the lowest replica recommendation computed over the past 60 seconds, so only sustained high metrics trigger a scale-up.


The demo uses a simple load generator:

apiVersion: v1
kind: Pod
metadata:
  name: load-generator
  namespace: hpa-demo
spec:
  containers:
    - name: busybox
      image: busybox:1.36
      command:
        - /bin/sh
        - -c
        - |
          echo "Starting load generation against http://cpu-burner..."
          while true; do
            wget -q -O- http://cpu-burner > /dev/null 2>&1
          done

This tight loop sends HTTP requests as fast as possible to the cpu-burner Service. The hpa-example container performs a CPU-intensive computation on each request, driving utilization above the 50% target.

Deleting the load generator pod stops the load. CPU drops. After the stabilization window (60 seconds in the demo), the HPA scales back to minReplicas: 1.


The Vertical Pod Autoscaler (VPA) and Horizontal Pod Autoscaler (HPA) solve different problems.

The HPA adds or removes pod replicas. It scales horizontally, and works well for stateless workloads that can distribute load across multiple instances.

The VPA adjusts CPU and memory requests/limits on existing pods. It scales vertically, and works well for workloads that cannot scale horizontally (single-instance databases) or whose right resource request is hard to predict.

Running HPA and VPA on the same metric (CPU) causes conflicts. Both try to adjust the workload, and they can fight each other. The VPA increases requests, which changes the HPA’s utilization calculation, which causes the HPA to scale, which changes the VPA’s recommendation.

Safe combinations:

  • HPA on CPU, VPA on memory (in recommendation mode). The VPA suggests memory adjustments but does not auto-apply them.
  • HPA on custom metrics, VPA on CPU/memory. They target different metrics and do not interfere.
  • Multidimensional Pod Autoscaler (MPA). Some platforms combine VPA and HPA into a single controller that coordinates both dimensions.

  • minReplicas should handle baseline traffic without scaling. If your app needs at least 2 pods for redundancy, set minReplicas: 2.
  • maxReplicas should account for cluster capacity. Setting maxReplicas: 1000 on a 10-node cluster is misleading.
  • Configure scale-down conservatively:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 25
        periodSeconds: 120

Scale down by at most 25% every 2 minutes. This prevents aggressive scale-down during brief traffic lulls.

To inspect scaling decisions, describe the HPA:

kubectl describe hpa cpu-burner -n hpa-demo

The events section shows scaling decisions, including which metric triggered the scale and what the computed replica count was.

The containerResource metric type (Kubernetes 1.27+) lets you target a specific container in a multi-container pod:

metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 70

This is useful when sidecar containers (like Istio envoy) consume CPU but should not influence scaling decisions.

The HPA has a hard floor of minReplicas: 1. It cannot scale to zero. For scale-to-zero workloads, use KEDA (Kubernetes Event-Driven Autoscaling), which wraps the HPA and adds zero-replica support.


  1. Metrics server scrapes CPU usage from kubelets.
  2. HPA controller queries the metrics.k8s.io API every 15 seconds.
  3. Controller computes desiredReplicas = ceil(currentReplicas * (currentCPU / targetCPU)).
  4. Controller checks if the ratio is within the tolerance band (0.9 to 1.1).
  5. Controller applies stabilization window logic (look at history, pick max or min).
  6. Controller applies scaling policies (max pods/percent per period).
  7. Controller clamps the result between minReplicas and maxReplicas.
  8. Controller patches the Deployment’s spec.replicas field.
  9. Deployment controller creates or deletes pods to match.

The entire cycle takes 15-90 seconds depending on metrics lag and pipeline latency.
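Steps 3 through 7 of the loop can be condensed into one simplified function (scaling policies and readiness handling omitted):

```python
import math

def hpa_step(current: int, current_cpu: float, target_cpu: float,
             min_replicas: int, max_replicas: int,
             history: list, tolerance: float = 0.1) -> int:
    # One simplified iteration of the HPA control loop.
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        desired = current                     # within the tolerance band
    else:
        desired = math.ceil(current * ratio)  # the core formula
    if desired < current:
        # Scale-down stabilization: honor the highest recent recommendation.
        desired = max([desired] + history)
    # Clamp between minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))

# Load spike: 1 pod at 100% CPU against a 50% target scales to 2.
print(hpa_step(1, 100, 50, 1, 10, history=[]))  # 2
# Load gone, but the window still holds a recommendation of 2: stay at 2.
print(hpa_step(2, 10, 50, 1, 10, history=[2]))  # 2
```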


The demo walks through the full lifecycle:

  1. Deploy: App starts with 1 replica and 200m CPU request.
  2. Verify: HPA reads 0% utilization (no load).
  3. Load: The load generator drives CPU to 100%.
  4. Scale up: HPA computes ceil(1 * (100/50)) = 2, then continues scaling as load distributes.
  5. Observe: Replicas increase until CPU per pod drops below 50%.
  6. Stop load: Delete the load generator.
  7. Scale down: After 60 seconds (the stabilization window), replicas decrease back to 1.

Without resources.requests.cpu, the HPA shows <unknown> for targets and never scales. Always set CPU requests on pods managed by an HPA.

If kubectl top nodes returns an error, the metrics server is not installed or not ready. The HPA depends on it for resource metrics.

The default 5-minute stabilization window exists for a reason. Shortening it too much causes flapping: scale down, traffic returns, scale up, traffic subsides, scale down again.

If a pod’s CPU limit is too low, the container gets throttled even when the node has spare CPU. The throttled pod appears to use 100% of its limit, triggering the HPA. But adding more pods does not help because each new pod is also throttled. Consider raising limits or removing them entirely (keeping requests).