
Horizontal Pod Autoscaler (HPA)

Automatically scale pods up and down based on CPU utilization.

Time: ~15 minutes · Difficulty: Intermediate

What you'll learn:

  • HPA: scaling replicas based on metrics
  • CPU utilization targets and scaling thresholds
  • Scale-up and scale-down behavior
  • Stabilization windows to prevent flapping
  • Why resource requests are required for HPA to work

The metrics server must be enabled:

```sh
minikube addons enable metrics-server
```

Wait a minute for metrics to start flowing:

```sh
kubectl top nodes
```

Navigate to the demo directory:

```sh
cd demos/hpa
```

Apply the manifests:

```sh
kubectl apply -f manifests/namespace.yaml
kubectl apply -f manifests/app.yaml
kubectl apply -f manifests/hpa.yaml
```

Verify the HPA can read metrics (TARGETS may show `<unknown>` for a minute):

```sh
kubectl get hpa -n hpa-demo -w
```

Wait until TARGETS shows an actual percentage (e.g., `0%/50%`).

Now generate load:

```sh
kubectl apply -f manifests/load-generator.yaml
```
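The load generator itself is just a pod that requests the app in a tight loop. A minimal sketch of what manifests/load-generator.yaml might contain — the image and the Service name `cpu-burner` are assumptions, so check the actual file:

```yaml
# Hypothetical sketch of manifests/load-generator.yaml.
# The busybox image and the Service name are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: load-generator
  namespace: hpa-demo
spec:
  restartPolicy: Never
  containers:
    - name: load-generator
      image: busybox:1.36
      # Hammer the app's Service with requests in a tight loop,
      # pushing CPU usage above the HPA's 50% target.
      command: ["/bin/sh", "-c", "while true; do wget -q -O- http://cpu-burner; done"]
```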

Now watch the HPA react in a separate terminal:

```sh
kubectl get hpa -n hpa-demo -w
```

Within 1-2 minutes, you should see:

  1. CPU utilization climbing above 50%
  2. REPLICAS increasing from 1 to 2, then 3, then more
  3. CPU utilization stabilizing as load is spread across pods

Watch pods scale up:

```sh
kubectl get pods -n hpa-demo -w
```

When you're done, stop the load generator:

```sh
kubectl delete pod load-generator -n hpa-demo
```

Watch the HPA scale back down (takes about 60 seconds due to the stabilization window):

```sh
kubectl get hpa -n hpa-demo -w
```

The demo's files:

```
manifests/
  namespace.yaml       # hpa-demo namespace
  app.yaml             # Deployment + Service (CPU-intensive app)
  hpa.yaml             # HPA targeting 50% CPU utilization
  load-generator.yaml  # Pod that hammers the app with requests
```
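For reference, a hedged sketch of what manifests/hpa.yaml likely contains, based on the description above — the target Deployment name and the min/max replica counts are assumptions:

```yaml
# Hypothetical sketch of manifests/hpa.yaml (autoscaling/v2).
# scaleTargetRef name and minReplicas/maxReplicas are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-burner
  namespace: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-burner
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # scale when average CPU exceeds 50% of requests
```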

How the HPA decides to scale:

  1. Metrics server collects CPU usage from each pod every 15 seconds
  2. HPA checks metrics every 15 seconds (default)
  3. It calculates: `desiredReplicas = ceil(currentReplicas * (currentUtilization / targetUtilization))`
  4. If CPU is at 100% with target 50% and 1 replica: `ceil(1 * (100/50)) = 2` replicas
  5. Scale-down waits for the stabilization window (60s) to avoid flapping
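In manifest form, the scale-down stabilization window from step 5 lives under the HPA's `behavior` field. A minimal sketch, assuming the demo's hpa.yaml sets the 60-second window explicitly:

```yaml
# HPA spec fragment: the stabilization window delays scale-down,
# so brief dips in load don't cause replica flapping.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
```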

Why resource requests matter: The HPA calculates utilization as a percentage of the pod’s CPU requests. Without requests, the HPA cannot compute utilization and will not scale.
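For example, the app's Deployment must set a CPU request on its container. A sketch of the relevant fragment — the name, image, and values here are illustrative, not the demo's actual app.yaml:

```yaml
# Container spec fragment: the HPA reads CPU usage relative to requests.cpu.
# With requests.cpu: 200m, a pod burning 100m of CPU reports 50% utilization.
containers:
  - name: app
    image: example/cpu-burner:latest   # hypothetical image
    resources:
      requests:
        cpu: 200m    # required for CPU-utilization HPAs
      limits:
        cpu: 500m
```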

Things to try:

  1. Change the target to 30% and watch more aggressive scaling:

     ```sh
     kubectl patch hpa cpu-burner -n hpa-demo \
       --type=merge -p '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":30}}}]}}'
     ```

  2. Check HPA events to see scaling decisions:

     ```sh
     kubectl describe hpa cpu-burner -n hpa-demo
     ```

  3. Set a longer stabilization window to see slower scale-down:

     ```sh
     kubectl patch hpa cpu-burner -n hpa-demo \
       --type=merge -p '{"spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":300}}}}'
     ```
Clean up when you're done:

```sh
kubectl delete namespace hpa-demo
```

See docs/deep-dive.md for a detailed explanation of the scaling algorithm, custom metrics (memory, requests-per-second), scale-up/scale-down policies, VPA vs HPA, and production tuning.

Move on to RBAC to learn about ServiceAccounts, Roles, and access control.