Vertical Pod Autoscaler: Deep Dive

This document explains how the Vertical Pod Autoscaler calculates resource recommendations, why it requires pod restarts in most modes, and when to use VPA versus HPA. It connects the demo manifests to VPA’s internal architecture and production autoscaling patterns.


Kubernetes requires you to set resource requests on your containers. The scheduler uses these requests to place pods on nodes with enough capacity. But picking the right values is hard.

Set requests too low and your pods get OOM-killed or CPU-throttled. Set them too high and you waste cluster resources. Nodes fill up with over-provisioned pods that only use a fraction of their allocated capacity.

Before VPA, you had two options:

  1. Guess conservatively high. Waste resources but avoid failures.
  2. Monitor and adjust manually. Watch metrics, update manifests, redeploy.

Neither scales well. The first wastes money. The second wastes time and requires constant vigilance as traffic patterns shift.

VPA automates the monitoring and adjustment. It watches your pods, learns their actual resource usage over time, and either recommends new values or applies them automatically.


VPA consists of three components:

The Recommender watches pod metrics and calculates resource recommendations. It runs a control loop every minute (configurable) and queries the metrics server for CPU and memory usage data.

It builds a histogram of resource usage over the past 8 days (by default). Each histogram bucket represents a range of CPU or memory usage. The Recommender updates these histograms continuously as new metrics arrive.

From this histogram, it computes four recommendation values:

  • Lower Bound: The minimum resources needed for the app to function. Below this, the pod might OOM or fail to handle requests.
  • Target: The recommended request value, taken from a high percentile of historical usage (90th for CPU, 95th for memory by default; configurable).
  • Uncapped Target: What VPA would recommend without any resource policy constraints.
  • Upper Bound: The maximum resources VPA considers reasonable. Prevents runaway recommendations.

The percentile-based approach ensures pods have enough resources to handle most usage patterns without over-provisioning for rare spikes.

The Updater evicts pods that need updated resource requests. It compares current pod requests to the Recommender’s target values. If the difference exceeds a threshold (default 10%), the Updater evicts the pod.

When a pod is evicted, the Deployment/ReplicaSet controller creates a new pod. The Admission Controller (below) mutates the new pod’s spec with the updated requests before it reaches the scheduler.

The Updater respects PodDisruptionBudgets (PDBs). If evicting a pod would violate a PDB, the Updater waits.
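For example, a PodDisruptionBudget like the following would guarantee at least one replica stays available while the Updater evicts pods one at a time (the name and the label selector are assumptions chosen to match the demo workload):

```yaml
# Hypothetical PDB for the demo workload; name and labels are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resource-consumer-pdb
  namespace: vpa-demo
spec:
  minAvailable: 1          # never evict below one ready replica
  selector:
    matchLabels:
      app: resource-consumer
```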

The Updater only runs in Auto and Recreate modes. In Off and Initial modes, it does nothing.

The Admission Controller is a mutating webhook. It intercepts pod creation requests and modifies the resource requests to match VPA recommendations.

When Kubernetes receives a request to create a pod (from a Deployment, StatefulSet, etc.), the API server calls the VPA Admission Controller webhook before persisting the pod. The webhook looks up any VPA targeting this pod’s owner and injects the recommended resource requests.
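You can check that the webhook is registered in the cluster (the configuration name below is the default used by the upstream VPA installer; your installation may use a different name):

```shell
kubectl get mutatingwebhookconfiguration vpa-webhook-config
```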

This happens transparently. The Deployment’s manifest still shows the original requests, but the running pods have the VPA’s values.

In Auto and Recreate modes, the Admission Controller works with the Updater. The Updater evicts pods with outdated requests, and the Admission Controller injects new values when replacements are created.

In Initial mode, the Admission Controller only runs for new pods. Existing pods keep their original requests until manually restarted.


VPA’s updatePolicy.updateMode field controls how aggressively it applies recommendations.

# From manifests/vpa.yaml
spec:
  updatePolicy:
    updateMode: "Off"

VPA only provides recommendations. It does not modify pods. Use this mode to see what VPA would suggest before committing to automatic updates.

Check recommendations with:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

You will see the four recommendation values (Lower Bound, Target, Uncapped Target, Upper Bound) but no pod changes.

This is the safest mode. It lets you validate VPA’s behavior without disrupting running workloads.

spec:
  updatePolicy:
    updateMode: "Initial"

VPA sets resource requests when pods are first created but does not update existing pods. The Admission Controller runs, the Updater does not.

Use this mode for StatefulSets or other workloads where pod restarts are expensive. New pods get optimized requests. Old pods keep their original values until you manually trigger a rollout.
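In Initial mode you can apply accumulated recommendations on your own schedule by forcing a rollout; the replacement pods pass through the Admission Controller and pick up the current Target (deployment name taken from the demo):

```shell
kubectl rollout restart deployment/resource-consumer -n vpa-demo
```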

spec:
  updatePolicy:
    updateMode: "Recreate"

VPA evicts and recreates pods to apply new recommendations. The Updater evicts pods one by one. The Admission Controller injects updated requests into replacements.

This is the legacy automatic mode. It works but causes pod restarts. For Deployments, use Auto instead.

# From manifests/vpa-auto.yaml
spec:
  updatePolicy:
    updateMode: "Auto"

VPA evicts and recreates pods, respecting PodDisruptionBudgets and using the least disruptive method available. This is the recommended mode for automatic updates.

Currently, Auto behaves identically to Recreate. Kubernetes has been adding support for in-place resource updates (resize without restart; the InPlacePodVerticalScaling feature gate, alpha since 1.27), and once VPA integrates with it, Auto will prefer the less disruptive in-place method.


# From manifests/vpa.yaml
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer

This tells VPA which workload to manage. VPA supports Deployments, StatefulSets, DaemonSets, ReplicaSets, and Jobs.

You can only attach one VPA to a given workload. Creating two VPAs targeting the same Deployment causes undefined behavior.
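Assembled from the fragments in this document, a complete minimal VPA object looks like this (apiVersion autoscaling.k8s.io/v1 is the VPA CRD's stable API group; names match the demo):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: resource-consumer-vpa
  namespace: vpa-demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  updatePolicy:
    updateMode: "Off"        # recommend only; no evictions
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 1000m
        memory: 512Mi
```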

# From manifests/vpa.yaml
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 1000m
        memory: 512Mi

Resource policies constrain VPA’s recommendations. Without constraints, VPA might recommend absurdly high or low values based on outlier metrics.

minAllowed: VPA will never recommend requests below this. Protects against under-provisioning.

maxAllowed: VPA will never recommend requests above this. Protects against runaway recommendations.

VPA clamps the Target recommendation to these bounds. The Uncapped Target shows what VPA would recommend without policies.

You can also specify controlledResources:

resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledResources: ["cpu"]
    minAllowed:
      cpu: 100m
    maxAllowed:
      cpu: 2000m

This tells VPA to only manage CPU, leaving memory requests unchanged. Useful if your app has predictable memory usage but variable CPU needs.

Default controlled resources are ["cpu", "memory"].

In multi-container pods, you can set per-container policies:

resourcePolicy:
  containerPolicies:
  - containerName: app
    minAllowed:
      cpu: 50m
      memory: 64Mi
    maxAllowed:
      cpu: 1000m
      memory: 512Mi
  - containerName: sidecar
    mode: "Off"

This manages the app container’s resources but leaves the sidecar container alone. The mode field can be Auto or Off per container.


VPA uses a histogram-based algorithm to track resource usage over time.

VPA does not keep all historical data forever. Older data decays exponentially. Recent usage has more weight than usage from a week ago.

The decay half-life is configurable (default 24 hours). This means data from 24 hours ago has half the weight of current data. Data from 48 hours ago has 1/4 weight.

This allows VPA to adapt to changing traffic patterns. If your app’s load profile shifts, VPA gradually forgets the old pattern and converges on new recommendations.
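The decay described above can be written as a weight applied to each usage sample, where t is the sample's age and the half-life defaults to 24 hours:

```latex
w(t) = 2^{-t / t_{1/2}}, \quad t_{1/2} = 24\,\mathrm{h}
\qquad\Rightarrow\qquad
w(24\,\mathrm{h}) = \tfrac{1}{2},\quad
w(48\,\mathrm{h}) = \tfrac{1}{4},\quad
w(72\,\mathrm{h}) = \tfrac{1}{8}
```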

VPA computes the Target recommendation as the 90th percentile of CPU usage and 95th percentile of memory usage (configurable via Recommender flags).

Why percentiles? A simple average under-provisions. If your app averages 100m CPU but occasionally spikes to 500m, an average-based recommendation would cause frequent throttling.

The 90th percentile ensures the pod has enough CPU for 90% of its historical usage samples. The remaining 10% might experience throttling, but this is usually acceptable.

Memory uses a higher percentile (95th) because memory exhaustion causes OOM kills, which are more disruptive than CPU throttling.

VPA calculates the Lower Bound as a lower percentile (50th by default) and adds a safety margin. This is the minimum the pod needs to run.

The Upper Bound is calculated as a higher percentile (95th by default) with a multiplier. This caps recommendations to prevent outlier spikes from causing excessive resource requests.

These bounds are separate from the minAllowed and maxAllowed policy constraints. VPA first computes its internal bounds, then applies policy constraints.


The Horizontal Pod Autoscaler adds or removes replicas. It scales horizontally. Use HPA when:

  • Your workload is stateless and can distribute load across multiple instances.
  • You need to handle variable traffic (web apps, APIs, background workers).
  • Adding more instances improves throughput.

HPA reacts quickly (15-second control loop) and does not require pod restarts.

VPA adjusts resource requests per pod. It scales vertically. Use VPA when:

  • Your workload cannot scale horizontally (single-instance databases, stateful apps).
  • The right resource request is hard to predict at deploy time.
  • You want to optimize resource utilization across many workloads without manual tuning.

VPA requires pod restarts in most modes (except future in-place resize support) and has a slower control loop (1-minute default).

Running VPA and HPA on the same workload is risky if both target the same metric.

Conflict scenario: VPA and HPA both react to CPU.

  1. HPA sees high CPU, scales from 1 to 3 replicas.
  2. VPA sees high CPU, increases CPU requests.
  3. Higher requests change the HPA’s utilization calculation (utilization = usage / request).
  4. HPA might scale down because utilization looks lower with higher requests.
  5. VPA might scale up again because fewer pods mean higher per-pod usage.
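A worked example with hypothetical numbers makes the feedback loop concrete. Suppose HPA targets 70% average CPU utilization, a pod uses 400m, and its request is 500m:

```latex
\text{utilization} = \frac{\text{usage}}{\text{request}}
= \frac{400\,\mathrm{m}}{500\,\mathrm{m}} = 80\%
\quad\Rightarrow\quad \text{HPA scales up}
```

If VPA then raises the request to 800m, the same 400m of usage reads as 400m / 800m = 50%, so HPA scales back down. Each controller's action invalidates the other's signal.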

Safe combinations:

  1. HPA on CPU, VPA in Off or Initial mode. VPA suggests memory adjustments but does not auto-apply them.
  2. HPA on custom metrics (like HTTP requests per second), VPA on CPU/memory. They target different dimensions.
  3. VPA controls CPU, HPA controls memory (rare). Requires careful testing.

The Kubernetes community is working on a combined autoscaler (Multidimensional Pod Autoscaler) to coordinate both dimensions, but it is not yet stable.

For now, pick one. Use HPA for stateless workloads that need fast scaling. Use VPA for stateful workloads or environments where you want set-and-forget resource optimization.


In Auto and Recreate modes, VPA evicts pods to apply new recommendations. Each eviction causes a pod restart.

For Deployments with multiple replicas, VPA evicts one pod at a time, respecting PodDisruptionBudgets. The workload stays available.

For StatefulSets with a single replica (like a database), evicting the pod causes downtime. Use Off or Initial mode and schedule updates during maintenance windows.

VPA does not update pods on every metric change. It uses a threshold (default 10% difference between current and recommended requests) to avoid constant churn.

If your app’s resource usage is stable, VPA might never update it. If usage fluctuates, VPA updates gradually as the histogram shifts.
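As a hypothetical illustration of the 10% threshold, assuming the difference is measured relative to the current request:

```latex
\frac{|108\,\mathrm{m} - 100\,\mathrm{m}|}{100\,\mathrm{m}} = 8\% < 10\%
\;\Rightarrow\; \text{no eviction}
\qquad
\frac{|115\,\mathrm{m} - 100\,\mathrm{m}|}{100\,\mathrm{m}} = 15\% > 10\%
\;\Rightarrow\; \text{evict}
```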

If a pod OOMs, VPA learns from it. The next recommendation includes the OOM event as a data point. VPA increases the memory recommendation to prevent future OOMs.

However, VPA cannot prevent the initial OOM. If you deploy a new workload with very low memory requests, it might OOM before VPA gathers enough data. Set reasonable initial requests.

Choose minAllowed and maxAllowed based on your cluster capacity and cost tolerance.

Example for a web app on a cost-sensitive cluster:

resourcePolicy:
  containerPolicies:
  - containerName: app
    minAllowed:
      cpu: 10m
      memory: 32Mi
    maxAllowed:
      cpu: 500m
      memory: 1Gi

This prevents VPA from recommending more than 500m CPU or 1Gi memory per pod. If the app needs more, you manually adjust the policy.

VPA recommendations are cluster-specific. If you run the same app in dev and prod clusters, VPA might recommend different values based on traffic patterns.

Do not copy VPA recommendations across clusters. Let each cluster’s VPA learn from its own metrics.

Check VPA status with:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

Look for:

  • Recommendation: Current Target values.
  • Conditions: Errors or warnings (like missing metrics).

You can also export VPA metrics to Prometheus. The VPA Recommender exposes metrics at /metrics, including recommendation values and update counts.


As discussed above, running VPA and HPA against the same metric causes conflicts. VPA changes requests, HPA recalculates utilization, and the two controllers fight.

Symptom: Replicas and resource requests oscillate. Pods constantly restart.

Fix: Pick one. Use HPA on a custom metric if you need both.

JVM apps allocate a heap at startup based on available memory. If VPA increases memory requests after the JVM starts, the heap size does not change. The extra memory goes unused.

Fix: Configure the JVM to respect container limits. Use -XX:+UseContainerSupport (Java 8u191+) so the JVM dynamically adjusts heap size based on cgroup limits. Even better, configure the JVM to use a percentage of available memory:

-XX:MaxRAMPercentage=75.0

This way, when VPA increases memory, the JVM uses it.
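One way to wire this in without rebuilding the image is the standard JAVA_TOOL_OPTIONS environment variable, which the JVM reads at startup (the container name and image below are illustrative):

```yaml
containers:
- name: app
  image: my-java-app:latest          # hypothetical image
  env:
  - name: JAVA_TOOL_OPTIONS          # picked up by the JVM at launch
    value: "-XX:MaxRAMPercentage=75.0"
```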

VPA is designed for long-running workloads. It needs 8 days of data (by default) to build accurate histograms.

Batch jobs that run for minutes or hours do not give VPA enough data. VPA either makes no recommendations or bases them on too few samples.

Fix: Use Off mode for jobs. Set requests based on test runs or historical job metrics outside of VPA.

VPA depends on the metrics server for CPU and memory data. If the metrics server is not running, VPA has no input.

Check with:

kubectl top pods -n vpa-demo

If this fails, install the metrics server:

minikube addons enable metrics-server

VPA can evict StatefulSet pods, but StatefulSets roll out updates sequentially by default. If you have a 10-replica StatefulSet and VPA evicts a pod, the StatefulSet waits for that pod to become ready before moving to the next.

This makes VPA updates slow for large StatefulSets. Consider using Initial mode instead, and manually trigger updates.

VPA primarily manages resource requests. Whether it also adjusts limits depends on the container policy’s controlledValues field.

If your pod has requests.cpu: 100m and limits.cpu: 200m (a 2:1 limit-to-request ratio) and VPA raises the request to 300m, the default controlledValues: RequestsAndLimits behavior scales the limit proportionally to 600m, preserving the ratio. Setting controlledValues: RequestsOnly leaves limits untouched, which keeps limits predictable but means a fixed limit can cap the headroom a growing recommendation is trying to provide.

Alternatively, remove limits entirely and rely only on requests. Many production clusters do this to avoid CPU throttling.
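If you do keep limits, the container policy’s controlledValues field decides how VPA treats them: RequestsAndLimits (the default) scales limits proportionally with requests, while RequestsOnly leaves limits alone:

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledValues: "RequestsOnly"   # VPA adjusts requests; limits stay fixed
```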


The demo’s deployment starts with very low requests:

# From manifests/deployment.yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 500m
    memory: 256Mi

But the stress container actually allocates about 100Mi of memory and pegs one CPU core: with cpu: 1 in the stress args, the container tries to consume a full core (1000m).

The pod gets throttled at 500m (the limit) and uses far more than the 50m request. VPA sees this and recommends higher values.

After a few minutes in Off mode:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

You might see:

Recommendation:
  Container Recommendations:
    Container Name:  app
    Lower Bound:
      Cpu:     100m
      Memory:  128Mi
    Target:
      Cpu:     250m
      Memory:  140Mi
    Uncapped Target:
      Cpu:     250m
      Memory:  140Mi
    Upper Bound:
      Cpu:     500m
      Memory:  256Mi

VPA learned that the pod needs more than 50m CPU and 64Mi memory. The Target recommendation suggests 250m and 140Mi.

Switching to Auto mode:

kubectl delete vpa resource-consumer-vpa -n vpa-demo
kubectl apply -f manifests/vpa-auto.yaml

VPA evicts the pods. The Admission Controller injects the new requests. Check the running pod:

kubectl get pod <pod-name> -n vpa-demo -o jsonpath='{.spec.containers[0].resources}'

You see requests close to the Target recommendation.


  1. Metrics Collection: VPA Recommender queries the metrics server every 60 seconds.
  2. Histogram Update: New CPU and memory samples are added to the histogram. Old samples decay.
  3. Recommendation Calculation: VPA computes Lower Bound, Target, Uncapped Target, and Upper Bound from histogram percentiles.
  4. Policy Application: VPA clamps Target to minAllowed and maxAllowed.
  5. Updater Decision: VPA Updater checks if current pod requests differ from Target by more than 10%.
  6. Eviction: If yes, Updater evicts the pod (in Auto or Recreate mode).
  7. Admission: When the replacement pod is created, Admission Controller injects the new requests.
  8. Repeat: The cycle continues. As usage patterns change, recommendations adapt.

The entire loop is slower than HPA (minutes vs seconds) because VPA needs more data to make stable decisions. Resource requests are expensive to change (pod restart), so VPA waits for high confidence.


The demo walks through VPA’s lifecycle:

  1. Deploy: A stress workload with low CPU and memory requests.
  2. VPA in Off mode: Watch recommendations without changes. VPA suggests higher values.
  3. Switch to Auto mode: VPA evicts and recreates pods with updated requests.
  4. Verify: Compare old and new requests. Check that recommendations respect resource policies.
  5. Compare to HPA: Understand when to use VPA (right-sizing) vs HPA (scaling out).

The demo uses a stress container to exaggerate the difference between initial requests and actual usage. In production, the gap is usually smaller, but VPA still helps optimize thousands of pods across a cluster.