Vertical Pod Autoscaler: Deep Dive

This document explains how the Vertical Pod Autoscaler calculates resource recommendations, why it requires pod restarts in most modes, and when to use VPA versus HPA. It connects the demo manifests to VPA’s internal architecture and production autoscaling patterns.


Kubernetes requires you to set resource requests on your containers. The scheduler uses these requests to place pods on nodes with enough capacity. But picking the right values is hard.

Set requests too low and your pods get OOM-killed or CPU-throttled. Set them too high and you waste cluster resources. Nodes fill up with over-provisioned pods that only use a fraction of their allocated capacity.

Before VPA, you had two options:

  1. Guess conservatively high. Waste resources but avoid failures.
  2. Monitor and adjust manually. Watch metrics, update manifests, redeploy.

Neither scales well. The first wastes money. The second wastes time and requires constant vigilance as traffic patterns shift.

VPA automates the monitoring and adjustment. It watches your pods, learns their actual resource usage over time, and either recommends new values or applies them automatically.


VPA consists of three components:

The Recommender watches pod metrics and calculates resource recommendations. It runs a control loop every minute (configurable) and queries the metrics server for CPU and memory usage data.

It builds a histogram of resource usage over the past 8 days (by default). Each histogram bucket represents a range of CPU or memory usage. The Recommender updates these histograms continuously as new metrics arrive.

From this histogram, it computes four recommendation values:

  • Lower Bound: The minimum resources needed for the app to function. Below this, the pod might OOM or fail to handle requests.
  • Target: The recommended request value, taken from a high percentile of historical usage (90th for CPU, 95th for memory by default; configurable).
  • Uncapped Target: What VPA would recommend without any resource policy constraints.
  • Upper Bound: The maximum resources VPA considers reasonable. Prevents runaway recommendations.

The percentile-based approach ensures pods have enough resources to handle most usage patterns without over-provisioning for rare spikes.

The Updater evicts pods that need updated resource requests. It compares current pod requests to the Recommender’s target values. If the difference exceeds a threshold (default 10%), the Updater evicts the pod.

When a pod is evicted, the Deployment/ReplicaSet controller creates a new pod. The Admission Controller (below) mutates the new pod’s spec with the updated requests before it reaches the scheduler.

The Updater respects PodDisruptionBudgets (PDBs). If evicting a pod would violate a PDB, the Updater waits.
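For example, a PodDisruptionBudget like the following would guarantee at least one replica stays available while the Updater evicts pods one at a time (the name and the label selector are assumptions chosen to match the demo workload):

```yaml
# Hypothetical PDB for the demo workload; name and labels are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resource-consumer-pdb
  namespace: vpa-demo
spec:
  minAvailable: 1          # never evict below one ready replica
  selector:
    matchLabels:
      app: resource-consumer
```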

The Updater only runs in Auto and Recreate modes. In Off and Initial modes, it does nothing.

The Admission Controller is a mutating webhook. It intercepts pod creation requests and modifies the resource requests to match VPA recommendations.

When Kubernetes receives a request to create a pod (from a Deployment, StatefulSet, etc.), the API server calls the VPA Admission Controller webhook before persisting the pod. The webhook looks up any VPA targeting this pod’s owner and injects the recommended resource requests.
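You can check that the webhook is registered in the cluster (the configuration name below is the default used by the upstream VPA installer; your installation may use a different name):

```shell
kubectl get mutatingwebhookconfiguration vpa-webhook-config
```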

This happens transparently. The Deployment’s manifest still shows the original requests, but the running pods have the VPA’s values.

In Auto and Recreate modes, the Admission Controller works with the Updater. The Updater evicts pods with outdated requests, and the Admission Controller injects new values when replacements are created.

In Initial mode, the Admission Controller only runs for new pods. Existing pods keep their original requests until manually restarted.


VPA’s updatePolicy.updateMode field controls how aggressively it applies recommendations.

# From manifests/vpa.yaml
spec:
  updatePolicy:
    updateMode: "Off"

VPA only provides recommendations. It does not modify pods. Use this mode to see what VPA would suggest before committing to automatic updates.

Check recommendations with:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

You will see the four recommendation values (Lower Bound, Target, Uncapped Target, Upper Bound) but no pod changes.

This is the safest mode. It lets you validate VPA’s behavior without disrupting running workloads.

spec:
  updatePolicy:
    updateMode: "Initial"

VPA sets resource requests when pods are first created but does not update existing pods. The Admission Controller runs, the Updater does not.

Use this mode for StatefulSets or other workloads where pod restarts are expensive. New pods get optimized requests. Old pods keep their original values until you manually trigger a rollout.
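In Initial mode you can apply accumulated recommendations on your own schedule by forcing a rollout; the replacement pods pass through the Admission Controller and pick up the current Target (deployment name taken from the demo):

```shell
kubectl rollout restart deployment/resource-consumer -n vpa-demo
```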

spec:
  updatePolicy:
    updateMode: "Recreate"

VPA evicts and recreates pods to apply new recommendations. The Updater evicts pods one by one. The Admission Controller injects updated requests into replacements.

This is the legacy automatic mode. It works but causes pod restarts. For Deployments, use Auto instead.

# From manifests/vpa-auto.yaml
spec:
  updatePolicy:
    updateMode: "Auto"

VPA evicts and recreates pods, respecting PodDisruptionBudgets and using the least disruptive method available. This is the recommended mode for automatic updates.

Currently, Auto behaves identically to Recreate. Kubernetes has been adding support for in-place resource updates (resize without restart; the InPlacePodVerticalScaling feature gate, alpha since 1.27), and once VPA integrates with it, Auto will prefer the less disruptive in-place method.


# From manifests/vpa.yaml
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer

This tells VPA which workload to manage. VPA supports Deployments, StatefulSets, DaemonSets, ReplicaSets, and Jobs.

You can only attach one VPA to a given workload. Creating two VPAs targeting the same Deployment causes undefined behavior.
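Assembled from the fragments in this document, a complete minimal VPA object looks like this (apiVersion autoscaling.k8s.io/v1 is the VPA CRD's stable API group; names match the demo):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: resource-consumer-vpa
  namespace: vpa-demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  updatePolicy:
    updateMode: "Off"        # recommend only; no evictions
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 1000m
        memory: 512Mi
```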

# From manifests/vpa.yaml
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 1000m
        memory: 512Mi

Resource policies constrain VPA’s recommendations. Without constraints, VPA might recommend absurdly high or low values based on outlier metrics.

minAllowed: VPA will never recommend requests below this. Protects against under-provisioning.

maxAllowed: VPA will never recommend requests above this. Protects against runaway recommendations.

VPA clamps the Target recommendation to these bounds. The Uncapped Target shows what VPA would recommend without policies.

You can also specify controlledResources:

resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledResources: ["cpu"]
    minAllowed:
      cpu: 100m
    maxAllowed:
      cpu: 2000m

This tells VPA to only manage CPU, leaving memory requests unchanged. Useful if your app has predictable memory usage but variable CPU needs.

Default controlled resources are ["cpu", "memory"].

In multi-container pods, you can set per-container policies:

resourcePolicy:
  containerPolicies:
  - containerName: app
    minAllowed:
      cpu: 50m
      memory: 64Mi
    maxAllowed:
      cpu: 1000m
      memory: 512Mi
  - containerName: sidecar
    mode: "Off"

This manages the app container’s resources but leaves the sidecar container alone. The mode field can be Auto or Off per container.


VPA uses a histogram-based algorithm to track resource usage over time.

VPA does not keep all historical data forever. Older data decays exponentially. Recent usage has more weight than usage from a week ago.

The decay half-life is configurable (default 24 hours). This means data from 24 hours ago has half the weight of current data. Data from 48 hours ago has 1/4 weight.

This allows VPA to adapt to changing traffic patterns. If your app’s load profile shifts, VPA gradually forgets the old pattern and converges on new recommendations.
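The decay described above can be written as a weight applied to each usage sample, where t is the sample's age and the half-life defaults to 24 hours:

```latex
w(t) = 2^{-t / t_{1/2}}, \quad t_{1/2} = 24\,\mathrm{h}
\qquad\Rightarrow\qquad
w(24\,\mathrm{h}) = \tfrac{1}{2},\quad
w(48\,\mathrm{h}) = \tfrac{1}{4},\quad
w(72\,\mathrm{h}) = \tfrac{1}{8}
```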

VPA computes the Target recommendation as the 90th percentile of CPU usage and 95th percentile of memory usage (configurable via Recommender flags).

Why percentiles? A simple average under-provisions. If your app averages 100m CPU but occasionally spikes to 500m, an average-based recommendation would cause frequent throttling.

The 90th percentile ensures the pod has enough CPU for 90% of its historical usage samples. The remaining 10% might experience throttling, but this is usually acceptable.

Memory uses a higher percentile (95th) because memory exhaustion causes OOM kills, which are more disruptive than CPU throttling.

VPA calculates the Lower Bound as a lower percentile (50th by default) and adds a safety margin. This is the minimum the pod needs to run.

The Upper Bound is calculated as a higher percentile (95th by default) with a multiplier. This caps recommendations to prevent outlier spikes from causing excessive resource requests.

These bounds are separate from the minAllowed and maxAllowed policy constraints. VPA first computes its internal bounds, then applies policy constraints.


The Horizontal Pod Autoscaler adds or removes replicas. It scales horizontally. Use HPA when:

  • Your workload is stateless and can distribute load across multiple instances.
  • You need to handle variable traffic (web apps, APIs, background workers).
  • Adding more instances improves throughput.

HPA reacts quickly (15-second control loop) and does not require pod restarts.

VPA adjusts resource requests per pod. It scales vertically. Use VPA when:

  • Your workload cannot scale horizontally (single-instance databases, stateful apps).
  • The right resource request is hard to predict at deploy time.
  • You want to optimize resource utilization across many workloads without manual tuning.

VPA requires pod restarts in most modes (except future in-place resize support) and has a slower control loop (1-minute default).

Running VPA and HPA on the same workload is risky if both target the same metric.

Conflict scenario: VPA and HPA both react to CPU.

  1. HPA sees high CPU, scales from 1 to 3 replicas.
  2. VPA sees high CPU, increases CPU requests.
  3. Higher requests change the HPA’s utilization calculation (utilization = usage / request).
  4. HPA might scale down because utilization looks lower with higher requests.
  5. VPA might scale up again because fewer pods mean higher per-pod usage.
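A worked example with hypothetical numbers makes the feedback loop concrete. Suppose HPA targets 70% average CPU utilization, a pod uses 400m, and its request is 500m:

```latex
\text{utilization} = \frac{\text{usage}}{\text{request}}
= \frac{400\,\mathrm{m}}{500\,\mathrm{m}} = 80\%
\quad\Rightarrow\quad \text{HPA scales up}
```

If VPA then raises the request to 800m, the same 400m of usage reads as 400m / 800m = 50%, so HPA scales back down. Each controller's action invalidates the other's signal.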

Safe combinations:

  1. HPA on CPU, VPA in Off or Initial mode. VPA suggests memory adjustments but does not auto-apply them.
  2. HPA on custom metrics (like HTTP requests per second), VPA on CPU/memory. They target different dimensions.
  3. VPA controls CPU, HPA controls memory (rare). Requires careful testing.

The Kubernetes community is working on a combined autoscaler (Multidimensional Pod Autoscaler) to coordinate both dimensions, but it is not yet stable.

For now, pick one. Use HPA for stateless workloads that need fast scaling. Use VPA for stateful workloads or environments where you want set-and-forget resource optimization.


In Auto and Recreate modes, VPA evicts pods to apply new recommendations. Each eviction causes a pod restart.

For Deployments with multiple replicas, VPA evicts one pod at a time, respecting PodDisruptionBudgets. The workload stays available.

For StatefulSets with a single replica (like a database), evicting the pod causes downtime. Use Off or Initial mode and schedule updates during maintenance windows.

VPA does not update pods on every metric change. It uses a threshold (default 10% difference between current and recommended requests) to avoid constant churn.

If your app’s resource usage is stable, VPA might never update it. If usage fluctuates, VPA updates gradually as the histogram shifts.
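As a hypothetical illustration of the 10% threshold, assuming the difference is measured relative to the current request:

```latex
\frac{|108\,\mathrm{m} - 100\,\mathrm{m}|}{100\,\mathrm{m}} = 8\% < 10\%
\;\Rightarrow\; \text{no eviction}
\qquad
\frac{|115\,\mathrm{m} - 100\,\mathrm{m}|}{100\,\mathrm{m}} = 15\% > 10\%
\;\Rightarrow\; \text{evict}
```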

If a pod OOMs, VPA learns from it. The next recommendation includes the OOM event as a data point. VPA increases the memory recommendation to prevent future OOMs.

However, VPA cannot prevent the initial OOM. If you deploy a new workload with very low memory requests, it might OOM before VPA gathers enough data. Set reasonable initial requests.

Choose minAllowed and maxAllowed based on your cluster capacity and cost tolerance.

Example for a web app on a cost-sensitive cluster:

resourcePolicy:
  containerPolicies:
  - containerName: app
    minAllowed:
      cpu: 10m
      memory: 32Mi
    maxAllowed:
      cpu: 500m
      memory: 1Gi

This prevents VPA from recommending more than 500m CPU or 1Gi memory per pod. If the app needs more, you manually adjust the policy.

VPA recommendations are cluster-specific. If you run the same app in dev and prod clusters, VPA might recommend different values based on traffic patterns.

Do not copy VPA recommendations across clusters. Let each cluster’s VPA learn from its own metrics.

Check VPA status with:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

Look for:

  • Recommendation: Current Target values.
  • Conditions: Errors or warnings (like missing metrics).

You can also export VPA metrics to Prometheus. The VPA Recommender exposes metrics at /metrics, including recommendation values and update counts.


As discussed above, running VPA and HPA against the same metric causes conflicts. VPA changes requests, HPA recalculates utilization, and the two controllers fight.

Symptom: Replicas and resource requests oscillate. Pods constantly restart.

Fix: Pick one. Use HPA on a custom metric if you need both.

JVM apps allocate a heap at startup based on available memory. If VPA increases memory requests after the JVM starts, the heap size does not change. The extra memory goes unused.

Fix: Configure the JVM to respect container limits. Use -XX:+UseContainerSupport (Java 8u191+) so the JVM dynamically adjusts heap size based on cgroup limits. Even better, configure the JVM to use a percentage of available memory:

-XX:MaxRAMPercentage=75.0

This way, when VPA increases memory, the JVM uses it.
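One way to wire this in without rebuilding the image is the standard JAVA_TOOL_OPTIONS environment variable, which the JVM reads at startup (the container name and image below are illustrative):

```yaml
containers:
- name: app
  image: my-java-app:latest          # hypothetical image
  env:
  - name: JAVA_TOOL_OPTIONS          # picked up by the JVM at launch
    value: "-XX:MaxRAMPercentage=75.0"
```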

VPA is designed for long-running workloads. It needs 8 days of data (by default) to build accurate histograms.

Batch jobs that run for minutes or hours do not give VPA enough data. VPA either makes no recommendations or bases them on too few samples.

Fix: Use Off mode for jobs. Set requests based on test runs or historical job metrics outside of VPA.

VPA depends on the metrics server for CPU and memory data. If the metrics server is not running, VPA has no input.

Check with:

kubectl top pods -n vpa-demo

If this fails, install the metrics server:

minikube addons enable metrics-server

VPA can evict StatefulSet pods, but StatefulSets roll out updates sequentially by default. If you have a 10-replica StatefulSet and VPA evicts a pod, the StatefulSet waits for that pod to become ready before moving to the next.

This makes VPA updates slow for large StatefulSets. Consider using Initial mode instead, and manually trigger updates.

VPA primarily manages resource requests. Whether it also adjusts limits depends on the container policy’s controlledValues field.

If your pod has requests.cpu: 100m and limits.cpu: 200m (a 2:1 limit-to-request ratio) and VPA raises the request to 300m, the default controlledValues: RequestsAndLimits behavior scales the limit proportionally to 600m, preserving the ratio. Setting controlledValues: RequestsOnly leaves limits untouched, which keeps limits predictable but means a fixed limit can cap the headroom a growing recommendation is trying to provide.

Alternatively, remove limits entirely and rely only on requests. Many production clusters do this to avoid CPU throttling.
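If you do keep limits, the container policy’s controlledValues field decides how VPA treats them: RequestsAndLimits (the default) scales limits proportionally with requests, while RequestsOnly leaves limits alone:

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledValues: "RequestsOnly"   # VPA adjusts requests; limits stay fixed
```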


The demo’s deployment starts with very low requests:

# From manifests/deployment.yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 500m
    memory: 256Mi

But the stress container actually allocates about 100Mi of memory and pegs one CPU core: with cpu: 1 in the stress args, the container tries to consume a full core (1000m).

The pod gets throttled at 500m (the limit) and uses far more than the 50m request. VPA sees this and recommends higher values.

After a few minutes in Off mode:

kubectl describe vpa resource-consumer-vpa -n vpa-demo

You might see:

Recommendation:
  Container Recommendations:
    Container Name:  app
    Lower Bound:
      Cpu:     100m
      Memory:  128Mi
    Target:
      Cpu:     250m
      Memory:  140Mi
    Uncapped Target:
      Cpu:     250m
      Memory:  140Mi
    Upper Bound:
      Cpu:     500m
      Memory:  256Mi

VPA learned that the pod needs more than 50m CPU and 64Mi memory. The Target recommendation suggests 250m and 140Mi.

Switching to Auto mode:

kubectl delete vpa resource-consumer-vpa -n vpa-demo
kubectl apply -f manifests/vpa-auto.yaml

VPA evicts the pods. The Admission Controller injects the new requests. Check the running pod:

kubectl get pod <pod-name> -n vpa-demo -o jsonpath='{.spec.containers[0].resources}'

You see requests close to the Target recommendation.


  1. Metrics Collection: VPA Recommender queries the metrics server every 60 seconds.
  2. Histogram Update: New CPU and memory samples are added to the histogram. Old samples decay.
  3. Recommendation Calculation: VPA computes Lower Bound, Target, Uncapped Target, and Upper Bound from histogram percentiles.
  4. Policy Application: VPA clamps Target to minAllowed and maxAllowed.
  5. Updater Decision: VPA Updater checks if current pod requests differ from Target by more than 10%.
  6. Eviction: If yes, Updater evicts the pod (in Auto or Recreate mode).
  7. Admission: When the replacement pod is created, Admission Controller injects the new requests.
  8. Repeat: The cycle continues. As usage patterns change, recommendations adapt.

The entire loop is slower than HPA (minutes vs seconds) because VPA needs more data to make stable decisions. Resource requests are expensive to change (pod restart), so VPA waits for high confidence.


The demo walks through VPA’s lifecycle:

  1. Deploy: A stress workload with low CPU and memory requests.
  2. VPA in Off mode: Watch recommendations without changes. VPA suggests higher values.
  3. Switch to Auto mode: VPA evicts and recreates pods with updated requests.
  4. Verify: Compare old and new requests. Check that recommendations respect resource policies.
  5. Compare to HPA: Understand when to use VPA (right-sizing) vs HPA (scaling out).

The demo uses a stress container to exaggerate the difference between initial requests and actual usage. In production, the gap is usually smaller, but VPA still helps optimize thousands of pods across a cluster.