Vertical Pod Autoscaler: Deep Dive
This document explains how the Vertical Pod Autoscaler calculates resource recommendations, why it requires pod restarts in most modes, and when to use VPA versus HPA. It connects the demo manifests to VPA’s internal architecture and production autoscaling patterns.
Why VPA Exists
Kubernetes requires you to set resource requests on your containers. The scheduler uses these requests to place pods on nodes with enough capacity. But picking the right values is hard.
Set requests too low and your pods get OOM-killed or CPU-throttled. Set them too high and you waste cluster resources. Nodes fill up with over-provisioned pods that only use a fraction of their allocated capacity.
Before VPA, you had two options:
- Guess conservatively high. Waste resources but avoid failures.
- Monitor and adjust manually. Watch metrics, update manifests, redeploy.
Neither scales well. The first wastes money. The second wastes time and requires constant vigilance as traffic patterns shift.
VPA automates the monitoring and adjustment. It watches your pods, learns their actual resource usage over time, and either recommends new values or applies them automatically.
How VPA Works
VPA consists of three components:
1. Recommender
The Recommender watches pod metrics and calculates resource recommendations. It runs a control loop every minute (configurable) and queries the metrics server for CPU and memory usage data.
It builds a histogram of resource usage over the past 8 days (by default). Each histogram bucket represents a range of CPU or memory usage. The Recommender updates these histograms continuously as new metrics arrive.
From this histogram, it computes four recommendation values:
- Lower Bound: The minimum resources needed for the app to function. Below this, the pod might OOM or fail to handle requests.
- Target: The recommended request value. Calculated as the 90th percentile of historical usage (configurable).
- Uncapped Target: What VPA would recommend without any resource policy constraints.
- Upper Bound: The maximum resources VPA considers reasonable. Prevents runaway recommendations.
The percentile-based approach ensures pods have enough resources to handle most usage patterns without over-provisioning for rare spikes.
2. Updater
The Updater evicts pods that need updated resource requests. It compares current pod requests to the Recommender’s target values. If the difference exceeds a threshold (default 10%), the Updater evicts the pod.
When a pod is evicted, the Deployment/ReplicaSet controller creates a new pod. The Admission Controller (below) mutates the new pod’s spec with the updated requests before it reaches the scheduler.
The Updater respects PodDisruptionBudgets (PDBs). If evicting a pod would violate a PDB, the Updater waits.
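For example, a PodDisruptionBudget like this hypothetical manifest (the name and labels are assumptions, not part of the demo) would stop the Updater from evicting a pod whenever fewer than one replica would remain available:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resource-consumer-pdb
spec:
  minAvailable: 1          # never let evictions drop availability below 1 pod
  selector:
    matchLabels:
      app: resource-consumer   # assumed label; match your Deployment's pod labels
```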
The Updater only runs in Auto and Recreate modes. In Off and Initial modes, it does nothing.
3. Admission Controller
The Admission Controller is a mutating webhook. It intercepts pod creation requests and modifies the resource requests to match VPA recommendations.
When Kubernetes receives a request to create a pod (from a Deployment, StatefulSet, etc.), the API server calls the VPA Admission Controller webhook before persisting the pod. The webhook looks up any VPA targeting this pod’s owner and injects the recommended resource requests.
This happens transparently. The Deployment’s manifest still shows the original requests, but the running pods have the VPA’s values.
In Auto and Recreate modes, the Admission Controller works with the Updater. The Updater evicts pods with outdated requests, and the Admission Controller injects new values when replacements are created.
In Initial mode, the Admission Controller only runs for new pods. Existing pods keep their original requests until manually restarted.
The Four Update Modes
VPA’s updateMode field controls how aggressively it applies recommendations.
Off

```yaml
# From manifests/vpa.yaml
spec:
  updatePolicy:
    updateMode: "Off"
```

VPA only provides recommendations. It does not modify pods. Use this mode to see what VPA would suggest before committing to automatic updates.
Check recommendations with:
```sh
kubectl describe vpa resource-consumer-vpa -n vpa-demo
```

You will see the four recommendation values (Lower Bound, Target, Uncapped Target, Upper Bound) but no pod changes.
This is the safest mode. It lets you validate VPA’s behavior without disrupting running workloads.
Initial
```yaml
spec:
  updatePolicy:
    updateMode: "Initial"
```

VPA sets resource requests when pods are first created but does not update existing pods. The Admission Controller runs; the Updater does not.
Use this mode for StatefulSets or other workloads where pod restarts are expensive. New pods get optimized requests. Old pods keep their original values until you manually trigger a rollout.
Recreate
```yaml
spec:
  updatePolicy:
    updateMode: "Recreate"
```

VPA evicts and recreates pods to apply new recommendations. The Updater evicts pods one by one. The Admission Controller injects updated requests into replacements.
This is the legacy automatic mode. It works but causes pod restarts. For Deployments, use Auto instead.
Auto

```yaml
# From manifests/vpa-auto.yaml
spec:
  updatePolicy:
    updateMode: "Auto"
```

VPA evicts and recreates pods, respecting PodDisruptionBudgets and using the least disruptive method available. This is the recommended mode for automatic updates.
Currently, Auto behaves identically to Recreate for Deployments. Future Kubernetes versions may support in-place resource updates (resize without restart), at which point Auto will prefer the less disruptive in-place method.
Key Configuration Fields
Target Reference
```yaml
# From manifests/vpa.yaml
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
```

This tells VPA which workload to manage. VPA supports Deployments, StatefulSets, DaemonSets, ReplicaSets, and Jobs.
You can only attach one VPA to a given workload. Creating two VPAs targeting the same Deployment causes undefined behavior.
Resource Policy
```yaml
# From manifests/vpa.yaml
spec:
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 1000m
          memory: 512Mi
```

Resource policies constrain VPA’s recommendations. Without constraints, VPA might recommend absurdly high or low values based on outlier metrics.
minAllowed: VPA will never recommend requests below this. Protects against under-provisioning.
maxAllowed: VPA will never recommend requests above this. Protects against runaway recommendations.
VPA clamps the Target recommendation to these bounds. The Uncapped Target shows what VPA would recommend without policies.
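The clamping is plain min/max arithmetic; a minimal sketch (an illustration, not VPA source code, with values in millicores):

```python
# Sketch: how a resource policy clamps the uncapped target recommendation.
def clamp_to_policy(uncapped_target, min_allowed, max_allowed):
    return max(min_allowed, min(uncapped_target, max_allowed))

print(clamp_to_policy(1500, 50, 1000))  # 1000: capped by maxAllowed
print(clamp_to_policy(30, 50, 1000))    # 50: raised to minAllowed
```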
You can also specify controlledResources:
```yaml
resourcePolicy:
  containerPolicies:
    - containerName: app
      controlledResources: ["cpu"]
      minAllowed:
        cpu: 100m
      maxAllowed:
        cpu: 2000m
```

This tells VPA to only manage CPU, leaving memory requests unchanged. Useful if your app has predictable memory usage but variable CPU needs.
Default controlled resources are ["cpu", "memory"].
Container Policies
In multi-container pods, you can set per-container policies:
```yaml
resourcePolicy:
  containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 1000m
        memory: 512Mi
    - containerName: sidecar
      mode: "Off"
```

This manages the app container’s resources but leaves the sidecar container alone. The mode field can be Auto or Off per container.
How Recommendations Are Calculated
VPA uses a histogram-based algorithm to track resource usage over time.
Histogram Decay
VPA does not keep all historical data forever. Older data decays exponentially. Recent usage has more weight than usage from a week ago.
The decay half-life is configurable (default 24 hours). This means data from 24 hours ago has half the weight of current data. Data from 48 hours ago has 1/4 weight.
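The weighting can be sketched as a half-life formula (an illustration of the idea, not VPA’s actual implementation):

```python
# Weight of a usage sample as a function of its age, assuming the
# default 24-hour half-life described above.
def sample_weight(age_hours, half_life_hours=24.0):
    return 0.5 ** (age_hours / half_life_hours)

print(sample_weight(0))   # 1.0  (current data, full weight)
print(sample_weight(24))  # 0.5  (half weight after one half-life)
print(sample_weight(48))  # 0.25 (quarter weight after two)
```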
This allows VPA to adapt to changing traffic patterns. If your app’s load profile shifts, VPA gradually forgets the old pattern and converges on new recommendations.
Percentile-Based Targets
VPA computes the Target recommendation as the 90th percentile of CPU usage and 95th percentile of memory usage (configurable via Recommender flags).
Why percentiles? A simple average under-provisions. If your app averages 100m CPU but occasionally spikes to 500m, an average-based recommendation would cause frequent throttling.
The 90th percentile ensures the pod has enough CPU for 90% of its historical usage samples. The remaining 10% might experience throttling, but this is usually acceptable.
Memory uses a higher percentile (95th) because memory exhaustion causes OOM kills, which are more disruptive than CPU throttling.
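A toy comparison shows why the percentile beats the average. This uses a simple nearest-rank percentile over hypothetical samples, not VPA’s weighted histogram:

```python
def percentile(samples, p):
    # Nearest-rank percentile over a sorted copy of the samples.
    s = sorted(samples)
    return s[round(p / 100 * (len(s) - 1))]

# Hypothetical CPU samples in millicores: steady ~100m with one 500m spike.
cpu = [90, 95, 98, 100, 100, 105, 110, 115, 120, 500]
print(sum(cpu) / len(cpu))  # the mean is inflated by the single spike
print(percentile(cpu, 90))  # 120: the 90th percentile ignores the rare spike
```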
Lower and Upper Bounds
VPA calculates the Lower Bound as a lower percentile (50th by default) and adds a safety margin. This is the minimum the pod needs to run.
The Upper Bound is calculated as a higher percentile (95th by default) with a multiplier. This caps recommendations to prevent outlier spikes from causing excessive resource requests.
These bounds are separate from the minAllowed and maxAllowed policy constraints. VPA first computes its internal bounds, then applies policy constraints.
VPA vs HPA
Section titled “VPA vs HPA”When to Use HPA
The Horizontal Pod Autoscaler adds or removes replicas. It scales horizontally. Use HPA when:
- Your workload is stateless and can distribute load across multiple instances.
- You need to handle variable traffic (web apps, APIs, background workers).
- Adding more instances improves throughput.
HPA reacts quickly (15-second control loop) and does not require pod restarts.
When to Use VPA
VPA adjusts resource requests per pod. It scales vertically. Use VPA when:
- Your workload cannot scale horizontally (single-instance databases, stateful apps).
- The right resource request is hard to predict at deploy time.
- You want to optimize resource utilization across many workloads without manual tuning.
VPA requires pod restarts in most modes (except future in-place resize support) and has a slower control loop (1-minute default).
Can They Work Together?
Running VPA and HPA on the same workload is risky if both target the same metric.
Conflict scenario: VPA and HPA both react to CPU.
- HPA sees high CPU, scales from 1 to 3 replicas.
- VPA sees high CPU, increases CPU requests.
- Higher requests change the HPA’s utilization calculation (utilization = usage / request).
- HPA might scale down because utilization looks lower with higher requests.
- VPA might scale up again because fewer pods mean higher per-pod usage.
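HPA’s documented scaling formula makes this feedback loop easy to see. The sketch below is simplified (the real controller adds tolerances and stabilization windows), and the numbers are illustrative:

```python
import math

# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization),
# where utilization = usage / request.
def desired_replicas(current_replicas, usage_m, request_m, target_util=0.8):
    utilization = usage_m / request_m
    return math.ceil(current_replicas * utilization / target_util)

# Pods use 450m each. With a 500m request, utilization is 0.9 -> scale 3 -> 4.
print(desired_replicas(3, 450, 500))
# After VPA raises the request to 900m, utilization drops to 0.5 -> scale 3 -> 2,
# even though actual usage never changed.
print(desired_replicas(3, 450, 900))
```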
Safe combinations:
- HPA on CPU, VPA in Off or Initial mode. VPA suggests memory adjustments but does not auto-apply them.
- HPA on custom metrics (like HTTP requests per second), VPA on CPU/memory. They target different dimensions.
- VPA controls CPU, HPA controls memory (rare). Requires careful testing.
The Kubernetes community is working on a combined autoscaler (Multidimensional Pod Autoscaler) to coordinate both dimensions, but it is not yet stable.
For now, pick one. Use HPA for stateless workloads that need fast scaling. Use VPA for stateful workloads or environments where you want set-and-forget resource optimization.
Production Considerations
Section titled “Production Considerations”Pod Restarts and Downtime
In Auto and Recreate modes, VPA evicts pods to apply new recommendations. Each eviction causes a pod restart.
For Deployments with multiple replicas, VPA evicts one pod at a time, respecting PodDisruptionBudgets. The workload stays available.
For StatefulSets with a single replica (like a database), evicting the pod causes downtime. Use Off or Initial mode and schedule updates during maintenance windows.
Frequency of Updates
VPA does not update pods on every metric change. It uses a threshold (default 10% difference between current and recommended requests) to avoid constant churn.
If your app’s resource usage is stable, VPA might never update it. If usage fluctuates, VPA updates gradually as the histogram shifts.
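The threshold check amounts to comparing the relative gap against 10% (a sketch of the idea, not VPA source code):

```python
# Decide whether the Updater should evict a pod to apply the Target
# recommendation, assuming the default 10% threshold.
def should_evict(current_request, target, threshold=0.10):
    return abs(target - current_request) / current_request > threshold

print(should_evict(100, 108))  # False: an 8% gap is within the threshold
print(should_evict(100, 250))  # True: a 150% gap triggers an update
```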
OOM Handling
If a pod OOMs, VPA learns from it. The next recommendation includes the OOM event as a data point. VPA increases the memory recommendation to prevent future OOMs.
However, VPA cannot prevent the initial OOM. If you deploy a new workload with very low memory requests, it might OOM before VPA gathers enough data. Set reasonable initial requests.
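A sketch of the bump-up logic. The constants are assumptions based on the open-source Recommender, which applies a proportional bump with a fixed minimum floor; verify them against your VPA version:

```python
MIB = 1024 * 1024

# Assumed constants: a 1.2x proportional bump and a 100MiB minimum bump.
def memory_after_oom(memory_used_bytes, bump_ratio=1.2, min_bump_bytes=100 * MIB):
    # Take the larger of the proportional bump and the fixed minimum bump.
    return max(memory_used_bytes * bump_ratio, memory_used_bytes + min_bump_bytes)

print(memory_after_oom(200 * MIB) / MIB)   # 300.0: the 100MiB floor dominates
print(memory_after_oom(1024 * MIB) / MIB)  # ~1228.8: the 1.2x ratio dominates
```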
Resource Policy Tuning
Choose minAllowed and maxAllowed based on your cluster capacity and cost tolerance.
Example for a web app on a cost-sensitive cluster:
```yaml
resourcePolicy:
  containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 10m
        memory: 32Mi
      maxAllowed:
        cpu: 500m
        memory: 1Gi
```

This prevents VPA from recommending more than 500m CPU or 1Gi memory per pod. If the app needs more, you manually adjust the policy.
Multi-Cluster Environments
VPA recommendations are cluster-specific. If you run the same app in dev and prod clusters, VPA might recommend different values based on traffic patterns.
Do not copy VPA recommendations across clusters. Let each cluster’s VPA learn from its own metrics.
Monitoring VPA
Section titled “Monitoring VPA”Check VPA status with:
```sh
kubectl describe vpa resource-consumer-vpa -n vpa-demo
```

Look for:
- Recommendation: Current Target values.
- Conditions: Errors or warnings (like missing metrics).
You can also export VPA metrics to Prometheus. The VPA Recommender exposes metrics at /metrics, including recommendation values and update counts.
Common Pitfalls
Section titled “Common Pitfalls”VPA and HPA on the Same Metric
As discussed, this causes conflicts. VPA changes requests, HPA recalculates utilization, both fight.
Symptom: Replicas and resource requests oscillate. Pods constantly restart.
Fix: Pick one. Use HPA on a custom metric if you need both.
JVM Applications
JVM apps allocate a heap at startup based on available memory. If VPA increases memory requests after the JVM starts, the heap size does not change. The extra memory goes unused.
Fix: Configure the JVM to respect container limits. Use -XX:+UseContainerSupport (Java 8u191+) so the JVM dynamically adjusts heap size based on cgroup limits. Even better, configure the JVM to use a percentage of available memory:
```
-XX:MaxRAMPercentage=75.0
```

This way, when VPA increases memory, the JVM uses it.
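One way to wire this up is through the JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically. The container name and image in this Deployment fragment are illustrative, not from the demo:

```yaml
containers:
  - name: app
    image: my-jvm-app:latest   # hypothetical image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"
```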
Batch Jobs
VPA is designed for long-running workloads. It needs 8 days of data (by default) to build accurate histograms.
Batch jobs that run for minutes or hours do not give VPA enough data. VPA either makes no recommendations or bases them on too few samples.
Fix: Use Off mode for jobs. Set requests based on test runs or historical job metrics outside of VPA.
Missing Metrics Server
VPA depends on the metrics server for CPU and memory data. If the metrics server is not running, VPA has no input.
Check with:
```sh
kubectl top pods -n vpa-demo
```

If this fails, install the metrics server:
```sh
minikube addons enable metrics-server
```

VPA and StatefulSets
VPA can evict StatefulSet pods, but StatefulSets roll out updates sequentially by default. If you have a 10-replica StatefulSet and VPA evicts a pod, the StatefulSet waits for that pod to become ready before moving to the next.
This makes VPA updates slow for large StatefulSets. Consider using Initial mode instead, and manually trigger updates.
Limits vs Requests
VPA calculates recommendations for resource requests. When a container also defines limits, the Admission Controller scales them proportionally to preserve the original limit-to-request ratio.
If your pod has requests.cpu: 100m and limits.cpu: 200m (a 2:1 ratio) and VPA raises the request to 300m, the limit becomes 600m. The spec stays valid, but limits can grow much larger than you intended.
To keep limits fixed, set controlledValues: RequestsOnly in the container policy so VPA changes only requests, and use maxAllowed to cap the requests themselves.
Alternatively, remove limits entirely and rely only on requests. Many production clusters do this to avoid CPU throttling.
Real-World Example
The demo’s deployment starts with very low requests:
```yaml
# From manifests/deployment.yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 500m
    memory: 256Mi
```

But the stress container actually uses 100M of memory and pegs one CPU core. With cpu: 1 in the stress args, the container tries to use 1000m CPU.
The pod gets throttled at 500m (the limit) and uses far more than the 50m request. VPA sees this and recommends higher values.
After a few minutes in Off mode:
```sh
kubectl describe vpa resource-consumer-vpa -n vpa-demo
```

You might see:
```
Recommendation:
  Container Recommendations:
    Container Name:  app
    Lower Bound:
      Cpu:     100m
      Memory:  128Mi
    Target:
      Cpu:     250m
      Memory:  140Mi
    Uncapped Target:
      Cpu:     250m
      Memory:  140Mi
    Upper Bound:
      Cpu:     500m
      Memory:  256Mi
```

VPA learned that the pod needs more than 50m CPU and 64Mi memory. The Target recommendation suggests 250m and 140Mi.
Switching to Auto mode:
```sh
kubectl delete vpa resource-consumer-vpa -n vpa-demo
kubectl apply -f manifests/vpa-auto.yaml
```

VPA evicts the pods. The Admission Controller injects the new requests. Check the running pod:
```sh
kubectl get pod <pod-name> -n vpa-demo -o jsonpath='{.spec.containers[0].resources}'
```

You see requests close to the Target recommendation.
The VPA Control Loop
- Metrics Collection: VPA Recommender queries the metrics server every 60 seconds.
- Histogram Update: New CPU and memory samples are added to the histogram. Old samples decay.
- Recommendation Calculation: VPA computes Lower Bound, Target, Uncapped Target, and Upper Bound from histogram percentiles.
- Policy Application: VPA clamps Target to minAllowed and maxAllowed.
- Updater Decision: VPA Updater checks if current pod requests differ from Target by more than 10%.
- Eviction: If yes, Updater evicts the pod (in Auto or Recreate mode).
- Admission: When the replacement pod is created, Admission Controller injects the new requests.
- Repeat: The cycle continues. As usage patterns change, recommendations adapt.
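The decision-making steps above can be condensed into a toy tick function. It assumes a 90th-percentile target and the default 10% eviction threshold; this is a sketch of the flow, not VPA’s implementation:

```python
def percentile(samples, p):
    # Nearest-rank percentile over a sorted copy of the samples.
    s = sorted(samples)
    return s[round(p / 100 * (len(s) - 1))]

def control_loop_tick(samples, current_request, min_allowed, max_allowed):
    target = percentile(samples, 90)                     # recommendation
    target = max(min_allowed, min(target, max_allowed))  # policy clamp
    evict = abs(target - current_request) / current_request > 0.10
    return target, evict

# Usage climbed from ~70m to ~250m while the request stayed at 50m,
# so the tick recommends 255m and evicts the pod.
samples = [60, 70, 80, 240, 245, 248, 250, 250, 255, 260]
print(control_loop_tick(samples, 50, 50, 1000))
```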
The entire loop is slower than HPA (minutes vs seconds) because VPA needs more data to make stable decisions. Resource requests are expensive to change (pod restart), so VPA waits for high confidence.
Connection to the Demo
The demo walks through VPA’s lifecycle:
- Deploy: A stress workload with low CPU and memory requests.
- VPA in Off mode: Watch recommendations without changes. VPA suggests higher values.
- Switch to Auto mode: VPA evicts and recreates pods with updated requests.
- Verify: Compare old and new requests. Check that recommendations respect resource policies.
- Compare to HPA: Understand when to use VPA (right-sizing) vs HPA (scaling out).
The demo uses a stress container to exaggerate the difference between initial requests and actual usage. In production, the gap is usually smaller, but VPA still helps optimize thousands of pods across a cluster.
Further Reading
- VPA GitHub repository
- VPA design proposal
- Kubernetes resource management
- HPA documentation
- In-place pod resize (KEP)