ML Model Serving: Deep Dive
Overview
This document covers horizontal pod autoscaling for inference workloads, the tradeoffs of different scaling strategies, and what changes when you move from a simulated model to a real one.
How the HPA Works
The Horizontal Pod Autoscaler runs a control loop every 15 seconds (configurable). It queries the metrics API for current CPU utilization, computes the desired replica count, and adjusts the Deployment.
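A minimal `autoscaling/v2` HPA that drives this loop might look like the following sketch. The Deployment name `model-server` and the replica bounds are illustrative, not taken from this demo:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server        # illustrative Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # target 50% average CPU across pods
```

The controller compares observed average utilization against `averageUtilization` on every loop iteration and resizes the target within `[minReplicas, maxReplicas]`.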
The formula:
```
desiredReplicas = ceil(currentReplicas * (currentUtilization / targetUtilization))
```

For example, if 1 pod is at 90% CPU and the target is 50%, the HPA wants ceil(1 * 90/50) = 2 replicas.
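The formula can be sketched in a few lines of Python. Note that the real controller also skips scaling when the current/target ratio is close to 1.0 (a default 10% tolerance); the `tolerance` parameter below models that behavior and is otherwise an assumption of this sketch:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.10) -> int:
    """Sketch of the HPA replica calculation (not the controller's actual code)."""
    ratio = current_utilization / target_utilization
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(1, 90, 50))  # 1 pod at 90% CPU, target 50% -> 2
print(desired_replicas(4, 52, 50))  # 52/50 is within the 10% tolerance -> stays at 4
```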
Scaling Behavior
Scale-up
By default, the HPA can double the number of replicas or add 4 pods (whichever is larger) every 15 seconds. This allows rapid response to load spikes.
Scale-down
Scale-down is more conservative. The HPA waits for a stabilization window (default 300 seconds) before removing pods. This prevents flapping when load fluctuates.
You can customize both behaviors:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60
```

Custom Metrics Autoscaling
CPU is a rough proxy for inference load. Better options include:
- Request queue depth: Scale based on pending requests
- Inference latency (p99): Scale when latency exceeds a threshold
- GPU utilization: For GPU-based models
These require a custom metrics adapter like Prometheus Adapter or KEDA.
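With KEDA, scaling on queue depth can be expressed as a `ScaledObject` backed by a Prometheus query. The following is a sketch: the Deployment name, the Prometheus address, and the `inference_queue_depth` metric are all hypothetical and would need to match your setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler
spec:
  scaleTargetRef:
    name: model-server          # illustrative Deployment name
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090  # assumed Prometheus endpoint
      query: sum(inference_queue_depth)                     # hypothetical metric
      threshold: "10"   # target pending requests per replica
```

KEDA then feeds this external metric into the HPA machinery, so the scale-up/scale-down behavior described above still applies.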
Real Model Serving
In production, you would replace the simulated Python script with:
| Framework | Use Case |
|---|---|
| TensorFlow Serving | TensorFlow/Keras models |
| Triton Inference Server | Multi-framework (TF, PyTorch, ONNX) |
| TorchServe | PyTorch models |
| Seldon Core | ML deployment platform for Kubernetes |
| KServe | Serverless inference on Kubernetes |
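As one concrete example, a Triton Inference Server Deployment might be sketched as below. The image tag, ports, and `emptyDir` model store are illustrative; a real setup would pin a tested tag and mount a persistent or image-baked model repository:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3   # pin the tag you have validated
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000   # HTTP inference
        - containerPort: 8001   # gRPC inference
        - containerPort: 8002   # Prometheus metrics
        volumeMounts:
        - name: model-repo
          mountPath: /models
      volumes:
      - name: model-repo
        emptyDir: {}   # placeholder; use a PVC or bake models into the image
```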
ConfigMap vs Container Image for Code
This demo mounts the Python script via ConfigMap for simplicity and fast iteration. In production, bake the application code and model weights into the container image for reproducibility, versioning, and faster startup.
Resource Considerations
ML inference is often memory-bound (loading model weights) or GPU-bound. Set resource requests accurately to avoid:
- Under-provisioning: OOMKilled pods when loading large models
- Over-provisioning: Wasted cluster resources and poor bin-packing
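A container spec for a memory-heavy model might set requests and limits like this sketch; the sizes are illustrative and should be derived from the actual model's footprint:

```yaml
resources:
  requests:
    memory: "4Gi"      # roughly model weights + runtime overhead (illustrative)
    cpu: "1"
  limits:
    memory: "4Gi"      # request == limit avoids surprise OOMKills from node overcommit
    nvidia.com/gpu: 1  # only for GPU-backed models; GPUs are specified as limits
```

Setting the memory request equal to the limit gives the pod the Guaranteed-style memory behavior that large model loads need, at the cost of reserving that memory even when idle.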