ML Model Serving: Deep Dive

This document covers horizontal pod autoscaling for inference workloads, the tradeoffs of different scaling strategies, and what changes when you move from a simulated model to a real one.

The Horizontal Pod Autoscaler runs a control loop every 15 seconds (configurable). It queries the metrics API for current CPU utilization, computes the desired replica count, and adjusts the Deployment.

The formula:

desiredReplicas = ceil(currentReplicas * (currentUtilization / targetUtilization))

For example, if 1 pod is at 90% CPU and the target is 50%, the HPA wants ceil(1 * 90/50) = 2 replicas.
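The formula above, together with the controller's default 10% tolerance (scaling is suppressed when the utilization ratio is close enough to 1.0), can be sketched in a few lines of Python. The function name and signature are illustrative, not the controller's actual code:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling formula with the default 10% tolerance."""
    ratio = current_utilization / target_utilization
    # If utilization is within tolerance of the target, leave replicas alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(1, 90, 50))  # the example above: 2 replicas
print(desired_replicas(4, 52, 50))  # ratio 1.04 is within tolerance: stays at 4
```

The tolerance is why a deployment hovering just above target utilization does not churn: at 52% against a 50% target, the ratio is 1.04 and no scaling occurs.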

By default, the HPA can double the number of replicas or add 4 pods (whichever is larger) every 15 seconds. This allows rapid response to load spikes.

Scale-down is more conservative. The HPA waits for a stabilization window (default 300 seconds) before removing pods. This prevents flapping when load fluctuates.

You can customize both behaviors:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60

CPU is a rough proxy for inference load. Better options include:

  • Request queue depth: Scale based on pending requests
  • Inference latency (p99): Scale when latency exceeds a threshold
  • GPU utilization: For GPU-based models

These require a custom metrics adapter like Prometheus Adapter or KEDA.
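With an adapter in place, the HPA can target a pod-level metric instead of CPU. A minimal sketch of the `metrics` stanza in an `autoscaling/v2` HorizontalPodAutoscaler, assuming the adapter exposes a metric named `inference_queue_depth` (the metric name and target value are illustrative):

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: inference_queue_depth
    target:
      type: AverageValue
      averageValue: "10"
```

Here the HPA scales so that the average queue depth per pod stays at or below 10 pending requests.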

In production, you would replace the simulated Python script with a dedicated model server:

| Framework | Use Case |
| --- | --- |
| TensorFlow Serving | TensorFlow/Keras models |
| Triton Inference Server | Multi-framework (TF, PyTorch, ONNX) |
| TorchServe | PyTorch models |
| Seldon Core | ML deployment platform for Kubernetes |
| KServe | Serverless inference on Kubernetes |

This demo mounts the Python script via ConfigMap for simplicity and fast iteration. In production, bake the application code and model weights into the container image for reproducibility, versioning, and faster startup.
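Baking everything into the image might look like the following Dockerfile sketch, assuming the serving script is `serve.py` with a `requirements.txt`, and the weights live in a `model/` directory (all three names are illustrative):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake application code and model weights into the image so every
# replica runs the exact same, versioned artifact
COPY serve.py .
COPY model/ ./model/

CMD ["python", "serve.py"]
```

Tagging the resulting image with the model version then gives you rollback and reproducibility for free via the Deployment's image field.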

ML inference is often memory-bound (loading model weights) or GPU-bound. Set resource requests accurately to avoid:

  • Under-provisioning: OOMKilled pods when loading large models
  • Over-provisioning: Wasted cluster resources and poor bin-packing
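As a sketch, a container serving a model with roughly 2 GiB of weights might reserve headroom like this (the numbers are illustrative sizing, not a recommendation):

```yaml
resources:
  requests:
    cpu: "1"
    memory: 3Gi
  limits:
    memory: 4Gi
```

The memory request covers the weights plus runtime overhead so the scheduler places the pod on a node that can actually load the model; the limit caps worst-case usage without inviting OOM kills during startup.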