ML Model Serving: Deep Dive

This document covers horizontal pod autoscaling for inference workloads, the tradeoffs of different scaling strategies, and what changes when you move from a simulated model to a real one.

The Horizontal Pod Autoscaler runs a control loop every 15 seconds (configurable). It queries the metrics API for current CPU utilization, computes the desired replica count, and adjusts the Deployment.

The formula:

desiredReplicas = ceil(currentReplicas * (currentUtilization / targetUtilization))

For example, if 1 pod is at 90% CPU and the target is 50%, the HPA wants ceil(1 * 90/50) = 2 replicas.
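The formula above, together with the controller's default 10% tolerance (scaling is suppressed when the utilization ratio is close enough to 1.0), can be sketched in a few lines of Python. The function name and signature are illustrative, not the controller's actual code:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling formula with the default 10% tolerance."""
    ratio = current_utilization / target_utilization
    # If utilization is within tolerance of the target, leave replicas alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(1, 90, 50))  # the example above: 2 replicas
print(desired_replicas(4, 52, 50))  # ratio 1.04 is within tolerance: stays at 4
```

The tolerance is why a deployment hovering just above target utilization does not churn: at 52% against a 50% target, the ratio is 1.04 and no scaling occurs.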

By default, the HPA can double the number of replicas or add 4 pods (whichever is larger) every 15 seconds. This allows rapid response to load spikes.

Scale-down is more conservative. The HPA waits for a stabilization window (default 300 seconds) before removing pods. This prevents flapping when load fluctuates.

You can customize both behaviors:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60

CPU is a rough proxy for inference load. Better options include:

  • Request queue depth: Scale based on pending requests
  • Inference latency (p99): Scale when latency exceeds a threshold
  • GPU utilization: For GPU-based models

These require a custom metrics adapter like Prometheus Adapter or KEDA.
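With an adapter in place, the HPA can target a pod-level metric instead of CPU. A minimal sketch of the `metrics` stanza in an `autoscaling/v2` HorizontalPodAutoscaler, assuming the adapter exposes a metric named `inference_queue_depth` (the metric name and target value are illustrative):

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: inference_queue_depth
    target:
      type: AverageValue
      averageValue: "10"
```

Here the HPA scales so that the average queue depth per pod stays at or below 10 pending requests.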

In production, you would replace the simulated Python script with a dedicated model server:

| Framework | Use Case |
| --- | --- |
| TensorFlow Serving | TensorFlow/Keras models |
| Triton Inference Server | Multi-framework (TF, PyTorch, ONNX) |
| TorchServe | PyTorch models |
| Seldon Core | ML deployment platform for Kubernetes |
| KServe | Serverless inference on Kubernetes |

This demo mounts the Python script via ConfigMap for simplicity and fast iteration. In production, bake the application code and model weights into the container image for reproducibility, versioning, and faster startup.
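Baking everything into the image might look like the following Dockerfile sketch, assuming the serving script is `serve.py` with a `requirements.txt`, and the weights live in a `model/` directory (all three names are illustrative):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake application code and model weights into the image so every
# replica runs the exact same, versioned artifact
COPY serve.py .
COPY model/ ./model/

CMD ["python", "serve.py"]
```

Tagging the resulting image with the model version then gives you rollback and reproducibility for free via the Deployment's image field.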

ML inference is often memory-bound (loading model weights) or GPU-bound. Set resource requests accurately to avoid:

  • Under-provisioning: OOMKilled pods when loading large models
  • Over-provisioning: Wasted cluster resources and poor bin-packing
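As a sketch, a container serving a model with roughly 2 GiB of weights might reserve headroom like this (the numbers are illustrative sizing, not a recommendation):

```yaml
resources:
  requests:
    cpu: "1"
    memory: 3Gi
  limits:
    memory: 4Gi
```

The memory request covers the weights plus runtime overhead so the scheduler places the pod on a node that can actually load the model; the limit caps worst-case usage without inviting OOM kills during startup.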