
ML Model Serving

Deploy a FastAPI prediction service with Horizontal Pod Autoscaler that scales based on CPU load.

Time: ~15 minutes | Difficulty: Intermediate

Resources: This demo needs ~1GB RAM. Clean up other demos first: task clean:all

This demo covers:

  • Serving a model prediction API on Kubernetes
  • Horizontal Pod Autoscaler (HPA) with CPU-based scaling
  • Load testing to trigger autoscaling
  • Watching replicas scale up and down in real time
  • Using ConfigMaps to mount application code
Client --> FastAPI Model Server (Deployment) --> HPA scales based on CPU
           /predict endpoint                     min: 1, max: 5 replicas

The model server runs a FastAPI app with a /predict endpoint that simulates inference latency. When CPU utilization exceeds 50%, the HPA adds replicas. When load drops, it scales back down. This pattern works the same whether you are running a real ML model or a simulated one.

Step 1: Create the namespace

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/namespace.yaml
Step 2: Deploy the model server

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/model-server.yaml
Step 3: Create the HPA

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/hpa.yaml
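The HPA manifest is roughly equivalent to the following sketch (a minimal reconstruction, not the exact demo file; names and values are taken from the commands and settings described on this page: 50% CPU target, 1-5 replicas):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
  namespace: ml-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```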

Step 4: Wait for the model server to be ready

Terminal window
kubectl rollout status deployment/model-server -n ml-demo --timeout=120s

Note: The first startup takes about 30 seconds while pip installs FastAPI and uvicorn.

Terminal window
# Check pods are running
kubectl get pods -n ml-demo
# Check the HPA
kubectl get hpa -n ml-demo
# Test the predict endpoint
kubectl port-forward svc/model-server 8000:8000 -n ml-demo &
curl http://localhost:8000/predict

You should see a JSON response like {"prediction": 0.42, "model": "v1"}.

Step 1: Start the load test

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/load-test.yaml
Step 2: Watch the HPA react

Terminal window
kubectl get hpa -n ml-demo -w

After 1-2 minutes of sustained load, you should see the replica count increase from 1 toward 5.

Step 3: Watch the pods scale up

Terminal window
kubectl get pods -n ml-demo -w

Step 4: Stop the load test and watch scale-down

Terminal window
kubectl delete pod load-test -n ml-demo

After the stabilization window (default 5 minutes), the HPA will scale replicas back down.
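The five-minute wait comes from the HPA's default scale-down stabilization window. If you want a faster scale-down for demo purposes, a behavior stanza like this can be added to the HPA spec (the values here are illustrative, not part of the demo manifests):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300 (5 minutes)
      policies:
        - type: Percent
          value: 100        # allow removing up to 100% of surplus replicas
          periodSeconds: 15 # per 15-second period
```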

manifests/
  namespace.yaml     # ml-demo namespace
  model-server.yaml  # FastAPI Deployment + Service + ConfigMap with Python script
  hpa.yaml           # HPA targeting 50% CPU, 1-5 replicas
  load-test.yaml     # Pod that sends requests in a tight loop

The model-server Deployment mounts a Python script from a ConfigMap. The container installs FastAPI and uvicorn at startup, then runs the app. The /predict endpoint simulates inference by sleeping for 100ms and returning a random prediction value.
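The ConfigMap-mounted script is along these lines (a minimal sketch, not the exact demo code; the response shape matches the JSON shown earlier):

```python
import random
import time

def predict():
    # Simulate ~100 ms of inference latency per request, as the demo describes.
    time.sleep(0.1)
    return {"prediction": round(random.random(), 2), "model": "v1"}

try:
    # FastAPI is pip-installed at container startup; guard the import so the
    # prediction logic stays usable outside the container.
    from fastapi import FastAPI
    app = FastAPI()
    app.get("/predict")(predict)  # register predict() at the /predict route
except ImportError:
    pass
```

In the container, uvicorn serves this app; changing the `time.sleep` value (as suggested in the experiments below) changes the per-request latency.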

The HPA monitors CPU utilization reported by the metrics server. When average CPU across all model-server pods exceeds 50% of their requested CPU, it creates new replicas. When load drops, it removes them after a cooldown period.
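The scaling decision follows the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. A sketch of that arithmetic with this demo's settings:

```python
import math

def desired_replicas(current: int, avg_cpu_pct: float,
                     target_pct: float = 50, min_r: int = 1, max_r: int = 5) -> int:
    # Scale proportionally to how far observed average utilization is from
    # the target, then clamp to the HPA's [minReplicas, maxReplicas] range.
    desired = math.ceil(current * avg_cpu_pct / target_pct)
    return max(min_r, min(max_r, desired))
```

For example, one replica averaging 120% CPU yields ceil(1 × 120 / 50) = 3 replicas, which matches the step-up you see while the load test runs.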

The load-test pod sends continuous requests to the model-server service, driving CPU utilization up and triggering the autoscaler.
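The load generator is conceptually just a tight request loop. A Python equivalent is sketched below (the demo pod may well use a shell loop instead; the URL and the injectable `fetch` hook are illustrative):

```python
import urllib.request

def load_loop(url: str, n: int, fetch=None) -> int:
    # Send n back-to-back requests; return how many succeeded.
    # `fetch` is injectable so the loop can be exercised without a network.
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).read())
    ok = 0
    for _ in range(n):
        try:
            fetch(url)
            ok += 1
        except Exception:
            pass  # keep hammering even if individual requests fail
    return ok
```

With no request pacing, each pod running this loop keeps the model server busy continuously, which is what pushes average CPU past the 50% target.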

Things to try:

  1. Change the HPA target CPU to 80% and observe how it takes more load to trigger scaling:

    Terminal window
    kubectl patch hpa model-server-hpa -n ml-demo \
    --type=merge -p '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":80}}}]}}'
  2. Watch the metrics server data directly:

    Terminal window
    kubectl top pods -n ml-demo
  3. Check the HPA events to see scaling decisions:

    Terminal window
    kubectl describe hpa model-server-hpa -n ml-demo
  4. Increase the simulated inference time by editing the ConfigMap (change time.sleep(0.1) to time.sleep(0.5)) and restart the deployment:

    Terminal window
    kubectl rollout restart deployment/model-server -n ml-demo
Cleanup

Terminal window
kubectl delete namespace ml-demo

See docs/deep-dive.md for details on HPA algorithms, scaling behavior tuning, custom metrics autoscaling, real model serving with TensorFlow Serving or Triton, and production considerations for ML inference workloads.

Move on to GitOps Full Loop to see how Tekton CI and ArgoCD CD connect into a complete deployment pipeline.