# ML Model Serving

Deploy a FastAPI prediction service with a Horizontal Pod Autoscaler that scales based on CPU load.

Time: ~15 minutes | Difficulty: Intermediate

Resources: This demo needs ~1 GB RAM. Clean up other demos first:

```shell
task clean:all
```
## What You Will Learn

- Serving a model prediction API on Kubernetes
- Horizontal Pod Autoscaler (HPA) with CPU-based scaling
- Load testing to trigger autoscaling
- Watching replicas scale up and down in real time
- Using ConfigMaps to mount application code
## Architecture

```
Client --> FastAPI Model Server (Deployment) --> HPA scales based on CPU
           /predict endpoint                     min: 1, max: 5 replicas
```

The model server runs a FastAPI app with a `/predict` endpoint that simulates inference latency. When CPU utilization exceeds 50%, the HPA adds replicas. When load drops, it scales back down. This pattern works the same whether you are running a real ML model or a simulated one.
## Deploy

### Step 1: Create the namespace

```shell
kubectl apply -f demos/ml-model-serving/manifests/namespace.yaml
```

### Step 2: Deploy the model server

```shell
kubectl apply -f demos/ml-model-serving/manifests/model-server.yaml
```

### Step 3: Deploy the HPA

```shell
kubectl apply -f demos/ml-model-serving/manifests/hpa.yaml
```

### Step 4: Wait for the model server to be ready

```shell
kubectl rollout status deployment/model-server -n ml-demo --timeout=120s
```

Note: The first startup takes about 30 seconds while pip installs FastAPI and uvicorn.
## Verify

```shell
# Check pods are running
kubectl get pods -n ml-demo

# Check the HPA
kubectl get hpa -n ml-demo

# Test the predict endpoint
kubectl port-forward svc/model-server 8000:8000 -n ml-demo &
curl http://localhost:8000/predict
```

You should see a JSON response like `{"prediction": 0.42, "model": "v1"}`.
## Load Test and Watch Autoscaling

### Step 1: Start the load test

```shell
kubectl apply -f demos/ml-model-serving/manifests/load-test.yaml
```

### Step 2: Watch the HPA react

```shell
kubectl get hpa -n ml-demo -w
```

After 1-2 minutes of sustained load, you should see the replica count increase from 1 toward 5.

### Step 3: Watch pods scale

```shell
kubectl get pods -n ml-demo -w
```

### Step 4: Stop the load test and watch scale-down

```shell
kubectl delete pod load-test -n ml-demo
```

After the stabilization window (default 5 minutes), the HPA scales replicas back down.
## What is Happening

```
manifests/
  namespace.yaml     # ml-demo namespace
  model-server.yaml  # FastAPI Deployment + Service + ConfigMap with Python script
  hpa.yaml           # HPA targeting 50% CPU, 1-5 replicas
  load-test.yaml     # Pod that sends requests in a tight loop
```

The model-server Deployment mounts a Python script from a ConfigMap. The container installs FastAPI and uvicorn at startup, then runs the app. The `/predict` endpoint simulates inference by sleeping for 100 ms and returning a random prediction value.
The HPA monitors CPU utilization reported by the metrics server. When average CPU across all model-server pods exceeds 50% of each pod's CPU request, it creates new replicas. When load drops, it removes them after a cooldown period.
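The scaling decision follows the standard HPA formula from the Kubernetes documentation: scale replicas proportionally to how far the observed metric is from the target, rounding up, then clamp to the configured bounds. A small sketch (this demo's min of 1 and max of 5 are used as defaults):

```python
# Standard HPA replica calculation:
#   desired = ceil(currentReplicas * currentMetric / targetMetric)
# clamped to [minReplicas, maxReplicas].
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 5) -> int:
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# One pod averaging 120% CPU against a 50% target wants ceil(1 * 120/50) = 3 pods.
print(desired_replicas(1, 120, 50))  # -> 3
```

This is why raising the target to 80% in the Experiment section makes scaling harder to trigger: the denominator grows, so the same load yields a smaller desired count.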
The load-test pod sends continuous requests to the model-server service, driving CPU utilization up and triggering the autoscaler.
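The load-test pod's exact command is defined in `load-test.yaml`; conceptually it is just a tight request loop like this (a hypothetical stand-alone version, not the pod's actual command):

```python
# Hypothetical sketch of the load generator: hit an endpoint in a tight
# loop and count successful responses.
import urllib.request

def hammer(url: str, n: int) -> int:
    """Fire n GET requests at url; return how many returned HTTP 200."""
    ok = 0
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    ok += 1
        except OSError:
            pass  # count failures as misses and keep hammering
    return ok
```

In-cluster, the target would be the Service DNS name, e.g. `http://model-server.ml-demo.svc:8000/predict`.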
## Experiment

1. Change the HPA target CPU to 80% and observe that it takes more load to trigger scaling:

   ```shell
   kubectl patch hpa model-server-hpa -n ml-demo \
     --type=merge -p '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":80}}}]}}'
   ```

2. Watch the metrics server data directly:

   ```shell
   kubectl top pods -n ml-demo
   ```

3. Check the HPA events to see scaling decisions:

   ```shell
   kubectl describe hpa model-server-hpa -n ml-demo
   ```

4. Increase the simulated inference time by editing the ConfigMap (change `time.sleep(0.1)` to `time.sleep(0.5)`) and restart the deployment:

   ```shell
   kubectl rollout restart deployment/model-server -n ml-demo
   ```
## Cleanup

```shell
kubectl delete namespace ml-demo
```

## Further Reading

See docs/deep-dive.md for details on HPA algorithms, scaling behavior tuning, custom metrics autoscaling, real model serving with TensorFlow Serving or Triton, and production considerations for ML inference workloads.
## Next Step

Move on to GitOps Full Loop to see how Tekton CI and ArgoCD CD connect into a complete deployment pipeline.