
ML Model Serving

Deploy a FastAPI prediction service with Horizontal Pod Autoscaler that scales based on CPU load.

Time: ~15 minutes | Difficulty: Intermediate

Resources: This demo needs ~1GB RAM. Clean up other demos first: task clean:all

This demo covers:

  • Serving a model prediction API on Kubernetes
  • Horizontal Pod Autoscaler (HPA) with CPU-based scaling
  • Load testing to trigger autoscaling
  • Watching replicas scale up and down in real time
  • Using ConfigMaps to mount application code
Client --> FastAPI Model Server (Deployment) --> HPA scales based on CPU
           /predict endpoint                     min: 1, max: 5 replicas

The model server runs a FastAPI app with a /predict endpoint that simulates inference latency. When CPU utilization exceeds 50%, the HPA adds replicas. When load drops, it scales back down. This pattern works the same whether you are running a real ML model or a simulated one.

Step 1: Create the namespace

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/namespace.yaml
Step 2: Deploy the model server

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/model-server.yaml
Step 3: Create the HPA

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/hpa.yaml
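The HPA manifest is roughly equivalent to the following sketch (a minimal reconstruction, not the exact demo file; names and values are taken from the commands and settings described on this page: 50% CPU target, 1-5 replicas):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
  namespace: ml-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```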

Step 4: Wait for the model server to be ready

Terminal window
kubectl rollout status deployment/model-server -n ml-demo --timeout=120s

Note: The first startup takes about 30 seconds while pip installs FastAPI and uvicorn.

Terminal window
# Check pods are running
kubectl get pods -n ml-demo
# Check the HPA
kubectl get hpa -n ml-demo
# Test the predict endpoint
kubectl port-forward svc/model-server 8000:8000 -n ml-demo &
curl http://localhost:8000/predict

You should see a JSON response like {"prediction": 0.42, "model": "v1"}.

Step 1: Start the load test

Terminal window
kubectl apply -f demos/ml-model-serving/manifests/load-test.yaml
Step 2: Watch the HPA react

Terminal window
kubectl get hpa -n ml-demo -w

After 1-2 minutes of sustained load, you should see the replica count increase from 1 toward 5.

Step 3: Watch the pods scale up

Terminal window
kubectl get pods -n ml-demo -w

Step 4: Stop the load test and watch scale-down

Terminal window
kubectl delete pod load-test -n ml-demo

After the stabilization window (default 5 minutes), the HPA will scale replicas back down.
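The five-minute wait comes from the HPA's default scale-down stabilization window. If you want a faster scale-down for demo purposes, a behavior stanza like this can be added to the HPA spec (the values here are illustrative, not part of the demo manifests):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300 (5 minutes)
      policies:
        - type: Percent
          value: 100        # allow removing up to 100% of surplus replicas
          periodSeconds: 15 # per 15-second period
```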

manifests/
  namespace.yaml     # ml-demo namespace
  model-server.yaml  # FastAPI Deployment + Service + ConfigMap with Python script
  hpa.yaml           # HPA targeting 50% CPU, 1-5 replicas
  load-test.yaml     # Pod that sends requests in a tight loop

The model-server Deployment mounts a Python script from a ConfigMap. The container installs FastAPI and uvicorn at startup, then runs the app. The /predict endpoint simulates inference by sleeping for 100ms and returning a random prediction value.
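The ConfigMap-mounted script is along these lines (a minimal sketch, not the exact demo code; the response shape matches the JSON shown earlier):

```python
import random
import time

def predict():
    # Simulate ~100 ms of inference latency per request, as the demo describes.
    time.sleep(0.1)
    return {"prediction": round(random.random(), 2), "model": "v1"}

try:
    # FastAPI is pip-installed at container startup; guard the import so the
    # prediction logic stays usable outside the container.
    from fastapi import FastAPI
    app = FastAPI()
    app.get("/predict")(predict)  # register predict() at the /predict route
except ImportError:
    pass
```

In the container, uvicorn serves this app; changing the `time.sleep` value (as suggested in the experiments below) changes the per-request latency.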

The HPA monitors CPU utilization reported by the metrics server. When average CPU across all model-server pods exceeds 50% of their requested CPU, it creates new replicas. When load drops, it removes them after a cooldown period.
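The scaling decision follows the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. A sketch of that arithmetic with this demo's settings:

```python
import math

def desired_replicas(current: int, avg_cpu_pct: float,
                     target_pct: float = 50, min_r: int = 1, max_r: int = 5) -> int:
    # Scale proportionally to how far observed average utilization is from
    # the target, then clamp to the HPA's [minReplicas, maxReplicas] range.
    desired = math.ceil(current * avg_cpu_pct / target_pct)
    return max(min_r, min(max_r, desired))
```

For example, one replica averaging 120% CPU yields ceil(1 × 120 / 50) = 3 replicas, which matches the step-up you see while the load test runs.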

The load-test pod sends continuous requests to the model-server service, driving CPU utilization up and triggering the autoscaler.
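The load generator is conceptually just a tight request loop. A Python equivalent is sketched below (the demo pod may well use a shell loop instead; the URL and the injectable `fetch` hook are illustrative):

```python
import urllib.request

def load_loop(url: str, n: int, fetch=None) -> int:
    # Send n back-to-back requests; return how many succeeded.
    # `fetch` is injectable so the loop can be exercised without a network.
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).read())
    ok = 0
    for _ in range(n):
        try:
            fetch(url)
            ok += 1
        except Exception:
            pass  # keep hammering even if individual requests fail
    return ok
```

With no request pacing, each pod running this loop keeps the model server busy continuously, which is what pushes average CPU past the 50% target.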

Things to try:

  1. Change the HPA target CPU to 80% and observe how it takes more load to trigger scaling:

    Terminal window
    kubectl patch hpa model-server-hpa -n ml-demo \
    --type=merge -p '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":80}}}]}}'
  2. Watch the metrics server data directly:

    Terminal window
    kubectl top pods -n ml-demo
  3. Check the HPA events to see scaling decisions:

    Terminal window
    kubectl describe hpa model-server-hpa -n ml-demo
  4. Increase the simulated inference time by editing the ConfigMap (change time.sleep(0.1) to time.sleep(0.5)) and restart the deployment:

    Terminal window
    kubectl rollout restart deployment/model-server -n ml-demo
Cleanup

Terminal window
kubectl delete namespace ml-demo

See docs/deep-dive.md for details on HPA algorithms, scaling behavior tuning, custom metrics autoscaling, real model serving with TensorFlow Serving or Triton, and production considerations for ML inference workloads.

Move on to GitOps Full Loop to see how Tekton CI and ArgoCD CD connect into a complete deployment pipeline.