Knative Serving: Deep Dive

Kubernetes excels at running long-lived workloads. You create a Deployment, it spins up pods, and those pods run until you delete them. This works well for always-on services like APIs and web servers.

But many workloads are bursty. A webhook receiver might handle 100 requests per minute during business hours and zero requests overnight. A data processing function might run once an hour. A development preview environment might only see traffic when developers are testing.

In these scenarios, running pods 24/7 wastes resources. Traditional Kubernetes offers the Horizontal Pod Autoscaler, which adds and removes pods based on CPU or custom metrics. But the HPA has a hard floor of 1 replica. It cannot scale to zero. Your pods consume cluster resources even when completely idle.

Cloud providers solved this with serverless offerings like AWS Lambda, Google Cloud Functions, and Azure Functions. You write a function, upload it, and the platform runs it only when invoked. When idle, your function consumes zero resources and costs nothing.

Knative Serving brings this serverless model to Kubernetes. It provides scale-to-zero, automatic scale-up on incoming requests, revision-based deployments for traffic splitting, and concurrency-based autoscaling. You get the operational simplicity of serverless without leaving Kubernetes.

Knative Serving introduces new resource types. Understanding the relationship between them is essential.

A Knative Service is a high-level abstraction. When you create a Service, Knative creates two child resources: a Configuration and a Route.

```yaml
# From manifests/service-hello.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
  namespace: knative-demo
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "5"
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          env:
            - name: TARGET
              value: "World v1"
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
```

The Service creates a Configuration with the container spec and resource limits. The Configuration creates a Revision. Each Revision is an immutable snapshot of the container image, environment variables, and resource requests at a specific point in time.

When you update the Service (change the image, add an env var, modify resources), the Configuration creates a new Revision. Old Revisions continue to exist. You can route traffic to any Revision, not just the latest.

The Route defines how traffic is distributed across Revisions. It maps the Service’s URL to one or more Revisions with percentage-based weights.

```
Service (hello)
├── Configuration (hello)
│   ├── Revision (hello-00001)
│   └── Revision (hello-00002)
└── Route (hello)
    └── Traffic split (80% hello-00002, 20% hello-00001)
```

This separation of concerns is powerful. The Configuration owns the “what” (container spec). The Route owns the “how” (traffic routing). You can change routing without creating a new Revision.

Revisions never change. If you want to update the container image, you do not modify the Revision. You update the Service, which creates a new Revision.

This immutability enables safe rollbacks. You can pin traffic to an old Revision at any time. The old Revision’s pods still exist (or can be recreated if scaled to zero).

Revisions are named sequentially: hello-00001, hello-00002, hello-00003. The generation number is automatically incremented each time the Configuration changes.
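The naming rule is simple enough to sketch (the generation number is zero-padded to five digits; the function name here is ours, not Knative's):

```python
def revision_name(service: str, generation: int) -> str:
    """Knative revision names: service name plus zero-padded generation."""
    return f"{service}-{generation:05d}"

print(revision_name("hello", 1))  # hello-00001
print(revision_name("hello", 2))  # hello-00002
```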

By default, Knative keeps the last 20 Revisions. Older Revisions are automatically deleted. This prevents etcd bloat.

You can configure this limit globally in the config-gc ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-gc
  namespace: knative-serving
data:
  retain-since-create-time: "48h"
  retain-since-last-active-time: "15h"
  min-non-active-revisions: "2"
  max-non-active-revisions: "20"
```

Revisions that have not received traffic in 15 hours are eligible for deletion, but at least 2 non-active Revisions are kept for rollback purposes.
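That eligibility rule can be sketched as a simplified model (assuming Revisions are sorted newest first; the real garbage collector also honors retain-since-create-time and runs inside the controller):

```python
from datetime import datetime, timedelta

def gc_eligible(non_active_revisions, now,
                retain_since_last_active=timedelta(hours=15),
                min_keep=2, max_keep=20):
    """Return names of non-active Revisions eligible for deletion.

    non_active_revisions: list of (name, last_active_time), newest first.
    The youngest min_keep Revisions are always kept; anything past
    max_keep, or idle longer than retain_since_last_active, is eligible.
    """
    eligible = []
    for i, (name, last_active) in enumerate(non_active_revisions):
        if i < min_keep:
            continue  # always keep a rollback window
        if i >= max_keep or now - last_active > retain_since_last_active:
            eligible.append(name)
    return eligible

now = datetime(2024, 1, 2, 12, 0)
revs = [
    ("hello-00004", now - timedelta(hours=1)),
    ("hello-00003", now - timedelta(hours=2)),
    ("hello-00002", now - timedelta(hours=20)),  # idle > 15h -> eligible
    ("hello-00001", now - timedelta(hours=1)),
]
print(gc_eligible(revs, now))  # ['hello-00002']
```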

When a Knative Service receives no traffic for a configurable period (default 60 seconds), all pods are terminated. The Revision scales to zero replicas.

The Activator is a Knative system component that buffers requests while pods are starting. When a Revision is scaled to zero, the Route directs traffic to the Activator instead of the pods.

The Activator receives the request, wakes up the Revision (scales it from 0 to 1), and holds the request until a pod is ready. Once the pod passes health checks, the Activator forwards the buffered request to the pod.

From the client’s perspective, the request is slow (cold start latency) but not dropped. Subsequent requests go directly to the pod, bypassing the Activator.

Cold starts can take anywhere from 500ms to 10+ seconds depending on container image size, startup probes, and initialization logic.

Factors that affect cold start time:

  • Container image size: Larger images take longer to pull. Use slim base images (Alpine, distroless) and multi-stage builds.
  • Readiness probes: The pod must pass readiness checks before receiving traffic. Slow probes delay the first request.
  • Application initialization: Loading config files, connecting to databases, warming caches all add latency.

If cold starts are unacceptable, set min-scale: 1 to keep at least one pod always running:

```yaml
# Adapted from manifests/service-autoscale.yaml (min-scale raised to 1)
template:
  metadata:
    annotations:
      autoscaling.knative.dev/min-scale: "1"
      autoscaling.knative.dev/max-scale: "5"
```

Setting min-scale: 1 disables scale-to-zero but keeps autoscaling above 1 replica.

Two settings govern how long Knative waits before scaling to zero. The stable window (default 60 seconds) is how long the autoscaler must observe no traffic before deciding to scale down. The scale-to-zero grace period (default 30 seconds) is additional time the last pod is kept while the network reprograms to route through the Activator; it is a global setting in the config-autoscaler ConfigMap, not a per-Service annotation.

You can override the stable window per Service:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/window: "30s"
```

Shorter windows reduce resource consumption. Longer windows reduce cold starts during bursty traffic.

Knative Serving includes an autoscaler that watches request metrics and adjusts replica count. Unlike the HPA, which scales based on CPU or memory, Knative scales based on concurrency or requests per second.

Knative supports two autoscaling modes: KPA (Knative Pod Autoscaler) and HPA (Kubernetes Horizontal Pod Autoscaler).

KPA is the default. It scales based on concurrent requests per pod. KPA can scale to zero. It reacts faster than the HPA because it scrapes metrics every 2 seconds instead of every 15 seconds.

HPA is opt-in. It uses the standard Kubernetes HPA with CPU or memory metrics. HPA cannot scale to zero (minimum 1 replica). Use HPA when you need to scale on resource metrics instead of request metrics.

To enable HPA mode:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
      autoscaling.knative.dev/metric: "cpu"
      autoscaling.knative.dev/target: "70"
```

For most HTTP workloads, KPA with concurrency-based scaling is better.

The default autoscaling metric is concurrency. The autoscaler tries to keep each pod handling a target number of concurrent requests.

```yaml
# From manifests/service-autoscale.yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/target: "10"
```

This sets the target to 10 concurrent requests per pod. If a single pod is handling 30 concurrent requests, the autoscaler scales to 3 pods (30 / 10 = 3).

The concurrency metric is measured by the Queue Proxy sidecar. Every Knative pod includes a Queue Proxy container that sits in front of the application container. All requests pass through the Queue Proxy, which tracks how many are currently being processed.

You can switch to requests per second (RPS) instead of concurrency:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/metric: "rps"
      autoscaling.knative.dev/target: "50"
```

This scales to maintain 50 requests per second per pod. If traffic reaches 200 RPS, the autoscaler scales to 4 pods (200 / 50 = 4).

RPS scaling is better for workloads with predictable request rates. Concurrency scaling is better for workloads where requests have variable latency.
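The scaling arithmetic is the same for both metrics and can be sketched in a few lines (a simplified model: the real KPA also applies scale-up/scale-down rate limits and panic mode; the function name is ours):

```python
import math

def desired_pods(observed_load, per_pod_target, min_scale=0, max_scale=None):
    """Core of the KPA scaling decision.

    observed_load is the metric averaged over the stable window: total
    concurrent requests in concurrency mode, or total RPS in rps mode.
    per_pod_target is the autoscaling.knative.dev/target annotation.
    """
    desired = math.ceil(observed_load / per_pod_target)
    if max_scale is not None:
        desired = min(desired, max_scale)  # cap runaway scaling
    return max(desired, min_scale)

print(desired_pods(30, 10))   # 3  (concurrency example above)
print(desired_pods(200, 50))  # 4  (RPS example above)
```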

When traffic spikes suddenly, waiting for gradual scaling could overwhelm the existing pods. Knative has a panic mode that scales up aggressively during sudden load increases.

Panic mode triggers when the 6-second average concurrency exceeds 2x the target. In panic mode, the autoscaler computes the desired replica count based on the 6-second window instead of the 60-second stable window.

Normal scaling uses a 60-second window to avoid reacting to brief spikes. Panic mode uses a 6-second window to react quickly. Once the spike subsides, panic mode exits and normal scaling resumes.

You can tune panic mode thresholds:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/panic-threshold-percentage: "200"
      autoscaling.knative.dev/panic-window-percentage: "10"
```

panic-threshold-percentage: 200 means panic mode triggers when concurrency exceeds 200% of the target. panic-window-percentage: 10 sets the panic window to 10% of the stable window (6 seconds if the stable window is 60 seconds).
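The trigger condition can be sketched as a toy model of the decision (not the controller's actual code; default values assumed from the annotations above):

```python
import math

def panic_state(stable_avg, panic_avg, target,
                panic_threshold_pct=200.0, stable_window_s=60,
                panic_window_pct=10.0):
    """Decide panic mode and the resulting desired pod count.

    panic_avg is concurrency averaged over the panic window, which is
    panic_window_pct percent of the stable window. Panic triggers when
    that short-window average exceeds the threshold times the target.
    """
    panic_window_s = stable_window_s * panic_window_pct / 100.0
    panicking = panic_avg / target * 100.0 >= panic_threshold_pct
    observed = panic_avg if panicking else stable_avg
    return panicking, panic_window_s, math.ceil(observed / target)

print(panic_state(stable_avg=12, panic_avg=25, target=10))
# (True, 6.0, 3): 25 concurrent vs a target of 10 is 250%, above 200%
```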

The max-scale annotation caps the number of replicas:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/max-scale: "5"
```

This prevents runaway scaling during a traffic spike or DoS attack. Without a max-scale, a sudden flood of requests could spin up hundreds of pods and exhaust cluster resources.

Knative supports fine-grained traffic splitting across Revisions. This enables canary deployments, blue-green deployments, and A/B testing.

The traffic block defines how traffic is distributed:

# From manifests/service-hello-v2.yaml
traffic:
- latestRevision: true
percent: 80
- latestRevision: false
percent: 20
revisionName: hello-00001

This sends 80% of traffic to the latest Revision (hello-00002) and 20% to a pinned Revision (hello-00001). The percentages must sum to 100.

Instead of latestRevision: true, you can pin both routes to specific Revisions:

```yaml
# From manifests/service-hello-v3.yaml
traffic:
  - revisionName: hello-00002
    percent: 50
  - revisionName: hello-00003
    percent: 50
```

This creates a 50/50 blue-green split between Revision 2 and Revision 3. Even if you update the Service again (creating hello-00004), traffic continues to split 50/50 between hello-00002 and hello-00003 because the traffic block explicitly pins them.

You can assign tags to Revisions for stable URLs:

```yaml
traffic:
  - revisionName: hello-00001
    percent: 100
    tag: stable
  - revisionName: hello-00002
    percent: 0
    tag: canary
```

This routes 100% of traffic to hello-00001 but creates a separate URL for hello-00002:

```
http://hello.knative-demo.svc.cluster.local          # 100% hello-00001
http://stable-hello.knative-demo.svc.cluster.local   # hello-00001
http://canary-hello.knative-demo.svc.cluster.local   # hello-00002
```

The canary tag receives 0% of the main traffic but is reachable via its own URL for testing.
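Assuming the default tag template (tag, hyphen, service name; configurable via tag-template in the config-network ConfigMap), the URL shapes can be sketched as:

```python
def route_urls(service, namespace, domain, tags=()):
    """URL shape Knative gives a Route: one main URL plus one per tag."""
    urls = {"main": f"http://{service}.{namespace}.{domain}"}
    for tag in tags:
        # Default tag-template prefixes the tag to the service name.
        urls[tag] = f"http://{tag}-{service}.{namespace}.{domain}"
    return urls

for url in route_urls("hello", "knative-demo", "svc.cluster.local",
                      tags=("stable", "canary")).values():
    print(url)
```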

A typical canary rollout:

  1. Deploy new version with 95% stable, 5% canary.
  2. Monitor metrics (error rate, latency) for the canary.
  3. If metrics look good, increase to 80% stable, 20% canary.
  4. Continue increasing until 0% stable, 100% canary.
  5. Remove the old Revision from the traffic block.

Knative does not automate this process. You must manually update the traffic block at each phase. For automated canary analysis, use Flagger or Argo Rollouts (which support Knative as a deployment target).

Knative Serving requires a networking layer to route external traffic to Services. The most common options are Kourier, Istio, and Contour.

Kourier is a lightweight Envoy-based ingress controller designed specifically for Knative. It has minimal dependencies and is easier to set up than Istio.

The demo uses Kourier:

```bash
kubectl apply -f https://github.com/knative/net-kourier/releases/latest/download/kourier.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
```

Kourier creates an Envoy proxy that handles HTTP routing based on the Route’s traffic split configuration.

Istio is a full service mesh. It provides advanced traffic management (mirroring, fault injection), mTLS between services, and detailed observability.

If you already run Istio for other workloads, use it for Knative:

```bash
kubectl apply -f https://github.com/knative/net-istio/releases/latest/download/net-istio.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"istio.ingress.networking.knative.dev"}}'
```

Istio adds significant complexity and resource overhead. Use Kourier unless you need Istio’s advanced features.

Contour is another Envoy-based ingress controller. It is more feature-rich than Kourier but lighter than Istio.

```bash
kubectl apply -f https://github.com/knative/net-contour/releases/latest/download/contour.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"contour.ingress.networking.knative.dev"}}'
```

Contour supports advanced routing rules and integrates with cert-manager for TLS.

Knative generates URLs like http://hello.knative-demo.10.0.0.1.sslip.io. The domain comes from the config-domain ConfigMap.

For local development with minikube, use the serving-default-domain job:

```bash
kubectl apply -f https://github.com/knative/serving/releases/latest/download/serving-default-domain.yaml
```

This configures sslip.io DNS. Any hostname of the form <IP>.sslip.io resolves to <IP>, so with your minikube IP in the domain, the generated URLs resolve without any DNS records to manage.

For production, configure a real domain:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  example.com: ""
```

Services in the default namespace get URLs like http://hello.default.example.com.

Every Knative pod includes a Queue Proxy container. This sidecar sits in front of the application container and handles several critical tasks.

The Queue Proxy tracks how many requests are currently in flight (concurrency) and how many requests per second are being served. These metrics feed the autoscaler.

Without the Queue Proxy, the autoscaler would have no visibility into request load. It would have to rely on CPU or memory, which are poor proxies for HTTP load.

You can limit how many concurrent requests a single pod accepts:

```yaml
spec:
  containerConcurrency: 10
```

This tells the Queue Proxy to reject requests when 10 are already in flight. Excess requests are queued at the Activator or load balancer until a pod becomes available.

This is useful for workloads that consume significant per-request resources (like database connections). Without a concurrency limit, a single pod might accept 1000 concurrent requests and exhaust its connection pool.

The default containerConcurrency: 0 means unlimited.
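The gate's behavior can be modeled in a few lines (a toy sketch: the real Queue Proxy is written in Go and briefly queues excess requests rather than rejecting them outright; class and method names here are illustrative):

```python
import threading

class ConcurrencyGate:
    """Toy model of the Queue Proxy's containerConcurrency gate."""

    def __init__(self, container_concurrency):
        # 0 means unlimited, matching Knative's default.
        self.limit = container_concurrency
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self):
        with self.lock:
            if self.limit and self.in_flight >= self.limit:
                return False  # over the limit: hold the request upstream
            self.in_flight += 1
            return True

    def done(self):
        with self.lock:
            self.in_flight -= 1

gate = ConcurrencyGate(container_concurrency=2)
print(gate.try_admit(), gate.try_admit(), gate.try_admit())  # True True False
```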

The Queue Proxy exposes health check endpoints that Kubernetes uses for liveness and readiness probes. This decouples the application’s health check logic from Kubernetes probe configuration.

Knative Serving is not the only way to run serverless workloads on Kubernetes.

OpenFaaS is a simpler serverless framework. You write a function, wrap it in a container, and deploy it with the faas-cli. OpenFaaS scales based on RPS with Prometheus metrics.

OpenFaaS advantages: Easier to get started. Less Kubernetes knowledge required. Built-in UI for deploying functions.

Knative advantages: Native Kubernetes resources. Better integration with GitOps tools (Argo CD, Flux). More flexible traffic splitting. Revision-based deployments.

Use OpenFaaS if you want a batteries-included platform for deploying functions. Use Knative if you want serverless capabilities that feel like native Kubernetes.

AWS Lambda is a fully managed serverless platform. You upload code, AWS runs it, you pay per invocation.

Lambda advantages: Zero infrastructure management. Instant scaling to thousands of concurrent invocations. Built-in integrations with AWS services.

Knative advantages: Runs on any Kubernetes cluster (on-prem, multi-cloud). No vendor lock-in. Full control over the container image and runtime.

Use Lambda if you are all-in on AWS and want zero operational overhead. Use Knative if you need portability or want to avoid cloud provider lock-in.

A standard Deployment with an HPA can autoscale based on CPU or custom metrics.

HPA advantages: Simpler. No new CRDs or components. Well understood by most Kubernetes users.

Knative advantages: Scale to zero. Concurrency-based autoscaling. Revision-based traffic splitting. Faster scaling (2-second metric scrape vs 15-second).

Use HPA for always-on workloads with predictable traffic. Use Knative for bursty workloads where scale-to-zero saves significant resources.

Minimize cold start latency with these techniques:

Use small container images: Multi-stage builds, distroless base images, and layer caching reduce pull time.

Optimize application startup: Defer expensive initialization (loading ML models, warming caches) until after the first request.

Set aggressive readiness probes: The pod cannot receive traffic until it passes readiness. Reduce initialDelaySeconds and periodSeconds if your app starts quickly.

Keep warm with min-scale: For latency-sensitive workloads, set min-scale: 1 or higher to keep a pool of warm pods.

Pre-warm with traffic: Send periodic health check requests to prevent scale-to-zero during low-traffic periods.

Always set CPU and memory limits on Knative Services:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 128Mi
```

Without limits, a single pod could consume unbounded resources. With hundreds of autoscaled pods, this could exhaust the cluster.

Requests are used for scheduling. Limits are enforced by the container runtime. Set limits slightly higher than typical usage to allow for spikes without throttling.

Knative generates detailed metrics via Prometheus. Key metrics to monitor:

  • revision_request_count: Total requests per Revision.
  • revision_request_latencies: Request latency histogram.
  • autoscaler_desired_pods: The autoscaler’s target replica count.
  • autoscaler_actual_pods: Current replica count.
  • activator_request_count: Requests buffered by the Activator (cold starts).

Enable Prometheus scraping:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  metrics.backend-destination: "prometheus"
```

Then deploy a ServiceMonitor to scrape Knative pods.

Network policies: Restrict which namespaces can reach Knative Services. By default, Routes are publicly accessible.

mTLS: If using Istio, enable mTLS between the ingress and Revisions.

Resource quotas: Set namespace-level ResourceQuotas to prevent a single Service from consuming all cluster resources.

Image scanning: Scan container images for vulnerabilities before deploying.

Without DNS setup, Services get URLs like http://hello.knative-demo.svc.cluster.local. This works within the cluster but not from outside.

You must either configure a real domain in config-domain or use the serving-default-domain job for local development.

If you install Kourier but forget to patch config-network, Knative defaults to Istio. Services will be stuck in “NotReady” because the Istio ingress does not exist.

Always verify the ingress class:

```bash
kubectl get configmap config-network -n knative-serving -o yaml
```

A traffic block whose percentages do not sum to 100 is rejected by the API:

```yaml
traffic:
  - latestRevision: true
    percent: 80
  - revisionName: hello-00001
    percent: 10
```

The sum is 90. Knative requires exactly 100. Add another route or adjust the percentages.
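A quick client-side check mirroring that validation can catch this before you apply the manifest (a sketch; the real admission webhook also validates tags and revision references):

```python
def validate_traffic(targets):
    """Reject traffic splits whose percentages do not total exactly 100."""
    total = sum(t.get("percent", 0) for t in targets)
    if total != 100:
        raise ValueError(f"traffic split sums to {total}%, must be 100%")
    return True

validate_traffic([{"latestRevision": True, "percent": 80},
                  {"revisionName": "hello-00001", "percent": 20}])  # OK
try:
    validate_traffic([{"latestRevision": True, "percent": 80},
                      {"revisionName": "hello-00001", "percent": 10}])
except ValueError as e:
    print(e)  # traffic split sums to 90%, must be 100%
```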

Revisions are named <service-name>-<generation>. The generation is auto-incremented and not related to your version tags.

Do not hardcode Revision names like hello-00003 in YAML. Use latestRevision: true or tags. The generation number changes every time the Configuration is updated, including non-functional changes like annotations.

If you run a load test with long pauses between requests, Knative might scale to zero mid-test. Your results will include cold start latency and look terrible.

Either set min-scale: 1 during testing or generate continuous load.

Setting containerConcurrency: 1 means each pod handles only one request at a time. If you have high RPS, the autoscaler spins up hundreds of pods.

For most HTTP services, containerConcurrency: 0 (unlimited) is appropriate. The concurrency target controls scaling, not the per-pod limit.