Knative Serving: Deep Dive

Kubernetes excels at running long-lived workloads. You create a Deployment, it spins up pods, and those pods run until you delete them. This works well for always-on services like APIs and web servers.

But many workloads are bursty. A webhook receiver might handle 100 requests per minute during business hours and zero requests overnight. A data processing function might run once an hour. A development preview environment might only see traffic when developers are testing.

In these scenarios, running pods 24/7 wastes resources. Traditional Kubernetes offers the Horizontal Pod Autoscaler, which adds and removes pods based on CPU or custom metrics. But the HPA has a hard floor of 1 replica. It cannot scale to zero. Your pods consume cluster resources even when completely idle.

Cloud providers solved this with serverless offerings like AWS Lambda, Google Cloud Functions, and Azure Functions. You write a function, upload it, and the platform runs it only when invoked. When idle, your function consumes zero resources and costs nothing.

Knative Serving brings this serverless model to Kubernetes. It provides scale-to-zero, automatic scale-up on incoming requests, revision-based deployments for traffic splitting, and concurrency-based autoscaling. You get the operational simplicity of serverless without leaving Kubernetes.

Knative Serving introduces new resource types. Understanding the relationship between them is essential.

A Knative Service is a high-level abstraction. When you create a Service, Knative creates two child resources: a Configuration and a Route.

```yaml
# From manifests/service-hello.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
  namespace: knative-demo
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "5"
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          env:
            - name: TARGET
              value: "World v1"
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
```

The Service creates a Configuration with the container spec and resource limits. The Configuration creates a Revision. Each Revision is an immutable snapshot of the container image, environment variables, and resource requests at a specific point in time.

When you update the Service (change the image, add an env var, modify resources), the Configuration creates a new Revision. Old Revisions continue to exist. You can route traffic to any Revision, not just the latest.

The Route defines how traffic is distributed across Revisions. It maps the Service’s URL to one or more Revisions with percentage-based weights.

```
Service (hello)
├── Configuration (hello)
│   ├── Revision (hello-00001)
│   └── Revision (hello-00002)
└── Route (hello)
    └── Traffic split (80% hello-00002, 20% hello-00001)
```

This separation of concerns is powerful. The Configuration owns the “what” (container spec). The Route owns the “how” (traffic routing). You can change routing without creating a new Revision.

Revisions never change. If you want to update the container image, you do not modify the Revision. You update the Service, which creates a new Revision.

This immutability enables safe rollbacks. You can pin traffic to an old Revision at any time. The old Revision’s pods still exist (or can be recreated if scaled to zero).

Revisions are named sequentially: hello-00001, hello-00002, hello-00003. The generation number is automatically incremented each time the Configuration changes.
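The naming rule is simple enough to sketch (the generation number is zero-padded to five digits; the function name here is ours, not Knative's):

```python
def revision_name(service: str, generation: int) -> str:
    """Knative revision names: service name plus zero-padded generation."""
    return f"{service}-{generation:05d}"

print(revision_name("hello", 1))  # hello-00001
print(revision_name("hello", 2))  # hello-00002
```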

By default, Knative keeps the last 20 Revisions. Older Revisions are automatically deleted. This prevents etcd bloat.

You can configure this limit globally in the config-gc ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-gc
  namespace: knative-serving
data:
  retain-since-create-time: "48h"
  retain-since-last-active-time: "15h"
  min-non-active-revisions: "2"
  max-non-active-revisions: "20"
```

Revisions that have not received traffic in 15 hours are eligible for deletion, but at least 2 non-active Revisions are kept for rollback purposes.
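That eligibility rule can be sketched as a simplified model (assuming Revisions are sorted newest first; the real garbage collector also honors retain-since-create-time and runs inside the controller):

```python
from datetime import datetime, timedelta

def gc_eligible(non_active_revisions, now,
                retain_since_last_active=timedelta(hours=15),
                min_keep=2, max_keep=20):
    """Return names of non-active Revisions eligible for deletion.

    non_active_revisions: list of (name, last_active_time), newest first.
    The youngest min_keep Revisions are always kept; anything past
    max_keep, or idle longer than retain_since_last_active, is eligible.
    """
    eligible = []
    for i, (name, last_active) in enumerate(non_active_revisions):
        if i < min_keep:
            continue  # always keep a rollback window
        if i >= max_keep or now - last_active > retain_since_last_active:
            eligible.append(name)
    return eligible

now = datetime(2024, 1, 2, 12, 0)
revs = [
    ("hello-00004", now - timedelta(hours=1)),
    ("hello-00003", now - timedelta(hours=2)),
    ("hello-00002", now - timedelta(hours=20)),  # idle > 15h -> eligible
    ("hello-00001", now - timedelta(hours=1)),
]
print(gc_eligible(revs, now))  # ['hello-00002']
```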

When a Knative Service receives no traffic for a configurable period (default 60 seconds), all pods are terminated. The Revision scales to zero replicas.

The Activator is a Knative system component that buffers requests while pods are starting. When a Revision is scaled to zero, the Route directs traffic to the Activator instead of the pods.

The Activator receives the request, wakes up the Revision (scales it from 0 to 1), and holds the request until a pod is ready. Once the pod passes health checks, the Activator forwards the buffered request to the pod.

From the client’s perspective, the request is slow (cold start latency) but not dropped. Subsequent requests go directly to the pod, bypassing the Activator.

Cold starts can take anywhere from 500ms to 10+ seconds depending on container image size, startup probes, and initialization logic.

Factors that affect cold start time:

  • Container image size: Larger images take longer to pull. Use slim base images (Alpine, distroless) and multi-stage builds.
  • Readiness probes: The pod must pass readiness checks before receiving traffic. Slow probes delay the first request.
  • Application initialization: Loading config files, connecting to databases, warming caches all add latency.

If cold starts are unacceptable, set min-scale: 1 to keep at least one pod always running:

```yaml
# Adapted from manifests/service-autoscale.yaml (min-scale raised to 1)
template:
  metadata:
    annotations:
      autoscaling.knative.dev/min-scale: "1"
      autoscaling.knative.dev/max-scale: "5"
```

Setting min-scale: 1 disables scale-to-zero but keeps autoscaling above 1 replica.

Two settings govern how long Knative waits before scaling to zero. The stable window (default 60 seconds) is how long the autoscaler must observe no traffic before deciding to scale down. The scale-to-zero grace period (default 30 seconds) is additional time the last pod is kept while the network reprograms to route through the Activator; it is a global setting in the config-autoscaler ConfigMap, not a per-Service annotation.

You can override the stable window per Service:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/window: "30s"
```

Shorter windows reduce resource consumption. Longer windows reduce cold starts during bursty traffic.

Knative Serving includes an autoscaler that watches request metrics and adjusts replica count. Unlike the HPA, which scales based on CPU or memory, Knative scales based on concurrency or requests per second.

Knative supports two autoscaling modes: KPA (Knative Pod Autoscaler) and HPA (Kubernetes Horizontal Pod Autoscaler).

KPA is the default. It scales based on concurrent requests per pod. KPA can scale to zero. It reacts faster than the HPA because it scrapes metrics every 2 seconds instead of every 15 seconds.

HPA is opt-in. It uses the standard Kubernetes HPA with CPU or memory metrics. HPA cannot scale to zero (minimum 1 replica). Use HPA when you need to scale on resource metrics instead of request metrics.

To enable HPA mode:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
      autoscaling.knative.dev/metric: "cpu"
      autoscaling.knative.dev/target: "70"
```

For most HTTP workloads, KPA with concurrency-based scaling is better.

The default autoscaling metric is concurrency. The autoscaler tries to keep each pod handling a target number of concurrent requests.

```yaml
# From manifests/service-autoscale.yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/target: "10"
```

This sets the target to 10 concurrent requests per pod. If a single pod is handling 30 concurrent requests, the autoscaler scales to 3 pods (30 / 10 = 3).

The concurrency metric is measured by the Queue Proxy sidecar. Every Knative pod includes a Queue Proxy container that sits in front of the application container. All requests pass through the Queue Proxy, which tracks how many are currently being processed.

You can switch to requests per second (RPS) instead of concurrency:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/metric: "rps"
      autoscaling.knative.dev/target: "50"
```

This scales to maintain 50 requests per second per pod. If traffic reaches 200 RPS, the autoscaler scales to 4 pods (200 / 50 = 4).

RPS scaling is better for workloads with predictable request rates. Concurrency scaling is better for workloads where requests have variable latency.
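The scaling arithmetic is the same for both metrics and can be sketched in a few lines (a simplified model: the real KPA also applies scale-up/scale-down rate limits and panic mode; the function name is ours):

```python
import math

def desired_pods(observed_load, per_pod_target, min_scale=0, max_scale=None):
    """Core of the KPA scaling decision.

    observed_load is the metric averaged over the stable window: total
    concurrent requests in concurrency mode, or total RPS in rps mode.
    per_pod_target is the autoscaling.knative.dev/target annotation.
    """
    desired = math.ceil(observed_load / per_pod_target)
    if max_scale is not None:
        desired = min(desired, max_scale)  # cap runaway scaling
    return max(desired, min_scale)

print(desired_pods(30, 10))   # 3  (concurrency example above)
print(desired_pods(200, 50))  # 4  (RPS example above)
```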

When traffic spikes suddenly, waiting for gradual scaling could overwhelm the existing pods. Knative has a panic mode that scales up aggressively during sudden load increases.

Panic mode triggers when the 6-second average concurrency exceeds 2x the target. In panic mode, the autoscaler computes the desired replica count based on the 6-second window instead of the 60-second stable window.

Normal scaling uses a 60-second window to avoid reacting to brief spikes. Panic mode uses a 6-second window to react quickly. Once the spike subsides, panic mode exits and normal scaling resumes.

You can tune panic mode thresholds:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/panic-threshold-percentage: "200"
      autoscaling.knative.dev/panic-window-percentage: "10"
```

panic-threshold-percentage: 200 means panic mode triggers when concurrency exceeds 200% of the target. panic-window-percentage: 10 sets the panic window to 10% of the stable window (6 seconds if the stable window is 60 seconds).
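The trigger condition can be sketched as a toy model of the decision (not the controller's actual code; default values assumed from the annotations above):

```python
import math

def panic_state(stable_avg, panic_avg, target,
                panic_threshold_pct=200.0, stable_window_s=60,
                panic_window_pct=10.0):
    """Decide panic mode and the resulting desired pod count.

    panic_avg is concurrency averaged over the panic window, which is
    panic_window_pct percent of the stable window. Panic triggers when
    that short-window average exceeds the threshold times the target.
    """
    panic_window_s = stable_window_s * panic_window_pct / 100.0
    panicking = panic_avg / target * 100.0 >= panic_threshold_pct
    observed = panic_avg if panicking else stable_avg
    return panicking, panic_window_s, math.ceil(observed / target)

print(panic_state(stable_avg=12, panic_avg=25, target=10))
# (True, 6.0, 3): 25 concurrent vs a target of 10 is 250%, above 200%
```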

The max-scale annotation caps the number of replicas:

```yaml
template:
  metadata:
    annotations:
      autoscaling.knative.dev/max-scale: "5"
```

This prevents runaway scaling during a traffic spike or DoS attack. Without a max-scale, a sudden flood of requests could spin up hundreds of pods and exhaust cluster resources.

Knative supports fine-grained traffic splitting across Revisions. This enables canary deployments, blue-green deployments, and A/B testing.

The traffic block defines how traffic is distributed:

# From manifests/service-hello-v2.yaml
traffic:
- latestRevision: true
percent: 80
- latestRevision: false
percent: 20
revisionName: hello-00001

This sends 80% of traffic to the latest Revision (hello-00002) and 20% to a pinned Revision (hello-00001). The percentages must sum to 100.

Instead of latestRevision: true, you can pin both routes to specific Revisions:

```yaml
# From manifests/service-hello-v3.yaml
traffic:
  - revisionName: hello-00002
    percent: 50
  - revisionName: hello-00003
    percent: 50
```

This creates a 50/50 blue-green split between Revision 2 and Revision 3. Even if you update the Service again (creating hello-00004), traffic continues to split 50/50 between hello-00002 and hello-00003 because the traffic block explicitly pins them.

You can assign tags to Revisions for stable URLs:

```yaml
traffic:
  - revisionName: hello-00001
    percent: 100
    tag: stable
  - revisionName: hello-00002
    percent: 0
    tag: canary
```

This routes 100% of traffic to hello-00001 but creates a separate URL for hello-00002:

```
http://hello.knative-demo.svc.cluster.local          # 100% hello-00001
http://stable-hello.knative-demo.svc.cluster.local   # hello-00001
http://canary-hello.knative-demo.svc.cluster.local   # hello-00002
```

The canary tag receives 0% of the main traffic but is reachable via its own URL for testing.
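Assuming the default tag template (tag, hyphen, service name; configurable via tag-template in the config-network ConfigMap), the URL shapes can be sketched as:

```python
def route_urls(service, namespace, domain, tags=()):
    """URL shape Knative gives a Route: one main URL plus one per tag."""
    urls = {"main": f"http://{service}.{namespace}.{domain}"}
    for tag in tags:
        # Default tag-template prefixes the tag to the service name.
        urls[tag] = f"http://{tag}-{service}.{namespace}.{domain}"
    return urls

for url in route_urls("hello", "knative-demo", "svc.cluster.local",
                      tags=("stable", "canary")).values():
    print(url)
```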

A typical canary rollout:

  1. Deploy new version with 95% stable, 5% canary.
  2. Monitor metrics (error rate, latency) for the canary.
  3. If metrics look good, increase to 80% stable, 20% canary.
  4. Continue increasing until 0% stable, 100% canary.
  5. Remove the old Revision from the traffic block.

Knative does not automate this process. You must manually update the traffic block at each phase. For automated canary analysis, use Flagger or Argo Rollouts (which support Knative as a deployment target).

Knative Serving requires a networking layer to route external traffic to Services. The most common options are Kourier, Istio, and Contour.

Kourier is a lightweight Envoy-based ingress controller designed specifically for Knative. It has minimal dependencies and is easier to set up than Istio.

The demo uses Kourier:

```bash
kubectl apply -f https://github.com/knative/net-kourier/releases/latest/download/kourier.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
```

Kourier creates an Envoy proxy that handles HTTP routing based on the Route’s traffic split configuration.

Istio is a full service mesh. It provides advanced traffic management (mirroring, fault injection), mTLS between services, and detailed observability.

If you already run Istio for other workloads, use it for Knative:

```bash
kubectl apply -f https://github.com/knative/net-istio/releases/latest/download/net-istio.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"istio.ingress.networking.knative.dev"}}'
```

Istio adds significant complexity and resource overhead. Use Kourier unless you need Istio’s advanced features.

Contour is another Envoy-based ingress controller. It is more feature-rich than Kourier but lighter than Istio.

```bash
kubectl apply -f https://github.com/knative/net-contour/releases/latest/download/contour.yaml
kubectl patch configmap/config-network -n knative-serving --type merge \
  -p '{"data":{"ingress-class":"contour.ingress.networking.knative.dev"}}'
```

Contour supports advanced routing rules and integrates with cert-manager for TLS.

Knative generates URLs like http://hello.knative-demo.10.0.0.1.sslip.io. The domain comes from the config-domain ConfigMap.

For local development with minikube, use the serving-default-domain job:

```bash
kubectl apply -f https://github.com/knative/serving/releases/latest/download/serving-default-domain.yaml
```

This configures sslip.io DNS. Any hostname of the form <IP>.sslip.io resolves to <IP>, so with your minikube IP in the domain, the generated URLs resolve without any DNS records to manage.

For production, configure a real domain:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  example.com: ""
```

Services in the default namespace get URLs like http://hello.default.example.com.

Every Knative pod includes a Queue Proxy container. This sidecar sits in front of the application container and handles several critical tasks.

The Queue Proxy tracks how many requests are currently in flight (concurrency) and how many requests per second are being served. These metrics feed the autoscaler.

Without the Queue Proxy, the autoscaler would have no visibility into request load. It would have to rely on CPU or memory, which are poor proxies for HTTP load.

You can limit how many concurrent requests a single pod accepts:

```yaml
spec:
  containerConcurrency: 10
```

This tells the Queue Proxy to reject requests when 10 are already in flight. Excess requests are queued at the Activator or load balancer until a pod becomes available.

This is useful for workloads that consume significant per-request resources (like database connections). Without a concurrency limit, a single pod might accept 1000 concurrent requests and exhaust its connection pool.

The default containerConcurrency: 0 means unlimited.
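The gate's behavior can be modeled in a few lines (a toy sketch: the real Queue Proxy is written in Go and briefly queues excess requests rather than rejecting them outright; class and method names here are illustrative):

```python
import threading

class ConcurrencyGate:
    """Toy model of the Queue Proxy's containerConcurrency gate."""

    def __init__(self, container_concurrency):
        # 0 means unlimited, matching Knative's default.
        self.limit = container_concurrency
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self):
        with self.lock:
            if self.limit and self.in_flight >= self.limit:
                return False  # over the limit: hold the request upstream
            self.in_flight += 1
            return True

    def done(self):
        with self.lock:
            self.in_flight -= 1

gate = ConcurrencyGate(container_concurrency=2)
print(gate.try_admit(), gate.try_admit(), gate.try_admit())  # True True False
```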

The Queue Proxy exposes health check endpoints that Kubernetes uses for liveness and readiness probes. This decouples the application’s health check logic from Kubernetes probe configuration.

Knative Serving is not the only way to run serverless workloads on Kubernetes.

OpenFaaS is a simpler serverless framework. You write a function, wrap it in a container, and deploy it with the faas-cli. OpenFaaS scales based on RPS with Prometheus metrics.

OpenFaaS advantages: Easier to get started. Less Kubernetes knowledge required. Built-in UI for deploying functions.

Knative advantages: Native Kubernetes resources. Better integration with GitOps tools (Argo CD, Flux). More flexible traffic splitting. Revision-based deployments.

Use OpenFaaS if you want a batteries-included platform for deploying functions. Use Knative if you want serverless capabilities that feel like native Kubernetes.

AWS Lambda is a fully managed serverless platform. You upload code, AWS runs it, you pay per invocation.

Lambda advantages: Zero infrastructure management. Instant scaling to thousands of concurrent invocations. Built-in integrations with AWS services.

Knative advantages: Runs on any Kubernetes cluster (on-prem, multi-cloud). No vendor lock-in. Full control over the container image and runtime.

Use Lambda if you are all-in on AWS and want zero operational overhead. Use Knative if you need portability or want to avoid cloud provider lock-in.

A standard Deployment with an HPA can autoscale based on CPU or custom metrics.

HPA advantages: Simpler. No new CRDs or components. Well understood by most Kubernetes users.

Knative advantages: Scale to zero. Concurrency-based autoscaling. Revision-based traffic splitting. Faster scaling (2-second metric scrape vs 15-second).

Use HPA for always-on workloads with predictable traffic. Use Knative for bursty workloads where scale-to-zero saves significant resources.

Minimize cold start latency with these techniques:

Use small container images: Multi-stage builds, distroless base images, and layer caching reduce pull time.

Optimize application startup: Defer expensive initialization (loading ML models, warming caches) until after the first request.

Set aggressive readiness probes: The pod cannot receive traffic until it passes readiness. Reduce initialDelaySeconds and periodSeconds if your app starts quickly.

Keep warm with min-scale: For latency-sensitive workloads, set min-scale: 1 or higher to keep a pool of warm pods.

Pre-warm with traffic: Send periodic health check requests to prevent scale-to-zero during low-traffic periods.

Always set CPU and memory limits on Knative Services:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 128Mi
```

Without limits, a single pod could consume unbounded resources. With hundreds of autoscaled pods, this could exhaust the cluster.

Requests are used for scheduling. Limits are enforced by the container runtime. Set limits slightly higher than typical usage to allow for spikes without throttling.

Knative generates detailed metrics via Prometheus. Key metrics to monitor:

  • revision_request_count: Total requests per Revision.
  • revision_request_latencies: Request latency histogram.
  • autoscaler_desired_pods: The autoscaler’s target replica count.
  • autoscaler_actual_pods: Current replica count.
  • activator_request_count: Requests buffered by the Activator (cold starts).

Enable Prometheus scraping:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  metrics.backend-destination: "prometheus"
```

Then deploy a ServiceMonitor to scrape Knative pods.

Network policies: Restrict which namespaces can reach Knative Services. By default, Routes are publicly accessible.

mTLS: If using Istio, enable mTLS between the ingress and Revisions.

Resource quotas: Set namespace-level ResourceQuotas to prevent a single Service from consuming all cluster resources.

Image scanning: Scan container images for vulnerabilities before deploying.

Without DNS setup, Services get URLs like http://hello.knative-demo.svc.cluster.local. This works within the cluster but not from outside.

You must either configure a real domain in config-domain or use the serving-default-domain job for local development.

If you install Kourier but forget to patch config-network, Knative defaults to Istio. Services will be stuck in “NotReady” because the Istio ingress does not exist.

Always verify the ingress class:

```bash
kubectl get configmap config-network -n knative-serving -o yaml
```

A traffic block whose percentages do not sum to 100 is rejected by the API:

```yaml
traffic:
  - latestRevision: true
    percent: 80
  - revisionName: hello-00001
    percent: 10
```

The sum is 90. Knative requires exactly 100. Add another route or adjust the percentages.
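A quick client-side check mirroring that validation can catch this before you apply the manifest (a sketch; the real admission webhook also validates tags and revision references):

```python
def validate_traffic(targets):
    """Reject traffic splits whose percentages do not total exactly 100."""
    total = sum(t.get("percent", 0) for t in targets)
    if total != 100:
        raise ValueError(f"traffic split sums to {total}%, must be 100%")
    return True

validate_traffic([{"latestRevision": True, "percent": 80},
                  {"revisionName": "hello-00001", "percent": 20}])  # OK
try:
    validate_traffic([{"latestRevision": True, "percent": 80},
                      {"revisionName": "hello-00001", "percent": 10}])
except ValueError as e:
    print(e)  # traffic split sums to 90%, must be 100%
```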

Revisions are named <service-name>-<generation>. The generation is auto-incremented and not related to your version tags.

Do not hardcode Revision names like hello-00003 in YAML. Use latestRevision: true or tags. The generation number changes every time the Configuration is updated, including non-functional changes like annotations.

If you run a load test with long pauses between requests, Knative might scale to zero mid-test. Your results will include cold start latency and look terrible.

Either set min-scale: 1 during testing or generate continuous load.

Setting containerConcurrency: 1 means each pod handles only one request at a time. If you have high RPS, the autoscaler spins up hundreds of pods.

For most HTTP services, containerConcurrency: 0 (unlimited) is appropriate. The concurrency target controls scaling, not the per-pod limit.