Progressive Delivery: Deep Dive
What is Progressive Delivery
Progressive delivery extends continuous delivery by adding gradual rollouts with automated verification. Instead of deploying a new version to all users at once, traffic shifts incrementally while health metrics are monitored. If something goes wrong, the rollout stops or rolls back automatically.
The term was coined by James Governor (RedMonk) and popularized by tools like Argo Rollouts, Flagger, and LaunchDarkly.
Why Not Just Use Deployment Rolling Updates
A standard Kubernetes Deployment rolling update replaces old pods with new ones at a controlled rate (maxSurge, maxUnavailable). But it has limitations:
- No traffic splitting - as soon as a new pod is ready, it receives equal traffic. There is no way to send 10% of traffic to the new version.
- No automated analysis - the Deployment controller checks readiness probes, but it cannot query Prometheus for error rates or latency.
- No automatic rollback on metrics - if the new version passes readiness probes but has higher error rates, the Deployment does not roll back.
- No pause between steps - the rollout proceeds as fast as pods become ready.
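For reference, the rate-control knobs mentioned above live under the Deployment's update strategy; a minimal sketch with an illustrative app name:

```yaml
# Standard Deployment rolling update: controls pod replacement rate only.
# There is no traffic weighting, metric analysis, pausing, or metric-driven rollback.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app              # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # at most 1 extra pod during the update
      maxUnavailable: 0       # never drop below the desired replica count
  selector:
    matchLabels: { app: demo-app }
  template:
    metadata:
      labels: { app: demo-app }
    spec:
      containers:
      - name: app
        image: example/app:v2
```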
Argo Rollouts solves all of these.
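A minimal Rollout sketch (names and image are illustrative) showing how the canary strategy fills those gaps, with weighted steps and pauses between them:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-app               # hypothetical name
spec:
  replicas: 4
  strategy:
    canary:
      steps:
      - setWeight: 10            # shift ~10% of traffic to the new version
      - pause: { duration: 5m }  # hold here so health can be verified
      - setWeight: 50
      - pause: { duration: 5m }  # promotion continues only if the rollout is healthy
  selector:
    matchLabels: { app: canary-app }
  template:                      # same pod template a Deployment would use
    metadata:
      labels: { app: canary-app }
    spec:
      containers:
      - name: app
        image: example/app:v2
```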
Argo Rollouts Architecture
Components
- Rollout Controller - watches Rollout CRDs and manages ReplicaSets, similar to the Deployment controller
- AnalysisRun Controller - executes analysis templates and reports results
- Rollout CRD - replaces Deployment with the same pod spec but adds strategy configuration
- AnalysisTemplate CRD - defines metrics to check during rollout
- AnalysisRun CRD - an instance of an AnalysisTemplate, created automatically during rollout
How Canary Works
- You apply a Rollout with a new pod template (e.g., new image tag)
- The controller creates a canary ReplicaSet with the new template
- The controller scales canary pods according to the step weights
- If the Rollout references an AnalysisTemplate, the controller creates an AnalysisRun
- The AnalysisRun periodically checks metrics (Prometheus, HTTP, Job-based)
- If analysis passes, the controller proceeds to the next step
- If analysis fails, the controller aborts and scales down the canary
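The analysis steps above are driven by an analysis reference inside the canary strategy. A hedged sketch, assuming an AnalysisTemplate named error-rate and an illustrative service name:

```yaml
strategy:
  canary:
    analysis:
      templates:
      - templateName: error-rate   # AnalysisTemplate run alongside the rollout
      args:
      - name: service-name
        value: canary-app          # hypothetical value passed into the template
    steps:
    - setWeight: 10
    - pause: { duration: 5m }      # analysis failures during this window abort the rollout
```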
Service Mesh Integration
For precise traffic splitting (not just replica-based weighting), Argo Rollouts integrates with:
- Istio - uses VirtualService to split traffic by percentage
- NGINX Ingress - uses canary annotations on Ingress
- AWS ALB - uses target group weights
- SMI (Service Mesh Interface) - generic traffic split API
Without a service mesh, traffic splitting is approximated by the ratio of canary to stable pods. With 4 replicas, a requested weight of 10% rounds up to 1 canary pod out of 4, so roughly 25% of traffic reaches the canary rather than the 10% specified. A service mesh enables exact percentages.
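With Istio, for example, the Rollout references a VirtualService whose route weights the controller rewrites at each step. A sketch with illustrative Service and route names:

```yaml
strategy:
  canary:
    canaryService: app-canary      # Service selecting canary pods
    stableService: app-stable      # Service selecting stable pods
    trafficRouting:
      istio:
        virtualService:
          name: app-vsvc           # existing VirtualService the controller edits
          routes:
          - primary                # named route whose weights are updated
    steps:
    - setWeight: 10                # exactly 10%, independent of replica counts
    - pause: { duration: 5m }
```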
AnalysisTemplate Patterns
Prometheus-Based Analysis
The most common production pattern queries Prometheus for error rates:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 30s
    count: 5
    successCondition: result[0] < 0.05
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

This checks that the 5xx error rate stays below 5% over 5 measurements. If 2 measurements fail, the rollout aborts.
Job-Based Analysis
For environments without Prometheus (like this demo), a Job runs a health check script:
```yaml
provider:
  job:
    spec:
      template:
        spec:
          containers:
          - name: check
            image: curlimages/curl:latest
            command: [curl, -sf, "http://canary-service/health"]
          restartPolicy: Never
```

The Job’s exit code determines success (0) or failure (non-zero).
Web Analysis
For external monitoring services:
```yaml
provider:
  web:
    url: https://monitoring.example.com/api/v1/canary/{{args.service-name}}/health
    headers:
    - key: Authorization
      value: Bearer {{args.api-token}}
    jsonPath: "{$.healthy}"
    timeoutSeconds: 30
```

Blue-Green vs Canary
Blue-Green
- Two complete environments (blue = current, green = new)
- All traffic switches at once (0% -> 100%)
- Simpler to implement and reason about
- Requires 2x resources during deployment
- Rollback is instant (switch back to blue)
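In Argo Rollouts, blue-green is configured with two Services and an optional manual promotion gate; a sketch with assumed Service names:

```yaml
strategy:
  blueGreen:
    activeService: app-active      # receives all live traffic (blue)
    previewService: app-preview    # points at the new version (green) for pre-switch testing
    autoPromotionEnabled: false    # require manual promotion before traffic switches
```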
Canary
- Gradual traffic shift (10% -> 50% -> 100%)
- Less resource overhead (only canary pods are extra)
- More complex traffic management
- Can catch issues that only appear under real traffic
- Rollback requires scaling down canary pods
When to Use Which
- Blue-green: database migrations, breaking API changes, regulated environments requiring full validation before exposure
- Canary: stateless services, high-traffic services where gradual exposure catches edge cases, services with good observability
GitOps with Argo Rollouts
Argo Rollouts integrates naturally with Argo CD for GitOps workflows:
- Developer merges a PR that updates the image tag in the Rollout manifest
- Argo CD syncs the change to the cluster
- Argo Rollouts detects the spec change and begins the canary rollout
- AnalysisRuns verify health at each step
- If the rollout succeeds, the new version is stable
- If the rollout fails, Argo Rollouts aborts and Argo CD shows the degraded status
The key insight: Argo CD manages the desired state, Argo Rollouts manages the transition to that state.
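One way to wire this up is an Argo CD Application that tracks the manifest repository; the repo URL, path, and namespace here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: canary-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/app-manifests   # hypothetical manifest repo
    targetRevision: main
    path: k8s                                           # directory containing the Rollout
  destination:
    server: https://kubernetes.default.svc
    namespace: rollouts-demo
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # Argo CD keeps cluster state matching Git
```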
Production Considerations
Step Configuration
Design your canary steps based on your traffic volume:
- Low traffic (< 100 req/s): fewer steps, longer pauses (need time to collect meaningful metrics)
- High traffic (> 10k req/s): more steps, shorter pauses (statistically significant data arrives quickly)
Example for a high-traffic service:
```yaml
steps:
- setWeight: 1
- pause: { duration: 5m }
- setWeight: 5
- pause: { duration: 5m }
- setWeight: 25
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 100
```

Handling Stateful Rollouts
Canary is primarily designed for stateless services. For stateful services:
- Ensure backward-compatible schema changes (expand-and-contract pattern)
- Consider feature flags instead of canary deployments
- Use blue-green if the service requires exclusive access to shared state
Anti-Rollback Protection
Argo Rollouts includes anti-rollback protection by default. If you try to roll back to a previous revision that already failed analysis, the controller blocks it. Override with:
```shell
kubectl argo rollouts undo canary-app -n rollouts-demo --to-revision=2
```

Notifications
Argo Rollouts supports notifications via Argo Notifications:
- Slack messages on rollout start, promotion, abort
- Webhook triggers for external systems
- Custom templates for notification content
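Rollouts opt in to notifications via annotations, with services, triggers, and templates defined in the controller's notification ConfigMap; a hedged sketch, assuming a Slack service has already been configured there:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-app
  annotations:
    # Send abort events for this Rollout to a Slack channel. Assumes the slack
    # service and the on-rollout-aborted trigger are configured in the
    # argo-rollouts-notification-configmap.
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: deploy-alerts
```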
Comparison with Other Progressive Delivery Tools
Flagger
Section titled “Flagger”- Works with Istio, Linkerd, App Mesh, NGINX, and others
- Tighter Prometheus integration out of the box
- Wraps your existing Deployments with a Canary CRD instead of replacing them with a custom workload resource
- More opinionated about traffic management
Tekton + Canary
- Use Tekton pipelines to orchestrate canary steps
- Manual traffic management via kubectl
- More flexible but more work to set up
- Good if you already use Tekton for CI/CD
Feature Flags (LaunchDarkly, Unleash)
- Application-level, not infrastructure-level
- Toggle features per user, not per pod
- Can combine with canary: deploy canary pods with feature flag enabled
- Better for A/B testing and gradual feature exposure