Progressive Delivery: Deep Dive
What is Progressive Delivery
Progressive delivery extends continuous delivery by adding gradual rollouts with automated verification. Instead of deploying a new version to all users at once, traffic shifts incrementally while health metrics are monitored. If something goes wrong, the rollout stops or rolls back automatically.
The term was coined by James Governor (RedMonk) and popularized by tools like Argo Rollouts, Flagger, and LaunchDarkly.
Why Not Just Use Deployment Rolling Updates
A standard Kubernetes Deployment rolling update replaces old pods with new ones at a controlled rate (maxSurge, maxUnavailable). But it has limitations:
- No traffic splitting - as soon as a new pod is ready, it receives equal traffic. There is no way to send 10% of traffic to the new version.
- No automated analysis - the Deployment controller checks readiness probes, but it cannot query Prometheus for error rates or latency.
- No automatic rollback on metrics - if the new version passes readiness probes but has higher error rates, the Deployment does not roll back.
- No pause between steps - the rollout proceeds as fast as pods become ready.
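For reference, the rate-control knobs mentioned above live under the Deployment's update strategy; a minimal sketch with an illustrative app name:

```yaml
# Standard Deployment rolling update: controls pod replacement rate only.
# There is no traffic weighting, metric analysis, pausing, or metric-driven rollback.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app              # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # at most 1 extra pod during the update
      maxUnavailable: 0       # never drop below the desired replica count
  selector:
    matchLabels: { app: demo-app }
  template:
    metadata:
      labels: { app: demo-app }
    spec:
      containers:
      - name: app
        image: example/app:v2
```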
Argo Rollouts solves all of these.
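A minimal Rollout sketch (names and image are illustrative) showing how the canary strategy fills those gaps, with weighted steps and pauses between them:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-app               # hypothetical name
spec:
  replicas: 4
  strategy:
    canary:
      steps:
      - setWeight: 10            # shift ~10% of traffic to the new version
      - pause: { duration: 5m }  # hold here so health can be verified
      - setWeight: 50
      - pause: { duration: 5m }  # promotion continues only if the rollout is healthy
  selector:
    matchLabels: { app: canary-app }
  template:                      # same pod template a Deployment would use
    metadata:
      labels: { app: canary-app }
    spec:
      containers:
      - name: app
        image: example/app:v2
```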
Argo Rollouts Architecture
Components
- Rollout Controller - watches Rollout CRDs and manages ReplicaSets, similar to the Deployment controller
- AnalysisRun Controller - executes analysis templates and reports results
- Rollout CRD - replaces Deployment with the same pod spec but adds strategy configuration
- AnalysisTemplate CRD - defines metrics to check during rollout
- AnalysisRun CRD - an instance of an AnalysisTemplate, created automatically during rollout
How Canary Works
- You apply a Rollout with a new pod template (e.g., new image tag)
- The controller creates a canary ReplicaSet with the new template
- The controller scales canary pods according to the step weights
- If the Rollout references an AnalysisTemplate, the controller creates an AnalysisRun
- The AnalysisRun periodically checks metrics (Prometheus, HTTP, Job-based)
- If analysis passes, the controller proceeds to the next step
- If analysis fails, the controller aborts and scales down the canary
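The analysis steps above are driven by an analysis reference inside the canary strategy. A hedged sketch, assuming an AnalysisTemplate named error-rate and an illustrative service name:

```yaml
strategy:
  canary:
    analysis:
      templates:
      - templateName: error-rate   # AnalysisTemplate run alongside the rollout
      args:
      - name: service-name
        value: canary-app          # hypothetical value passed into the template
    steps:
    - setWeight: 10
    - pause: { duration: 5m }      # analysis failures during this window abort the rollout
```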
Service Mesh Integration
For precise traffic splitting (not just replica-based weighting), Argo Rollouts integrates with:
- Istio - uses VirtualService to split traffic by percentage
- NGINX Ingress - uses canary annotations on Ingress
- AWS ALB - uses target group weights
- SMI (Service Mesh Interface) - generic traffic split API
Without a service mesh, traffic splitting is approximated by the ratio of canary to stable pods. With 4 replicas, a requested weight of 10% rounds up to 1 canary pod out of 4, so roughly 25% of traffic reaches the canary rather than the 10% specified. A service mesh enables exact percentages.
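With Istio, for example, the Rollout references a VirtualService whose route weights the controller rewrites at each step. A sketch with illustrative Service and route names:

```yaml
strategy:
  canary:
    canaryService: app-canary      # Service selecting canary pods
    stableService: app-stable      # Service selecting stable pods
    trafficRouting:
      istio:
        virtualService:
          name: app-vsvc           # existing VirtualService the controller edits
          routes:
          - primary                # named route whose weights are updated
    steps:
    - setWeight: 10                # exactly 10%, independent of replica counts
    - pause: { duration: 5m }
```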
AnalysisTemplate Patterns
Prometheus-Based Analysis
The most common production pattern queries Prometheus for error rates:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 30s
    count: 5
    successCondition: result[0] < 0.05
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

This checks that the 5xx error rate stays below 5% over 5 measurements. If 2 measurements fail, the rollout aborts.
Job-Based Analysis
For environments without Prometheus (like this demo), a Job runs a health check script:
```yaml
provider:
  job:
    spec:
      template:
        spec:
          containers:
          - name: check
            image: curlimages/curl:latest
            command: [curl, -sf, "http://canary-service/health"]
          restartPolicy: Never
```

The Job’s exit code determines success (0) or failure (non-zero).
Web Analysis
For external monitoring services:
```yaml
provider:
  web:
    url: https://monitoring.example.com/api/v1/canary/{{args.service-name}}/health
    headers:
    - key: Authorization
      value: Bearer {{args.api-token}}
    jsonPath: "{$.healthy}"
    timeoutSeconds: 30
```

Blue-Green vs Canary
Blue-Green
- Two complete environments (blue = current, green = new)
- All traffic switches at once (0% -> 100%)
- Simpler to implement and reason about
- Requires 2x resources during deployment
- Rollback is instant (switch back to blue)
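In Argo Rollouts, blue-green is configured with two Services and an optional manual promotion gate; a sketch with assumed Service names:

```yaml
strategy:
  blueGreen:
    activeService: app-active      # receives all live traffic (blue)
    previewService: app-preview    # points at the new version (green) for pre-switch testing
    autoPromotionEnabled: false    # require manual promotion before traffic switches
```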
Canary
- Gradual traffic shift (10% -> 50% -> 100%)
- Less resource overhead (only canary pods are extra)
- More complex traffic management
- Can catch issues that only appear under real traffic
- Rollback requires scaling down canary pods
When to Use Which
- Blue-green: database migrations, breaking API changes, regulated environments requiring full validation before exposure
- Canary: stateless services, high-traffic services where gradual exposure catches edge cases, services with good observability
GitOps with Argo Rollouts
Argo Rollouts integrates naturally with Argo CD for GitOps workflows:
- Developer merges a PR that updates the image tag in the Rollout manifest
- Argo CD syncs the change to the cluster
- Argo Rollouts detects the spec change and begins the canary rollout
- AnalysisRuns verify health at each step
- If the rollout succeeds, the new version is stable
- If the rollout fails, Argo Rollouts aborts and Argo CD shows the degraded status
The key insight: Argo CD manages the desired state, Argo Rollouts manages the transition to that state.
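One way to wire this up is an Argo CD Application that tracks the manifest repository; the repo URL, path, and namespace here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: canary-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/app-manifests   # hypothetical manifest repo
    targetRevision: main
    path: k8s                                           # directory containing the Rollout
  destination:
    server: https://kubernetes.default.svc
    namespace: rollouts-demo
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # Argo CD keeps cluster state matching Git
```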
Production Considerations
Step Configuration
Design your canary steps based on your traffic volume:
- Low traffic (< 100 req/s): fewer steps, longer pauses (need time to collect meaningful metrics)
- High traffic (> 10k req/s): more steps, shorter pauses (statistically significant data arrives quickly)
Example for a high-traffic service:
```yaml
steps:
- setWeight: 1
- pause: { duration: 5m }
- setWeight: 5
- pause: { duration: 5m }
- setWeight: 25
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 100
```

Handling Stateful Rollouts
Canary is primarily designed for stateless services. For stateful services:
- Ensure backward-compatible schema changes (expand-and-contract pattern)
- Consider feature flags instead of canary deployments
- Use blue-green if the service requires exclusive access to shared state
Anti-Rollback Protection
Argo Rollouts includes anti-rollback protection by default. If you try to roll back to a previous revision that already failed analysis, the controller blocks it. Override with:
```shell
kubectl argo rollouts undo canary-app -n rollouts-demo --to-revision=2
```

Notifications
Argo Rollouts supports notifications via Argo Notifications:
- Slack messages on rollout start, promotion, abort
- Webhook triggers for external systems
- Custom templates for notification content
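Rollouts opt in to notifications via annotations, with services, triggers, and templates defined in the controller's notification ConfigMap; a hedged sketch, assuming a Slack service has already been configured there:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-app
  annotations:
    # Send abort events for this Rollout to a Slack channel. Assumes the slack
    # service and the on-rollout-aborted trigger are configured in the
    # argo-rollouts-notification-configmap.
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: deploy-alerts
```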
Comparison with Other Progressive Delivery Tools
Flagger
Section titled “Flagger”- Works with Istio, Linkerd, App Mesh, NGINX, and others
- Tighter Prometheus integration out of the box
- Wraps your existing Deployments with a Canary CRD instead of replacing them with a custom workload resource
- More opinionated about traffic management
Tekton + Canary
- Use Tekton pipelines to orchestrate canary steps
- Manual traffic management via kubectl
- More flexible but more work to set up
- Good if you already use Tekton for CI/CD
Feature Flags (LaunchDarkly, Unleash)
- Application-level, not infrastructure-level
- Toggle features per user, not per pod
- Can combine with canary: deploy canary pods with feature flag enabled
- Better for A/B testing and gradual feature exposure