Chaos Engineering
Deploy a resilient application, then deliberately break things and watch Kubernetes recover.
Time: ~15 minutes | Difficulty: Advanced

Resources: This demo needs ~1GB RAM. Clean up other demos first:

```shell
task clean:all
```
What You Will Learn
- How Kubernetes self-heals when pods are deleted
- PodDisruptionBudgets: maintaining minimum availability during disruptions
- OOMKill: what happens when a container exceeds its memory limit
- Scale-to-zero recovery: Kubernetes rebuilds from the Deployment spec
- Why resilient design (replicas, probes, PDBs, resource limits) matters
The Scenario
A resilient nginx app with 4 replicas, health probes, resource limits, and a PodDisruptionBudget requiring at least 2 pods running at all times. We then run four chaos scenarios using nothing but kubectl.
```
┌─────────────────────────┐
│ resilient-app (4 pods)  │
│                         │
│  ┌─────┐  ┌─────┐       │
│  │pod-1│  │pod-2│       │
│  └─────┘  └─────┘       │
│  ┌─────┐  ┌─────┐       │
│  │pod-3│  │pod-4│       │
│  └─────┘  └─────┘       │
│                         │
│  PDB: minAvailable=2    │
│  Liveness + Readiness   │
│  CPU: 50m-100m          │
│  Memory: 64Mi-128Mi     │
└─────────────────────────┘
```

No external chaos tools. Just kubectl and the built-in resilience of Kubernetes.
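The PDB at the heart of this setup can be sketched as follows. This is a minimal illustration, not a copy of the demo's manifest; the real definition lives in `manifests/resilient-app.yaml`, and only `minAvailable: 2`, the namespace, and the app label are stated by this demo:

```yaml
# Sketch of a PodDisruptionBudget matching the scenario above (illustrative).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resilient-app-pdb
  namespace: chaos-demo
spec:
  minAvailable: 2        # never let voluntary disruptions drop below 2 pods
  selector:
    matchLabels:
      app: resilient-app # must match the Deployment's pod labels
```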
Deploy
Navigate to the demo directory:

```shell
cd demos/chaos-engineering
```

Step 1: Deploy the Resilient App
```shell
kubectl apply -f manifests/namespace.yaml
kubectl apply -f manifests/resilient-app.yaml
kubectl apply -f manifests/chaos-scripts.yaml
```

Wait for all 4 pods to be ready:
```shell
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

Verify the PDB:
```shell
kubectl get pdb -n chaos-demo
```

You should see `ALLOWED DISRUPTIONS: 2` (4 pods running, 2 minimum = 2 can be disrupted).
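The number in that column is simple arithmetic: currently healthy pods minus `minAvailable`. A quick local sketch, no cluster needed, using the demo's numbers:

```shell
# Allowed disruptions = healthy pods - minAvailable.
# Numbers mirror the demo: 4 healthy pods, minAvailable=2.
healthy=4
min_available=2
allowed=$((healthy - min_available))
echo "allowed disruptions: $allowed"   # prints: allowed disruptions: 2
```

As pods fail, `healthy` drops and so does the allowed-disruption budget; at `healthy=2` the budget reaches 0 and the eviction API refuses further voluntary disruptions.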
Step 2: Scenario 1 - Kill a Random Pod
Delete a pod and watch Kubernetes replace it immediately:
```shell
# Pick a pod
POD=$(kubectl get pods -n chaos-demo -l app=resilient-app -o jsonpath='{.items[0].metadata.name}')
echo "Killing: $POD"

# Delete it
kubectl delete pod "$POD" -n chaos-demo --wait=false

# Watch the replacement
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

Within seconds, a new pod appears. The Deployment controller noticed the pod count dropped below 4 and scheduled a replacement. Availability never fell below the PDB's floor: only 1 of 4 pods was gone, so at least 2 remained running.
Press Ctrl+C to stop watching.
Step 3: Scenario 2 - Trigger OOMKill
Create a pod that tries to allocate more memory than its 64Mi limit:
```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
  namespace: chaos-demo
  labels:
    chaos: oom-test
spec:
  containers:
    - name: memory-hog
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          echo "Allocating memory until OOMKilled..."
          dd if=/dev/zero of=/dev/shm/fill bs=1M count=128 2>/dev/null
          echo "Should not reach here"
      resources:
        requests:
          cpu: 25m
          memory: 32Mi
        limits:
          cpu: 50m
          memory: 64Mi
  restartPolicy: Never
EOF
```

Watch the pod get OOMKilled:
```shell
kubectl get pod memory-hog -n chaos-demo -w
```

After a few seconds, the status changes to `OOMKilled`. Check the details:
```shell
kubectl describe pod memory-hog -n chaos-demo | grep -A 5 "Last State"
kubectl describe pod memory-hog -n chaos-demo | grep "OOMKilled"
```

The container wrote 128MiB into `/dev/shm`, a memory-backed tmpfs that counts against its 64Mi limit, so the kernel killed it. Since `restartPolicy: Never`, it stays dead. In a Deployment, the pod would be replaced automatically.
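For reference, the terminated state that `grep` is matching has roughly this shape in the pod status. This fragment is illustrative, not captured from a live run; the exit code 137 corresponds to 128 + 9 (SIGKILL):

```yaml
# Illustrative fragment of `kubectl get pod memory-hog -o yaml` after the kill.
status:
  containerStatuses:
    - name: memory-hog
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137   # 128 + signal 9 (SIGKILL from the kernel OOM killer)
```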
Clean up the memory-hog:
```shell
kubectl delete pod memory-hog -n chaos-demo
```

Step 4: Scenario 3 - Scale to Zero and Back
Scale the deployment down to zero, then recover:
```shell
# Show current state
kubectl get pods -n chaos-demo -l app=resilient-app

# Scale to zero
kubectl scale deployment resilient-app --replicas=0 -n chaos-demo

# Verify all pods are gone
kubectl get pods -n chaos-demo -l app=resilient-app
# No resources found

# Scale back to 4
kubectl scale deployment resilient-app --replicas=4 -n chaos-demo

# Watch recovery
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

All 4 pods come back because the Deployment spec still exists. Kubernetes does not need the old pods to recreate the workload. The spec is the source of truth.
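That recovery is driven by a control loop: the controller repeatedly compares the desired replica count in the spec against what it observes and acts on the difference. A toy sketch of that idea in plain shell, with no cluster involved and purely illustrative names:

```shell
# Toy reconcile loop: converge the observed pod count toward the desired
# count declared in the spec. This mimics the idea, not the real controller.
desired=4
observed=0
while [ "$observed" -lt "$desired" ]; do
  observed=$((observed + 1))
  echo "scheduled replacement pod $observed"
done
echo "reconciled: observed=$observed desired=$desired"
```

The real Deployment controller runs this comparison continuously, which is why it recovers equally well from one deleted pod or from all four.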
Press Ctrl+C to stop watching.
Step 5: Scenario 4 - Delete the Deployment
Delete the entire Deployment and observe the cleanup:
```shell
# Delete the deployment
kubectl delete deployment resilient-app -n chaos-demo

# Watch pods terminate
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

All pods are terminated. The namespace stays clean. The PDB and Service remain but have no matching pods:

```shell
# PDB still exists but reports 0 pods
kubectl get pdb -n chaos-demo

# Service has no endpoints
kubectl get endpoints resilient-app -n chaos-demo
```

Re-deploy to restore:
```shell
kubectl apply -f manifests/resilient-app.yaml
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

What is Happening
```
manifests/
  namespace.yaml       # chaos-demo namespace
  resilient-app.yaml   # 4-replica Deployment + Service + PDB
  chaos-scripts.yaml   # ConfigMap with shell scripts for reference
```

Resilience layers tested:
| Layer | What It Does | Chaos Scenario |
|---|---|---|
| Deployment controller | Maintains desired pod count | Kill pod, scale to zero |
| PodDisruptionBudget | Guarantees minimum availability | Kill pod (PDB protects) |
| Resource limits | Prevents unbounded resource usage | OOMKill |
| Liveness probe | Restarts unhealthy containers | Container crash |
| Readiness probe | Removes unready pods from Service | Pod startup |
Key insight: Kubernetes does not prevent failures. It expects them and recovers automatically. Your job is to declare the desired state (replicas, probes, PDBs, limits). The control plane does the rest.
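One nuance worth knowing: a PDB only gates the Eviction API, which is what `kubectl drain` uses; plain `kubectl delete pod` skips that check entirely. A sketch of what an eviction request looks like. The pod name here is a placeholder, and the final `curl` (commented out) assumes `kubectl proxy` is running on port 8001:

```shell
# Build an Eviction request body for a pod. The PDB is consulted when this
# is POSTed to the pod's eviction subresource; direct deletes bypass it.
POD="resilient-app-xxxxx"   # placeholder; look up a real name with kubectl
BODY="{\"apiVersion\":\"policy/v1\",\"kind\":\"Eviction\",\"metadata\":{\"name\":\"$POD\",\"namespace\":\"chaos-demo\"}}"
echo "$BODY"

# On a live cluster with `kubectl proxy` listening on :8001, submit it:
# curl -X POST "http://localhost:8001/api/v1/namespaces/chaos-demo/pods/$POD/eviction" \
#   -H "Content-Type: application/json" -d "$BODY"
```

If allowed disruptions is 0, the API server answers such a request with `429 Too Many Requests` instead of evicting the pod.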
Experiment
1. Delete all the pods at once and watch the Deployment rebuild them (note: direct deletes bypass the PDB, which only gates evictions):

   ```shell
   kubectl delete pod -l app=resilient-app -n chaos-demo --wait=false
   kubectl get pods -n chaos-demo -l app=resilient-app -w
   ```

2. Watch the Deployment events during chaos:

   ```shell
   kubectl describe deployment resilient-app -n chaos-demo | grep -A 10 "Events"
   ```

3. Check the PDB status during disruption:

   ```shell
   kubectl get pdb resilient-app-pdb -n chaos-demo -o yaml | grep -A 5 status
   ```

4. Try a rolling restart (zero-downtime):

   ```shell
   kubectl rollout restart deployment resilient-app -n chaos-demo
   kubectl rollout status deployment resilient-app -n chaos-demo
   ```
Cleanup
```shell
kubectl delete namespace chaos-demo
```

Further Reading
See docs/deep-dive.md for a detailed explanation of chaos engineering principles, the relationship between PDBs and cluster autoscaling, how the kubelet enforces memory limits via cgroups, and when to graduate from kubectl chaos to tools like LitmusChaos or Chaos Mesh.
Next Step
Move on to Progressive Delivery to learn canary deployments with Argo Rollouts.