
Chaos Engineering

Deploy a resilient application, then deliberately break things and watch Kubernetes recover.

Time: ~15 minutes | Difficulty: Advanced

Resources: This demo needs ~1GB RAM. Clean up other demos first: task clean:all

What you'll learn:

  • How Kubernetes self-heals when pods are deleted
  • PodDisruptionBudgets: maintaining minimum availability during disruptions
  • OOMKill: what happens when a container exceeds its memory limit
  • Scale-to-zero recovery: Kubernetes rebuilds from the Deployment spec
  • Why resilient design (replicas, probes, PDBs, resource limits) matters

A resilient nginx app with 4 replicas, health probes, resource limits, and a PodDisruptionBudget requiring at least 2 pods running at all times. We then run four chaos scenarios using nothing but kubectl.

┌─────────────────────────┐
│ resilient-app (4 pods)  │
│                         │
│  ┌─────┐    ┌─────┐     │
│  │pod-1│    │pod-2│     │
│  └─────┘    └─────┘     │
│  ┌─────┐    ┌─────┐     │
│  │pod-3│    │pod-4│     │
│  └─────┘    └─────┘     │
│                         │
│ PDB: minAvailable=2     │
│ Liveness + Readiness    │
│ CPU: 50m-100m           │
│ Memory: 64Mi-128Mi      │
└─────────────────────────┘

No external chaos tools. Just kubectl and the built-in resilience of Kubernetes.

Navigate to the demo directory:

Terminal window
cd demos/chaos-engineering
Deploy the manifests:

Terminal window
kubectl apply -f manifests/namespace.yaml
kubectl apply -f manifests/resilient-app.yaml
kubectl apply -f manifests/chaos-scripts.yaml

Wait for all 4 pods to be ready:

Terminal window
kubectl get pods -n chaos-demo -l app=resilient-app -w
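If you'd rather not watch interactively, kubectl wait can block until every labeled pod reports Ready (same namespace and label as above):

```shell
# Block until all pods matching the label pass their readiness checks,
# or fail after 120 seconds.
kubectl wait pod -l app=resilient-app -n chaos-demo \
  --for=condition=Ready --timeout=120s
```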

Verify the PDB:

Terminal window
kubectl get pdb -n chaos-demo

You should see ALLOWED DISRUPTIONS: 2 (4 healthy pods minus the minimum of 2 leaves 2 that can safely be disrupted).
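The arithmetic behind that number can be sketched in a few lines of shell; the commented kubectl line reads the live value from the PDB status (PDB name taken from this demo's manifests):

```shell
# ALLOWED DISRUPTIONS for a minAvailable PDB is recomputed from live state:
# allowed = currently healthy pods - minAvailable (never below zero)
healthy=4
min_available=2
allowed=$(( healthy - min_available ))
if (( allowed < 0 )); then allowed=0; fi
echo "allowed disruptions: $allowed"   # prints: allowed disruptions: 2

# Read the live value straight from the PDB status:
# kubectl get pdb resilient-app-pdb -n chaos-demo \
#   -o jsonpath='{.status.disruptionsAllowed}'
```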

Step 2: Scenario 1 - Kill a Pod

Delete a pod and watch Kubernetes replace it immediately:

Terminal window
# Pick a pod
POD=$(kubectl get pods -n chaos-demo -l app=resilient-app -o jsonpath='{.items[0].metadata.name}')
echo "Killing: $POD"
# Delete it
kubectl delete pod "$POD" -n chaos-demo --wait=false
# Watch the replacement
kubectl get pods -n chaos-demo -l app=resilient-app -w

Within seconds, a new pod appears. The Deployment controller noticed the pod count had dropped below 4 and scheduled a replacement. The PDB was not violated: only 1 of 4 pods was disrupted, leaving 3 running, above the required minimum of 2.
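The recovery you just watched is a control loop: observe the live pod count, compare it to the spec, create whatever is missing. A toy sketch of that loop (illustrative only; the real controller works through the API server and ReplicaSets):

```shell
# Toy model of the Deployment control loop: observe, diff, act.
desired=4
pods=(pod-a pod-b pod-c)              # one pod was just deleted
while (( ${#pods[@]} < desired )); do
  new="replacement-$(( ${#pods[@]} + 1 ))"
  pods+=("$new")                      # "schedule" a new pod
  echo "created $new"
done
echo "running ${#pods[@]}/$desired"   # prints: running 4/4
```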

Press Ctrl+C to stop watching.

Step 3: Scenario 2 - Trigger an OOMKill

Create a pod that tries to allocate more memory than its 64Mi limit:

Terminal window
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
  namespace: chaos-demo
  labels:
    chaos: oom-test
spec:
  containers:
  - name: memory-hog
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      echo "Allocating memory until OOMKilled..."
      dd if=/dev/zero of=/dev/shm/fill bs=1M count=128 2>/dev/null
      echo "Should not reach here"
    resources:
      requests:
        cpu: 25m
        memory: 32Mi
      limits:
        cpu: 50m
        memory: 64Mi
  restartPolicy: Never
EOF

Watch the pod get OOMKilled:

Terminal window
kubectl get pod memory-hog -n chaos-demo -w

After a few seconds, the status changes to OOMKilled. Check the details:

Terminal window
kubectl describe pod memory-hog -n chaos-demo | grep -A 5 "Last State"
kubectl describe pod memory-hog -n chaos-demo | grep "OOMKilled"

The container tried to write 128MiB into tmpfs (/dev/shm), which counts against the pod's memory cgroup, but its limit was 64Mi, so the kernel OOM killer terminated it. Since restartPolicy: Never, it stays dead. In a Deployment, the pod would be replaced automatically.
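The exit code you'll see in Last State is not arbitrary: 137 = 128 + 9 (SIGKILL), the same status any SIGKILLed process reports. A quick local demonstration, plus a commented kubectl line that reads the recorded code from the pod status:

```shell
# Any process killed with SIGKILL exits with 128 + 9 = 137,
# which is exactly what an OOMKilled container reports.
code=0
sh -c 'kill -9 $$' 2>/dev/null || code=$?
echo "exit code: $code"   # prints: exit code: 137

# The kubelet records the reason and exit code on the container status:
# kubectl get pod memory-hog -n chaos-demo \
#   -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
```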

Clean up the memory-hog:

Terminal window
kubectl delete pod memory-hog -n chaos-demo

Step 4: Scenario 3 - Scale to Zero and Back

Scale the deployment down to zero, then recover:

Terminal window
# Show current state
kubectl get pods -n chaos-demo -l app=resilient-app
# Scale to zero
kubectl scale deployment resilient-app --replicas=0 -n chaos-demo
# Verify all pods are gone
kubectl get pods -n chaos-demo -l app=resilient-app
# No resources found
# Scale back to 4
kubectl scale deployment resilient-app --replicas=4 -n chaos-demo
# Watch recovery
kubectl get pods -n chaos-demo -l app=resilient-app -w

All 4 pods come back because the Deployment spec still exists. Kubernetes does not need the old pods to recreate the workload. The spec is the source of truth.
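Instead of watching interactively, you can block until recovery completes; a sketch using standard kubectl subcommands:

```shell
# Returns once the rollout is complete (all 4 replicas updated and available):
kubectl rollout status deployment/resilient-app -n chaos-demo --timeout=120s

# Equivalent condition-based wait:
kubectl wait deployment/resilient-app -n chaos-demo \
  --for=condition=Available --timeout=120s
```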

Press Ctrl+C to stop watching.

Step 5: Scenario 4 - Delete the Deployment

Delete the entire Deployment and observe cleanup:

Terminal window
# Delete the deployment
kubectl delete deployment resilient-app -n chaos-demo
# Watch pods terminate
kubectl get pods -n chaos-demo -l app=resilient-app -w

All pods terminate, and their ReplicaSet is garbage-collected along with the Deployment. The PDB and Service remain but no longer match any pods.

Terminal window
# PDB still exists but reports 0 pods
kubectl get pdb -n chaos-demo
# Service has no endpoints
kubectl get endpoints resilient-app -n chaos-demo

Re-deploy to restore:

Terminal window
kubectl apply -f manifests/resilient-app.yaml
kubectl get pods -n chaos-demo -l app=resilient-app -w

Demo files:

manifests/
  namespace.yaml       # chaos-demo namespace
  resilient-app.yaml   # 4-replica Deployment + Service + PDB
  chaos-scripts.yaml   # ConfigMap with shell scripts for reference

Resilience layers tested:

Layer                  What It Does                       Chaos Scenario
Deployment controller  Maintains desired pod count        Kill pod, scale to zero
PodDisruptionBudget    Guarantees minimum availability    Kill pod (PDB protects)
Resource limits        Prevents unbounded resource usage  OOMKill
Liveness probe         Restarts unhealthy containers      Container crash
Readiness probe        Removes unready pods from Service  Pod startup

Key insight: Kubernetes does not prevent failures. It expects them and recovers automatically. Your job is to declare the desired state (replicas, probes, PDBs, limits). The control plane does the rest.
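Every resilience layer in the table above maps to fields you declare. An illustrative excerpt (the field names are standard Kubernetes API; the values echo this demo's settings, but the actual manifest may differ):

```yaml
# Illustrative excerpt, not the demo's exact manifest
spec:
  replicas: 4                    # Deployment controller target
  template:
    spec:
      containers:
      - name: app
        resources:               # enforced via cgroups; exceeding memory => OOMKill
          requests: { cpu: 50m, memory: 64Mi }
          limits:   { cpu: 100m, memory: 128Mi }
        livenessProbe:           # failing => container restarted
          httpGet: { path: /, port: 80 }
        readinessProbe:          # failing => pod removed from Service endpoints
          httpGet: { path: /, port: 80 }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resilient-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: resilient-app
```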

  1. Kill all the pods at once and watch the Deployment rebuild them (note: kubectl delete bypasses the PDB, which only guards voluntary evictions such as kubectl drain):

    Terminal window
    kubectl delete pod -l app=resilient-app -n chaos-demo --wait=false
    kubectl get pods -n chaos-demo -l app=resilient-app -w
  2. Watch the Deployment events during chaos:

    Terminal window
    kubectl describe deployment resilient-app -n chaos-demo | grep -A 10 "Events"
  3. Check the PDB status during disruption:

    Terminal window
    kubectl get pdb resilient-app-pdb -n chaos-demo -o yaml | grep -A 5 status
  4. Try a rolling restart (zero-downtime):

    Terminal window
    kubectl rollout restart deployment resilient-app -n chaos-demo
    kubectl rollout status deployment resilient-app -n chaos-demo
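To check that the rolling restart really is zero-downtime, you can probe the Service from a throwaway pod while the rollout proceeds (assumes the Service is named resilient-app and serves HTTP on port 80, per this demo's manifests):

```shell
# Run a short-lived client that requests the Service once per second;
# during a healthy rolling restart every line should read "ok".
kubectl run probe --rm -i --image=busybox:1.36 -n chaos-demo --restart=Never -- \
  sh -c 'for i in $(seq 1 30); do
           wget -q -O /dev/null http://resilient-app && echo ok || echo FAIL
           sleep 1
         done'
```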
When you're done, clean up:

Terminal window
kubectl delete namespace chaos-demo

See docs/deep-dive.md for a detailed explanation of chaos engineering principles, the relationship between PDBs and cluster autoscaling, how the kubelet enforces memory limits via cgroups, and when to graduate from kubectl chaos to tools like LitmusChaos or Chaos Mesh.

Move on to Progressive Delivery to learn canary deployments with Argo Rollouts.