Chaos Engineering
Deploy a resilient application, then deliberately break things and watch Kubernetes recover.
Time: ~15 minutes | Difficulty: Advanced

Resources: This demo needs ~1GB RAM. Clean up other demos first:

```shell
task clean:all
```
What You Will Learn
- How Kubernetes self-heals when pods are deleted
- PodDisruptionBudgets: maintaining minimum availability during disruptions
- OOMKill: what happens when a container exceeds its memory limit
- Scale-to-zero recovery: Kubernetes rebuilds from the Deployment spec
- Why resilient design (replicas, probes, PDBs, resource limits) matters
The Scenario
A resilient nginx app with 4 replicas, health probes, resource limits, and a PodDisruptionBudget requiring at least 2 pods running at all times. We then run four chaos scenarios using nothing but kubectl.
```
┌─────────────────────────┐
│ resilient-app (4 pods)  │
│                         │
│  ┌─────┐  ┌─────┐       │
│  │pod-1│  │pod-2│       │
│  └─────┘  └─────┘       │
│  ┌─────┐  ┌─────┐       │
│  │pod-3│  │pod-4│       │
│  └─────┘  └─────┘       │
│                         │
│  PDB: minAvailable=2    │
│  Liveness + Readiness   │
│  CPU: 50m-100m          │
│  Memory: 64Mi-128Mi     │
└─────────────────────────┘
```

No external chaos tools. Just kubectl and the built-in resilience of Kubernetes.
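The PDB at the heart of this setup can be sketched as follows. This is a minimal illustration, not a copy of the demo's manifest; the real definition lives in `manifests/resilient-app.yaml`, and only `minAvailable: 2`, the namespace, and the app label are stated by this demo:

```yaml
# Sketch of a PodDisruptionBudget matching the scenario above (illustrative).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resilient-app-pdb
  namespace: chaos-demo
spec:
  minAvailable: 2        # never let voluntary disruptions drop below 2 pods
  selector:
    matchLabels:
      app: resilient-app # must match the Deployment's pod labels
```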
Deploy
Navigate to the demo directory:

```shell
cd demos/chaos-engineering
```

Step 1: Deploy the Resilient App
```shell
kubectl apply -f manifests/namespace.yaml
kubectl apply -f manifests/resilient-app.yaml
kubectl apply -f manifests/chaos-scripts.yaml
```

Wait for all 4 pods to be ready:
```shell
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

Verify the PDB:
```shell
kubectl get pdb -n chaos-demo
```

You should see `ALLOWED DISRUPTIONS: 2` (4 pods running, 2 minimum = 2 can be disrupted).
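The number in that column is simple arithmetic: currently healthy pods minus `minAvailable`. A quick local sketch, no cluster needed, using the demo's numbers:

```shell
# Allowed disruptions = healthy pods - minAvailable.
# Numbers mirror the demo: 4 healthy pods, minAvailable=2.
healthy=4
min_available=2
allowed=$((healthy - min_available))
echo "allowed disruptions: $allowed"   # prints: allowed disruptions: 2
```

As pods fail, `healthy` drops and so does the allowed-disruption budget; at `healthy=2` the budget reaches 0 and the eviction API refuses further voluntary disruptions.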
Step 2: Scenario 1 - Kill a Random Pod
Delete a pod and watch Kubernetes replace it immediately:
```shell
# Pick a pod
POD=$(kubectl get pods -n chaos-demo -l app=resilient-app -o jsonpath='{.items[0].metadata.name}')
echo "Killing: $POD"

# Delete it
kubectl delete pod "$POD" -n chaos-demo --wait=false

# Watch the replacement
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

Within seconds, a new pod appears. The Deployment controller noticed the pod count dropped below 4 and scheduled a replacement. Availability never fell below the PDB's floor: only 1 of 4 pods was gone, so at least 2 remained running.
Press Ctrl+C to stop watching.
Step 3: Scenario 2 - Trigger OOMKill
Create a pod that tries to allocate more memory than its 64Mi limit:
```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
  namespace: chaos-demo
  labels:
    chaos: oom-test
spec:
  containers:
    - name: memory-hog
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          echo "Allocating memory until OOMKilled..."
          dd if=/dev/zero of=/dev/shm/fill bs=1M count=128 2>/dev/null
          echo "Should not reach here"
      resources:
        requests:
          cpu: 25m
          memory: 32Mi
        limits:
          cpu: 50m
          memory: 64Mi
  restartPolicy: Never
EOF
```

Watch the pod get OOMKilled:
```shell
kubectl get pod memory-hog -n chaos-demo -w
```

After a few seconds, the status changes to `OOMKilled`. Check the details:
```shell
kubectl describe pod memory-hog -n chaos-demo | grep -A 5 "Last State"
kubectl describe pod memory-hog -n chaos-demo | grep "OOMKilled"
```

The container wrote 128MiB into `/dev/shm`, a memory-backed tmpfs that counts against its 64Mi limit, so the kernel killed it. Since `restartPolicy: Never`, it stays dead. In a Deployment, the pod would be replaced automatically.
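For reference, the terminated state that `grep` is matching has roughly this shape in the pod status. This fragment is illustrative, not captured from a live run; the exit code 137 corresponds to 128 + 9 (SIGKILL):

```yaml
# Illustrative fragment of `kubectl get pod memory-hog -o yaml` after the kill.
status:
  containerStatuses:
    - name: memory-hog
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137   # 128 + signal 9 (SIGKILL from the kernel OOM killer)
```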
Clean up the memory-hog:
```shell
kubectl delete pod memory-hog -n chaos-demo
```

Step 4: Scenario 3 - Scale to Zero and Back
Scale the deployment down to zero, then recover:
```shell
# Show current state
kubectl get pods -n chaos-demo -l app=resilient-app

# Scale to zero
kubectl scale deployment resilient-app --replicas=0 -n chaos-demo

# Verify all pods are gone
kubectl get pods -n chaos-demo -l app=resilient-app
# No resources found

# Scale back to 4
kubectl scale deployment resilient-app --replicas=4 -n chaos-demo

# Watch recovery
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

All 4 pods come back because the Deployment spec still exists. Kubernetes does not need the old pods to recreate the workload. The spec is the source of truth.
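That recovery is driven by a control loop: the controller repeatedly compares the desired replica count in the spec against what it observes and acts on the difference. A toy sketch of that idea in plain shell, with no cluster involved and purely illustrative names:

```shell
# Toy reconcile loop: converge the observed pod count toward the desired
# count declared in the spec. This mimics the idea, not the real controller.
desired=4
observed=0
while [ "$observed" -lt "$desired" ]; do
  observed=$((observed + 1))
  echo "scheduled replacement pod $observed"
done
echo "reconciled: observed=$observed desired=$desired"
```

The real Deployment controller runs this comparison continuously, which is why it recovers equally well from one deleted pod or from all four.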
Press Ctrl+C to stop watching.
Step 5: Scenario 4 - Delete the Deployment
Delete the entire Deployment and observe the cleanup:
```shell
# Delete the deployment
kubectl delete deployment resilient-app -n chaos-demo

# Watch pods terminate
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

All pods are terminated. The namespace stays clean. The PDB and Service remain but have no matching pods:

```shell
# PDB still exists but reports 0 pods
kubectl get pdb -n chaos-demo

# Service has no endpoints
kubectl get endpoints resilient-app -n chaos-demo
```

Re-deploy to restore:
```shell
kubectl apply -f manifests/resilient-app.yaml
kubectl get pods -n chaos-demo -l app=resilient-app -w
```

What is Happening
```
manifests/
  namespace.yaml       # chaos-demo namespace
  resilient-app.yaml   # 4-replica Deployment + Service + PDB
  chaos-scripts.yaml   # ConfigMap with shell scripts for reference
```

Resilience layers tested:
| Layer | What It Does | Chaos Scenario |
|---|---|---|
| Deployment controller | Maintains desired pod count | Kill pod, scale to zero |
| PodDisruptionBudget | Guarantees minimum availability | Kill pod (PDB protects) |
| Resource limits | Prevents unbounded resource usage | OOMKill |
| Liveness probe | Restarts unhealthy containers | Container crash |
| Readiness probe | Removes unready pods from Service | Pod startup |
Key insight: Kubernetes does not prevent failures. It expects them and recovers automatically. Your job is to declare the desired state (replicas, probes, PDBs, limits). The control plane does the rest.
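One nuance worth knowing: a PDB only gates the Eviction API, which is what `kubectl drain` uses; plain `kubectl delete pod` skips that check entirely. A sketch of what an eviction request looks like. The pod name here is a placeholder, and the final `curl` (commented out) assumes `kubectl proxy` is running on port 8001:

```shell
# Build an Eviction request body for a pod. The PDB is consulted when this
# is POSTed to the pod's eviction subresource; direct deletes bypass it.
POD="resilient-app-xxxxx"   # placeholder; look up a real name with kubectl
BODY="{\"apiVersion\":\"policy/v1\",\"kind\":\"Eviction\",\"metadata\":{\"name\":\"$POD\",\"namespace\":\"chaos-demo\"}}"
echo "$BODY"

# On a live cluster with `kubectl proxy` listening on :8001, submit it:
# curl -X POST "http://localhost:8001/api/v1/namespaces/chaos-demo/pods/$POD/eviction" \
#   -H "Content-Type: application/json" -d "$BODY"
```

If allowed disruptions is 0, the API server answers such a request with `429 Too Many Requests` instead of evicting the pod.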
Experiment
1. Delete all the pods at once and watch the Deployment rebuild them (note: direct deletes bypass the PDB, which only gates evictions):

   ```shell
   kubectl delete pod -l app=resilient-app -n chaos-demo --wait=false
   kubectl get pods -n chaos-demo -l app=resilient-app -w
   ```

2. Watch the Deployment events during chaos:

   ```shell
   kubectl describe deployment resilient-app -n chaos-demo | grep -A 10 "Events"
   ```

3. Check the PDB status during disruption:

   ```shell
   kubectl get pdb resilient-app-pdb -n chaos-demo -o yaml | grep -A 5 status
   ```

4. Try a rolling restart (zero-downtime):

   ```shell
   kubectl rollout restart deployment resilient-app -n chaos-demo
   kubectl rollout status deployment resilient-app -n chaos-demo
   ```
Cleanup
```shell
kubectl delete namespace chaos-demo
```

Further Reading
See docs/deep-dive.md for a detailed explanation of chaos engineering principles, the relationship between PDBs and cluster autoscaling, how the kubelet enforces memory limits via cgroups, and when to graduate from kubectl chaos to tools like LitmusChaos or Chaos Mesh.
Next Step
Move on to Progressive Delivery to learn canary deployments with Argo Rollouts.