
Chaos Engineering: Deep Dive

Chaos engineering is the practice of intentionally introducing failures into a system to verify that it handles them gracefully. The goal is not to break things for fun, but to find weaknesses before they cause real outages.

The discipline was formalized by Netflix (Chaos Monkey, 2011) and follows four principles:

  1. Define steady state - what does “working” look like? (e.g., all pods running, latency under 200ms)
  2. Hypothesize - “if I kill a pod, the Deployment controller will replace it within 30 seconds”
  3. Introduce chaos - run the experiment
  4. Observe - did the system behave as expected? If not, you found a weakness to fix
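The four steps map directly onto plain kubectl commands. A minimal sketch, assuming a Deployment named web with label app=web in a chaos-demo namespace (all names illustrative):

```shell
# 1. Steady state: all replicas Ready
kubectl get deploy web -n chaos-demo \
  -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'

# 2. Hypothesis: a killed pod is replaced within 30 seconds
# 3. Introduce chaos: delete one pod
POD=$(kubectl get pods -n chaos-demo -l app=web -o name | head -n1)
kubectl delete "$POD" -n chaos-demo

# 4. Observe: watch the controller restore the steady state
kubectl get pods -n chaos-demo -l app=web --watch
```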

Kubernetes controllers run continuous reconciliation loops:

Desired State (spec) ──> Compare <── Actual State (status)
                            │
                            └──> Take corrective action

When you delete a pod, the Deployment controller (via its ReplicaSet) detects that status.replicas < spec.replicas and creates a replacement. Controllers watch the API server for changes rather than polling, so they react within seconds.
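You can inspect both sides of that comparison directly. A quick check, assuming a Deployment named web in a chaos-demo namespace (names illustrative):

```shell
# The controller reconciles these two fields until they agree
kubectl get deploy web -n chaos-demo \
  -o jsonpath='desired: {.spec.replicas}, actual: {.status.readyReplicas}{"\n"}'
```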

When a pod is deleted:

  1. 0s - API server marks pod for deletion
  2. 0-2s - Deployment controller notices the discrepancy
  3. 2-5s - New pod is scheduled to a node
  4. 5-30s - Container image is pulled (if not cached)
  5. 30s+ - Readiness probe passes, pod receives traffic

With pre-pulled images, total recovery is typically under 10 seconds.
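You can roughly measure this recovery time yourself. A sketch, assuming a 3-replica Deployment named web with label app=web in a chaos-demo namespace (names and replica count illustrative; --for=jsonpath requires kubectl v1.23+):

```shell
# Delete one pod and time how long until all replicas are Ready again
START=$(date +%s)
POD=$(kubectl get pods -n chaos-demo -l app=web -o name | head -n1)
kubectl delete "$POD" -n chaos-demo --wait=false
kubectl wait deploy/web -n chaos-demo \
  --for=jsonpath='{.status.readyReplicas}'=3 --timeout=60s
echo "recovered in $(( $(date +%s) - START ))s"
```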

A PDB declares the minimum number (or percentage) of pods that must remain available during voluntary disruptions. Voluntary disruptions include:

  • kubectl drain (node maintenance)
  • kubectl delete pod (manual deletion)
  • Cluster autoscaler removing a node
  • Deployment rolling updates

Note that PDBs are enforced only through the Eviction API, which kubectl drain and the cluster autoscaler use; direct deletion (kubectl delete pod) and rolling updates bypass them. Involuntary disruptions (node crash, OOMKill, hardware failure) are NOT governed by PDBs either. The PDB cannot prevent a kernel OOMKill.
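A minimal PDB manifest for this pattern, assuming pods labeled app=web in a chaos-demo namespace (names illustrative):

```shell
# Require at least 2 app=web pods to stay up during evictions
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: chaos-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
```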

The cluster autoscaler respects PDBs when deciding whether to remove a node. If removing a node would violate a PDB (not enough pods would remain), the autoscaler skips that node. This can prevent scale-down, so set PDBs carefully:

  • minAvailable: 1 on a 2-replica deployment means the autoscaler can remove nodes freely (as long as 1 pod stays)
  • minAvailable: 100% blocks all voluntary disruptions, including node maintenance
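Before node maintenance or scale-down, you can check how much eviction headroom a PDB currently allows:

```shell
# ALLOWED DISRUPTIONS = 0 means drains and autoscaler
# scale-down are currently blocked by this PDB
kubectl get pdb -n chaos-demo
```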

During a Deployment rolling update, the pace is controlled by the Deployment's own strategy settings rather than the PDB: with maxUnavailable: 1 on a 4-replica deployment, the update proceeds one pod at a time, waiting for each new pod to become Ready before terminating the next old pod. Keep the strategy and the PDB consistent with each other, since the rolling update itself does not consult the PDB.

The kubelet uses Linux cgroups to enforce container memory limits:

  1. Container starts with a cgroup memory limit set to resources.limits.memory
  2. As the container allocates memory, the kernel tracks usage against the cgroup limit
  3. When usage hits the limit, the kernel OOM killer terminates the process
  4. The kubelet records the exit code (137) and reason (OOMKilled)
The two memory settings serve different purposes:

  • Request - used for scheduling. The scheduler only places a pod on a node with enough allocatable memory.
  • Limit - enforced at runtime via cgroups. Exceeding it triggers an OOMKill.

Best practice: Set requests close to actual usage (for accurate scheduling) and limits at 1.5-2x requests (for burst headroom). If requests equal limits for every container (both CPU and memory), the pod gets the Guaranteed QoS class.
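To see the OOMKill path end to end, you can run a container that deliberately outgrows its limit. A sketch using a shell loop that doubles a string until memory runs out (pod name and namespace illustrative):

```shell
# The loop doubles $s until the 64Mi cgroup limit is hit,
# at which point the kernel OOM killer terminates the shell
kubectl run mem-stress --image=busybox:1.36 -n chaos-demo --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"mem-stress","image":"busybox:1.36","command":["sh","-c","s=x; while true; do s=$s$s; done"],"resources":{"requests":{"memory":"32Mi"},"limits":{"memory":"64Mi"}}}]}}'

# After it dies: reason OOMKilled, exit code 137 (128 + SIGKILL)
kubectl get pod mem-stress -n chaos-demo -o \
  jsonpath='{.status.containerStatuses[0].state.terminated.reason}{"\n"}'
```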

When a node runs low on memory, the kubelet evicts pods in this order:

  1. BestEffort - no requests or limits set (evicted first)
  2. Burstable - requests set but lower than limits
  3. Guaranteed - requests equal limits (evicted last)
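The assigned QoS class is recorded on each pod's status, so you can verify where your pods fall in this ordering:

```shell
# List each pod's QoS class (Guaranteed pods are evicted last)
kubectl get pods -n chaos-demo \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```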

Introducing network delays, packet loss, or partitions requires tools that can manipulate the Linux networking stack:

  • tc (traffic control) - tc qdisc add dev eth0 root netem delay 200ms 50ms adds 200ms +/- 50ms latency
  • iptables - iptables -A OUTPUT -d 10.0.0.0/8 -j DROP drops all traffic to a subnet
  • Toxiproxy - application-level proxy that simulates network conditions

In Minikube, network chaos is limited because you would need privileged access to the node. In production, use a chaos framework.

Run a busy-loop container to exercise CPU limits:
kubectl run cpu-stress --image=busybox:1.36 -n chaos-demo \
--overrides='{"spec":{"containers":[{"name":"cpu-stress","image":"busybox:1.36","command":["sh","-c","while true; do :; done"],"resources":{"requests":{"cpu":"25m","memory":"32Mi"},"limits":{"cpu":"100m","memory":"64Mi"}}}]}}'

Unlike memory, exceeding CPU limits does not kill the container. The kernel throttles it instead (CFS bandwidth limiting). The pod stays running but performs poorly.
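Throttling is visible in the container's cgroup statistics. A quick check against the cpu-stress pod above (path assumes cgroup v2; on cgroup v1 the file is /sys/fs/cgroup/cpu/cpu.stat instead):

```shell
# nr_throttled and throttled_usec grow while the busy loop
# is being held to its 100m CPU limit
kubectl exec -n chaos-demo cpu-stress -- cat /sys/fs/cgroup/cpu.stat
```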

Fill a PVC to trigger disk pressure:

dd if=/dev/zero of=/data/fill bs=1M count=1024

If the node’s disk fills up, the kubelet starts evicting pods based on their ephemeral storage usage.
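To watch the pressure build and then undo it, assuming a pod named app-pod mounting the PVC at /data (names illustrative):

```shell
# Check how full the volume is, then delete the filler file to recover
kubectl exec -n chaos-demo app-pod -- df -h /data
kubectl exec -n chaos-demo app-pod -- rm /data/fill
```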

This demo uses plain kubectl, which is good for learning. For production chaos engineering, consider a dedicated framework:

LitmusChaos

  • Kubernetes-native chaos framework with CRDs
  • Pre-built experiments: pod-delete, container-kill, network-loss, disk-fill
  • ChaosHub with community experiments
  • Integrates with CI/CD for automated chaos testing

Chaos Mesh

  • Similar to LitmusChaos, also CRD-based
  • Strong network chaos support (delay, partition, bandwidth)
  • Dashboard for experiment management
  • Fine-grained scheduling (cron-based chaos)

Gremlin

  • Commercial platform with a free tier
  • Agent-based (no CRDs required)
  • Supports non-Kubernetes targets (VMs, containers, bare metal)
  • Built-in safety features (blast radius limits, automatic rollback)
Finally, some best practices for running chaos experiments:

  1. Start with known-good systems - run chaos on systems that should be resilient, not ones you know are fragile
  2. Minimize blast radius - target one pod or one namespace, not the whole cluster
  3. Run in staging first - validate experiments before running them in production
  4. Automate - chaos should be a regular, automated practice, not a one-time event
  5. Track results - document what each experiment found and what was fixed
  6. Game days - periodically run chaos experiments with the whole team watching, to practice incident response