Chaos Engineering: Deep Dive
Principles of Chaos Engineering
Chaos engineering is the practice of intentionally introducing failures into a system to verify that it handles them gracefully. The goal is not to break things for fun, but to find weaknesses before they cause real outages.
The discipline was formalized by Netflix (Chaos Monkey, 2011) and follows four principles:
- Define steady state - what does “working” look like? (e.g., all pods running, latency under 200ms)
- Hypothesize - “if I kill a pod, the Deployment controller will replace it within 30 seconds”
- Introduce chaos - run the experiment
- Observe - did the system behave as expected? If not, you found a weakness to fix
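The four steps above can be sketched as a small script. This is a minimal illustration with the cluster probe stubbed out so it runs anywhere; in a real experiment, `steady_state` would query the cluster (for example with `kubectl get pods`) and step 3 would actually delete a pod.

```sh
# Toy version of the experiment loop; the probe is a stub, not a real cluster call.
steady_state() {
  # Hypothetical probe: report how many replicas are ready.
  # Real version: kubectl get deploy <name> -o jsonpath='{.status.readyReplicas}'
  echo 3
}

expected=3
echo "steady state: $(steady_state)/$expected replicas ready"             # 1. define steady state
echo "hypothesis: after killing one pod, $expected are ready within 30s"  # 2. hypothesize
# 3. introduce chaos -- real version: kubectl delete pod <victim> -n <namespace>
after=$(steady_state)                                                     # 4. observe
if [ "$after" -eq "$expected" ]; then
  echo "hypothesis held"
else
  echo "weakness found: only $after/$expected ready"
fi
```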
How Kubernetes Self-Heals
The Control Loop
Kubernetes controllers run continuous reconciliation loops:
```
Desired State (spec) ──> Compare ──> Actual State (status)
                            │
                            └──> Take corrective action
```

When you delete a pod, the ReplicaSet controller (acting on behalf of the Deployment) detects that status.replicas < spec.replicas and creates a replacement. Controllers are driven by watch events, with periodic resyncs as a backstop, so recovery begins within seconds.
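As a rough sketch (plain shell, not real controller code), the reconcile step behaves like this:

```sh
# Toy reconciliation: converge actual state toward desired state,
# the way the ReplicaSet controller converges running pods toward spec.replicas.
desired=3   # spec.replicas
actual=2    # status.replicas after one pod was deleted

while [ "$actual" -lt "$desired" ]; do
  echo "observed $actual/$desired replicas; creating a replacement"
  actual=$((actual + 1))
done
echo "steady state: $actual/$desired"
```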
Pod Replacement Timing
When a pod is deleted:
- 0s - API server marks pod for deletion
- 0-2s - ReplicaSet controller notices the discrepancy
- 2-5s - New pod is scheduled to a node
- 5-30s - Container image is pulled (if not cached)
- 30s+ - Readiness probe passes, pod receives traffic
With pre-pulled images, total recovery is typically under 10 seconds.
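The final step is gated by the readiness probe, so its configuration directly affects recovery time. A sketch of a probe that admits the pod to Service endpoints once it passes (the path and timings are illustrative, not from this demo):

```yaml
# Illustrative readiness probe; adjust path and timings to your app.
readinessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 2
  periodSeconds: 3        # probe interval; the first success admits the pod to traffic
  failureThreshold: 3
```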
PodDisruptionBudgets in Detail
A PDB declares the minimum number (or percentage) of pods that must remain available during voluntary disruptions. Voluntary disruptions include:
- `kubectl drain` (node maintenance)
- `kubectl delete pod` (manual deletion)
- Cluster autoscaler removing a node
- Deployment rolling updates
Involuntary disruptions (node crash, OOMKill, hardware failure) are NOT governed by PDBs. The PDB cannot prevent a kernel OOMKill.
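A minimal PDB manifest, assuming a hypothetical `app: web` label on the demo pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: chaos-demo
spec:
  minAvailable: 1      # or use maxUnavailable; percentages also work, e.g. "50%"
  selector:
    matchLabels:
      app: web         # hypothetical label; must match the pods you want protected
```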
PDB and Cluster Autoscaler
The cluster autoscaler respects PDBs when deciding whether to remove a node. If removing a node would violate a PDB (not enough pods would remain), the autoscaler skips that node. This can prevent scale-down, so set PDBs carefully:
- `minAvailable: 1` on a 2-replica deployment means the autoscaler can remove nodes freely (as long as 1 pod stays)
- `minAvailable: 100%` blocks all voluntary disruptions, including node maintenance
PDB During Rolling Updates
During a Deployment rolling update, Kubernetes respects the PDB. If the Deployment sets `maxUnavailable: 1` and a PDB requires `minAvailable: 3` on a 4-replica deployment, the update proceeds one pod at a time, waiting for each new pod to become ready before terminating the next old pod.
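The rolling-update pacing itself comes from the Deployment's strategy; a fragment like the following (values illustrative) produces the one-at-a-time behavior described:

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down at a time
      maxSurge: 1         # at most one extra pod during the update
```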
Memory Limits and OOMKill
How the Kubelet Enforces Memory Limits
The kubelet uses Linux cgroups to enforce container memory limits:
- Container starts with a cgroup memory limit set to `resources.limits.memory`
- As the container allocates memory, the kernel tracks usage against the cgroup limit
- When usage hits the limit, the kernel OOM killer terminates the process
- The kubelet records the exit code (137) and reason (OOMKilled)
Memory Request vs Limit
- Request - used for scheduling. The scheduler only places a pod on a node with enough allocatable memory.
- Limit - enforced at runtime via cgroups. Exceeding it triggers OOMKill.
Best practice: Set requests close to actual usage (for accurate scheduling) and limits at 1.5-2x requests (for burst headroom). If requests equal limits for every resource in every container, the pod gets the Guaranteed QoS class.
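In a container spec this looks like the following (numbers illustrative, following the 1.5-2x guideline):

```yaml
resources:
  requests:
    memory: "128Mi"   # close to observed usage; used by the scheduler
    cpu: "100m"
  limits:
    memory: "256Mi"   # 2x the request for burst headroom; OOMKill above this
    cpu: "200m"
```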
QoS Classes and Eviction Order
When a node runs low on memory, the kubelet evicts pods in this order:
- BestEffort - no requests or limits set (evicted first)
- Burstable - requests set but lower than limits
- Guaranteed - requests equal limits (evicted last)
Chaos Scenarios Beyond This Demo
Network Chaos
Introducing network delays, packet loss, or partitions requires tools that can manipulate the Linux networking stack:
- tc (traffic control) - `tc qdisc add dev eth0 root netem delay 200ms 50ms` adds 200ms +/- 50ms latency
- iptables - `iptables -A OUTPUT -d 10.0.0.0/8 -j DROP` drops all traffic to a subnet
- Toxiproxy - application-level proxy that simulates network conditions
In Minikube, network chaos is limited because you would need privileged access to the node. In production, use a chaos framework.
CPU Stress
```sh
kubectl run cpu-stress --image=busybox:1.36 -n chaos-demo \
  --overrides='{"spec":{"containers":[{"name":"cpu-stress","image":"busybox:1.36","command":["sh","-c","while true; do :; done"],"resources":{"requests":{"cpu":"25m","memory":"32Mi"},"limits":{"cpu":"100m","memory":"64Mi"}}}]}}'
```

Unlike memory, exceeding the CPU limit does not kill the container. The kernel throttles it instead (CFS bandwidth limiting). The pod stays running but performs poorly.
Disk Pressure
Fill a PVC to trigger disk pressure:
```sh
dd if=/dev/zero of=/data/fill bs=1M count=1024
```

If the node's disk fills up, the kubelet starts evicting pods based on their ephemeral storage usage.
Graduating to Chaos Frameworks
This demo uses plain kubectl, which is good for learning. For production chaos engineering, consider:
LitmusChaos
- Kubernetes-native chaos framework with CRDs
- Pre-built experiments: pod-delete, container-kill, network-loss, disk-fill
- ChaosHub with community experiments
- Integrates with CI/CD for automated chaos testing
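In LitmusChaos, a pod-delete experiment is declared through a ChaosEngine resource. A sketch along these lines (names and labels are illustrative; check the ChaosHub documentation for the exact schema of your version):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-pod-delete
  namespace: chaos-demo
spec:
  engineState: active
  appinfo:
    appns: chaos-demo
    applabel: app=web                  # hypothetical target label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # service account with the experiment's RBAC
  experiments:
    - name: pod-delete
```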
Chaos Mesh
- Similar to LitmusChaos, also CRD-based
- Strong network chaos support (delay, partition, bandwidth)
- Dashboard for experiment management
- Fine-grained scheduling (cron-based chaos)
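In Chaos Mesh, the equivalent pod-kill is a PodChaos resource; roughly (selector values illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-kill
  namespace: chaos-demo
spec:
  action: pod-kill
  mode: one            # kill a single randomly chosen matching pod
  selector:
    namespaces:
      - chaos-demo
    labelSelectors:
      app: web         # hypothetical target label
```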
Gremlin
- Commercial platform with a free tier
- Agent-based (no CRDs required)
- Supports non-Kubernetes targets (VMs, containers, bare metal)
- Built-in safety features (blast radius limits, automatic rollback)
Building a Chaos Practice
- Start with known-good systems - run chaos on systems that should be resilient, not ones you know are fragile
- Minimize blast radius - target one pod or one namespace, not the whole cluster
- Run in staging first - validate experiments before running them in production
- Automate - chaos should be a regular, automated practice, not a one-time event
- Track results - document what each experiment found and what was fixed
- Game days - periodically run chaos experiments with the whole team watching, to practice incident response