Chaos Engineering: Deep Dive
Principles of Chaos Engineering
Chaos engineering is the practice of intentionally introducing failures into a system to verify that it handles them gracefully. The goal is not to break things for fun, but to find weaknesses before they cause real outages.
The discipline was formalized by Netflix (Chaos Monkey, 2011) and follows four principles:
- Define steady state - what does “working” look like? (e.g., all pods running, latency under 200ms)
- Hypothesize - “if I kill a pod, the Deployment controller will replace it within 30 seconds”
- Introduce chaos - run the experiment
- Observe - did the system behave as expected? If not, you found a weakness to fix
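The four steps above can be sketched as a small script. This is a minimal illustration with the cluster probe stubbed out so it runs anywhere; in a real experiment, `steady_state` would query the cluster (for example with `kubectl get pods`) and step 3 would actually delete a pod.

```sh
# Toy version of the experiment loop; the probe is a stub, not a real cluster call.
steady_state() {
  # Hypothetical probe: report how many replicas are ready.
  # Real version: kubectl get deploy <name> -o jsonpath='{.status.readyReplicas}'
  echo 3
}

expected=3
echo "steady state: $(steady_state)/$expected replicas ready"             # 1. define steady state
echo "hypothesis: after killing one pod, $expected are ready within 30s"  # 2. hypothesize
# 3. introduce chaos -- real version: kubectl delete pod <victim> -n <namespace>
after=$(steady_state)                                                     # 4. observe
if [ "$after" -eq "$expected" ]; then
  echo "hypothesis held"
else
  echo "weakness found: only $after/$expected ready"
fi
```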
How Kubernetes Self-Heals
The Control Loop
Kubernetes controllers run continuous reconciliation loops:
```
Desired State (spec) ──> Compare ──> Actual State (status)
                            │
                            └──> Take corrective action
```

When you delete a pod, the ReplicaSet controller (acting on behalf of the Deployment) detects that status.replicas < spec.replicas and creates a replacement. Controllers are driven by watch events, with periodic resyncs as a backstop, so recovery begins within seconds.
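As a rough sketch (plain shell, not real controller code), the reconcile step behaves like this:

```sh
# Toy reconciliation: converge actual state toward desired state,
# the way the ReplicaSet controller converges running pods toward spec.replicas.
desired=3   # spec.replicas
actual=2    # status.replicas after one pod was deleted

while [ "$actual" -lt "$desired" ]; do
  echo "observed $actual/$desired replicas; creating a replacement"
  actual=$((actual + 1))
done
echo "steady state: $actual/$desired"
```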
Pod Replacement Timing
When a pod is deleted:
- 0s - API server marks pod for deletion
- 0-2s - ReplicaSet controller notices the discrepancy
- 2-5s - New pod is scheduled to a node
- 5-30s - Container image is pulled (if not cached)
- 30s+ - Readiness probe passes, pod receives traffic
With pre-pulled images, total recovery is typically under 10 seconds.
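The final step is gated by the readiness probe, so its configuration directly affects recovery time. A sketch of a probe that admits the pod to Service endpoints once it passes (the path and timings are illustrative, not from this demo):

```yaml
# Illustrative readiness probe; adjust path and timings to your app.
readinessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 2
  periodSeconds: 3        # probe interval; the first success admits the pod to traffic
  failureThreshold: 3
```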
PodDisruptionBudgets in Detail
A PDB declares the minimum number (or percentage) of pods that must remain available during voluntary disruptions. Voluntary disruptions include:
- `kubectl drain` (node maintenance)
- `kubectl delete pod` (manual deletion)
- Cluster autoscaler removing a node
- Deployment rolling updates
Involuntary disruptions (node crash, OOMKill, hardware failure) are NOT governed by PDBs. The PDB cannot prevent a kernel OOMKill.
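A minimal PDB manifest, assuming a hypothetical `app: web` label on the demo pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: chaos-demo
spec:
  minAvailable: 1      # or use maxUnavailable; percentages also work, e.g. "50%"
  selector:
    matchLabels:
      app: web         # hypothetical label; must match the pods you want protected
```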
PDB and Cluster Autoscaler
The cluster autoscaler respects PDBs when deciding whether to remove a node. If removing a node would violate a PDB (not enough pods would remain), the autoscaler skips that node. This can prevent scale-down, so set PDBs carefully:
- `minAvailable: 1` on a 2-replica deployment means the autoscaler can remove nodes freely (as long as 1 pod stays)
- `minAvailable: 100%` blocks all voluntary disruptions, including node maintenance
PDB During Rolling Updates
During a Deployment rolling update, Kubernetes respects the PDB. If the Deployment sets `maxUnavailable: 1` and a PDB requires `minAvailable: 3` on a 4-replica deployment, the update proceeds one pod at a time, waiting for each new pod to become ready before terminating the next old pod.
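The rolling-update pacing itself comes from the Deployment's strategy; a fragment like the following (values illustrative) produces the one-at-a-time behavior described:

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down at a time
      maxSurge: 1         # at most one extra pod during the update
```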
Memory Limits and OOMKill
How the Kubelet Enforces Memory Limits
The kubelet uses Linux cgroups to enforce container memory limits:
- Container starts with a cgroup memory limit set to `resources.limits.memory`
- As the container allocates memory, the kernel tracks usage against the cgroup limit
- When usage hits the limit, the kernel OOM killer terminates the process
- The kubelet records the exit code (137) and reason (OOMKilled)
Memory Request vs Limit
- Request - used for scheduling. The scheduler only places a pod on a node with enough allocatable memory.
- Limit - enforced at runtime via cgroups. Exceeding it triggers OOMKill.
Best practice: Set requests close to actual usage (for accurate scheduling) and limits at 1.5-2x requests (for burst headroom). If requests equal limits for every resource in every container, the pod gets the Guaranteed QoS class.
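In a container spec this looks like the following (numbers illustrative, following the 1.5-2x guideline):

```yaml
resources:
  requests:
    memory: "128Mi"   # close to observed usage; used by the scheduler
    cpu: "100m"
  limits:
    memory: "256Mi"   # 2x the request for burst headroom; OOMKill above this
    cpu: "200m"
```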
QoS Classes and Eviction Order
When a node runs low on memory, the kubelet evicts pods in this order:
- BestEffort - no requests or limits set (evicted first)
- Burstable - requests set but lower than limits
- Guaranteed - requests equal limits (evicted last)
Chaos Scenarios Beyond This Demo
Network Chaos
Introducing network delays, packet loss, or partitions requires tools that can manipulate the Linux networking stack:
- tc (traffic control) - `tc qdisc add dev eth0 root netem delay 200ms 50ms` adds 200ms +/- 50ms latency
- iptables - `iptables -A OUTPUT -d 10.0.0.0/8 -j DROP` drops all traffic to a subnet
- Toxiproxy - application-level proxy that simulates network conditions
In Minikube, network chaos is limited because you would need privileged access to the node. In production, use a chaos framework.
CPU Stress
```sh
kubectl run cpu-stress --image=busybox:1.36 -n chaos-demo \
  --overrides='{"spec":{"containers":[{"name":"cpu-stress","image":"busybox:1.36","command":["sh","-c","while true; do :; done"],"resources":{"requests":{"cpu":"25m","memory":"32Mi"},"limits":{"cpu":"100m","memory":"64Mi"}}}]}}'
```

Unlike memory, exceeding the CPU limit does not kill the container. The kernel throttles it instead (CFS bandwidth limiting). The pod stays running but performs poorly.
Disk Pressure
Fill a PVC to trigger disk pressure:
```sh
dd if=/dev/zero of=/data/fill bs=1M count=1024
```

If the node's disk fills up, the kubelet starts evicting pods based on their ephemeral storage usage.
Graduating to Chaos Frameworks
This demo uses plain kubectl, which is good for learning. For production chaos engineering, consider:
LitmusChaos
- Kubernetes-native chaos framework with CRDs
- Pre-built experiments: pod-delete, container-kill, network-loss, disk-fill
- ChaosHub with community experiments
- Integrates with CI/CD for automated chaos testing
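In LitmusChaos, a pod-delete experiment is declared through a ChaosEngine resource. A sketch along these lines (names and labels are illustrative; check the ChaosHub documentation for the exact schema of your version):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-pod-delete
  namespace: chaos-demo
spec:
  engineState: active
  appinfo:
    appns: chaos-demo
    applabel: app=web                  # hypothetical target label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # service account with the experiment's RBAC
  experiments:
    - name: pod-delete
```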
Chaos Mesh
- Similar to LitmusChaos, also CRD-based
- Strong network chaos support (delay, partition, bandwidth)
- Dashboard for experiment management
- Fine-grained scheduling (cron-based chaos)
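In Chaos Mesh, the equivalent pod-kill is a PodChaos resource; roughly (selector values illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-kill
  namespace: chaos-demo
spec:
  action: pod-kill
  mode: one            # kill a single randomly chosen matching pod
  selector:
    namespaces:
      - chaos-demo
    labelSelectors:
      app: web         # hypothetical target label
```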
Gremlin
- Commercial platform with a free tier
- Agent-based (no CRDs required)
- Supports non-Kubernetes targets (VMs, containers, bare metal)
- Built-in safety features (blast radius limits, automatic rollback)
Building a Chaos Practice
- Start with known-good systems - run chaos on systems that should be resilient, not ones you know are fragile
- Minimize blast radius - target one pod or one namespace, not the whole cluster
- Run in staging first - validate experiments before running them in production
- Automate - chaos should be a regular, automated practice, not a one-time event
- Track results - document what each experiment found and what was fixed
- Game days - periodically run chaos experiments with the whole team watching, to practice incident response