# StatefulSet: Deep Dive
This document explains why StatefulSets exist, how they guarantee ordering and identity, and when to choose them over Deployments or DaemonSets. It connects the mechanics you saw in the demo to the broader Kubernetes data model.
## The Core Problem StatefulSets Solve

Deployments treat pods as interchangeable cattle. Every pod gets a random suffix. Storage is either shared or ephemeral. When a pod dies, its replacement has a new name and starts fresh.
This works for stateless HTTP servers. It breaks for anything that needs:
- Stable network identity. A database replica must know its own address and the addresses of its peers.
- Stable storage. Each replica needs its own persistent volume that survives restarts.
- Ordered startup and shutdown. The primary must start before replicas. Replicas must drain before the primary shuts down.
StatefulSets provide all three guarantees.
## Pod Identity and Naming

In the demo, the Deployment creates pods with random suffixes:

```
counter-deploy-7b8f9c-abc12
counter-deploy-7b8f9c-xyz34
```

The StatefulSet creates pods with sequential, stable ordinals:

```
counter-sts-0
counter-sts-1
```

This naming follows the pattern `<statefulset-name>-<ordinal>`. The ordinal starts at 0 and
increments. When counter-sts-0 is deleted, the replacement is also named counter-sts-0.
The identity sticks.
From the demo manifest:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  serviceName: counter-sts
  replicas: 2
  selector:
    matchLabels:
      app: counter
      variant: statefulset
```

The `serviceName` field is required. It points to a headless Service that governs DNS for the
pods. More on that below.
## Ordering Guarantees

### Startup Order

By default, StatefulSet pods start in order. Pod-0 must be Running and Ready before Pod-1 begins. Pod-1 must be Ready before Pod-2, and so on. This is the `OrderedReady` pod management policy.
This matters for databases. PostgreSQL streaming replication requires the primary (pod-0) to be available before standby replicas can connect and begin replication.
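Because readiness gates the ordinal sequence, the pod template's readiness probe effectively controls startup pacing. A hypothetical probe sketch — the container name, endpoint, port, and timings here are illustrative, not taken from the demo:

```yaml
containers:
  - name: db                    # illustrative container name
    readinessProbe:
      httpGet:
        path: /healthz          # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```

Until this probe succeeds on pod-0, the controller will not create pod-1.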
### Shutdown Order

Deletion proceeds in reverse order: the pod with the highest ordinal terminates first. This protects the primary (pod-0) from shutting down while replicas still depend on it.
### Parallel Pod Management

If ordering is not required, you can opt out:

```yaml
spec:
  podManagementPolicy: Parallel
```

With `Parallel`, all pods start and stop simultaneously. Use this for workloads like Cassandra or Elasticsearch where nodes are truly equal peers with no startup dependency.
## Headless Services and DNS

A normal ClusterIP Service assigns a virtual IP. Clients connect to the VIP and kube-proxy load-balances across pods. The client never knows which pod it hit.

A headless Service sets `clusterIP: None`. It assigns no VIP. Instead, DNS returns the IP addresses of individual pods.

From the demo:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  clusterIP: None
  selector:
    app: counter
    variant: statefulset
  ports:
    - port: 80
      targetPort: 80
```

This headless Service creates DNS records for each pod:

```
counter-sts-0.counter-sts.statefulset-demo.svc.cluster.local
counter-sts-1.counter-sts.statefulset-demo.svc.cluster.local
```

The pattern is `<pod-name>.<service-name>.<namespace>.svc.cluster.local`.
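The pattern can be expressed as a small helper, handy in init scripts that need to address peers. This is a sketch, not part of the demo:

```shell
# Build the stable per-pod DNS name from pod name, Service name, and namespace.
pod_dns() {
  echo "$1.$2.$3.svc.cluster.local"
}

pod_dns counter-sts-0 counter-sts statefulset-demo
# -> counter-sts-0.counter-sts.statefulset-demo.svc.cluster.local
```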
### Why This Matters

Database clients can connect to a specific replica by DNS name. A streaming replication configuration can hard-code that the primary lives at `db-0.db-headless.prod.svc.cluster.local`. If pod-0 restarts, its replacement gets the same DNS name; the underlying pod IP may change, but the DNS record resolves to the new address.
A query to the headless Service name itself (`counter-sts.statefulset-demo.svc.cluster.local`) returns A records for all pod IPs. This gives clients a way to discover all replicas.
### SRV Records

Kubernetes also creates SRV records for headless Services. These include port information:

```
_http._tcp.counter-sts.statefulset-demo.svc.cluster.local
```

SRV records are useful for service discovery protocols that need to know both the hostname and the port.
## volumeClaimTemplates vs Manual PVCs

### The Template Approach

The demo uses `volumeClaimTemplates` to automatically create one PVC per pod:

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 64Mi
```

When the StatefulSet creates counter-sts-0, Kubernetes also creates a PVC named data-counter-sts-0. For counter-sts-1, it creates data-counter-sts-1. The naming follows `<template-name>-<pod-name>`.
Compare this to the Deployment version, which uses `emptyDir`:

```yaml
volumes:
  - name: data
    emptyDir: {}
```

The `emptyDir` volume lives and dies with the pod. When the pod is deleted, the data is gone.
The PVC from volumeClaimTemplates persists independently of the pod lifecycle.
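For a container to use the templated claim, it mounts the volume by the template's `metadata.name`. A sketch based on the demo's `data` template — the container name here is illustrative:

```yaml
containers:
  - name: app                 # illustrative container name
    volumeMounts:
      - name: data            # matches the volumeClaimTemplate's metadata.name
        mountPath: /data      # path the demo uses for its boot counter
```

No `volumes` entry is needed for `data`; the StatefulSet wires each pod to its own PVC automatically.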
### Manual PVCs

You could also create PVCs manually and reference them in the StatefulSet spec. But this
defeats the purpose. You would need to pre-create exactly the right number of PVCs, name them
correctly, and manage their lifecycle yourself. volumeClaimTemplates automates all of this.
### PVC Lifecycle

When a StatefulSet pod is deleted, its PVC is not deleted. When the pod comes back (same ordinal), it reattaches to the same PVC. This is why the demo’s boot counter increments after pod deletion instead of resetting to 1.
When you scale down from 3 replicas to 2, the PVC for pod-2 is retained. If you scale back up, pod-2 reattaches to its original PVC with all its data intact.
### PVC Retention Policies

`persistentVolumeClaimRetentionPolicy`, introduced as alpha in Kubernetes 1.23 and enabled by default since 1.27, controls what happens to PVCs when pods are deleted or the StatefulSet is scaled down.

```yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete
```

#### whenDeleted

Controls PVC behavior when the StatefulSet itself is deleted.

- `Retain` (default): PVCs survive StatefulSet deletion. You must clean them up manually. This is the safe default for databases.
- `Delete`: PVCs are deleted when the StatefulSet is deleted.
#### whenScaled

Controls PVC behavior when the StatefulSet is scaled down.

- `Retain` (default): PVCs survive scale-down. Scaling back up reattaches them.
- `Delete`: PVCs for removed pods are deleted during scale-down. This is useful for caches or temporary workloads where you do not want orphaned PVCs accumulating.
#### Practical Example

A PostgreSQL cluster might use:

```yaml
persistentVolumeClaimRetentionPolicy:
  whenDeleted: Retain   # Never lose database data
  whenScaled: Retain    # Keep data when scaling down temporarily
```

A Redis cache cluster might use:

```yaml
persistentVolumeClaimRetentionPolicy:
  whenDeleted: Delete   # Cache data is disposable
  whenScaled: Delete    # No point keeping stale cache volumes
```

## Update Strategies

StatefulSets support two update strategies: `RollingUpdate` and `OnDelete`.
### RollingUpdate (Default)

Pods are updated in reverse ordinal order: the pod with the highest ordinal updates first, then the next highest, and so on.

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
```

The `partition` field is powerful. Only pods with an ordinal greater than or equal to the partition value are updated. For example, with `partition: 2` and replicas 0-4, only pods 2, 3, and 4 receive the update. Pods 0 and 1 keep the old version.
This enables canary deployments for stateful workloads. Set partition: 4 to update only
pod-4. Verify it works. Lower the partition to 3, then 2, and so on.
### OnDelete

Pods are only updated when you manually delete them. The controller creates the replacement with the new spec.

```yaml
spec:
  updateStrategy:
    type: OnDelete
```

This gives you full control over the update sequence. It is the safest option for critical databases where you want to manually verify each replica after an upgrade.
### maxUnavailable

Starting in Kubernetes 1.24 (as an alpha feature behind the `MaxUnavailableStatefulSet` feature gate), RollingUpdate for StatefulSets supports `maxUnavailable`:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
```

By default, `maxUnavailable` is 1, meaning pods update one at a time. Increasing this speeds
up rolling updates for large StatefulSets (like a 20-node Elasticsearch cluster).
## Init Containers in StatefulSets

The demo uses an init container for one-time setup:

```yaml
initContainers:
  - name: setup
    image: nginx:1.25.3-alpine
    command: ["/bin/sh", "/scripts/start.sh"]
    volumeMounts:
      - name: data
        mountPath: /data
      - name: html
        mountPath: /usr/share/nginx/html
      - name: scripts
        mountPath: /scripts
```

Init containers run before the main container starts. They run to completion and must exit 0 before the main container launches.
In real StatefulSet workloads, init containers commonly:
- Seed initial data. Download a database snapshot on first boot.
- Configure replication. Check the pod ordinal and set `primary` or `replica` configuration.
- Wait for dependencies. Block until the primary pod is ready before starting a replica.
The pod ordinal is available via the `hostname` command. An init container can parse this to determine its role:
```shell
ORDINAL=$(hostname | rev | cut -d'-' -f1 | rev)
if [ "$ORDINAL" = "0" ]; then
  echo "I am the primary"
else
  echo "I am replica $ORDINAL"
fi
```

## When to Use StatefulSet vs Deployment vs DaemonSet
### Use a Deployment When

- Pods are interchangeable (any pod can serve any request).
- Storage is shared or ephemeral.
- No startup ordering is needed.
- Examples: web servers, API servers, workers processing from a shared queue.
### Use a StatefulSet When

- Each pod needs a stable, unique identity.
- Each pod needs its own dedicated persistent volume.
- Startup and shutdown order matters.
- Peers need to address each other by stable DNS names.
- Examples: PostgreSQL, MySQL, MongoDB, Kafka, ZooKeeper, etcd, Redis Cluster, Elasticsearch.
### Use a DaemonSet When

- Exactly one pod must run on every node (or a subset of nodes).
- The workload is node-level infrastructure, not application-level.
- Examples: log collectors (Fluentd), monitoring agents (Prometheus node-exporter), CNI plugins (Calico), storage drivers (CSI node plugins).
### Decision Table
| Requirement | Deployment | StatefulSet | DaemonSet |
|---|---|---|---|
| Random pod names OK | Yes | No | Yes |
| Stable per-pod storage | No | Yes | No |
| Ordered startup/shutdown | No | Yes | No |
| One pod per node | No | No | Yes |
| Stable DNS per pod | No | Yes | No |
| Rolling updates with partitions | No | Yes | No |
| Scales horizontally by replica count | Yes | Yes | No |
## How the StatefulSet Controller Works Internally

The StatefulSet controller runs inside kube-controller-manager. It watches for StatefulSet objects and their owned pods.
- Observe. The controller lists all pods owned by the StatefulSet.
- Sort. Pods are sorted by ordinal.
- Reconcile. The controller walks through each ordinal from 0 to N-1.
  - If the pod does not exist, create it.
  - If the pod exists but is not Ready, wait (in `OrderedReady` mode).
  - If the pod’s spec does not match the desired spec, update it (depending on the update strategy).
- Scale down. If there are more pods than desired replicas, delete pods in reverse ordinal order.
- PVC management. For each pod, ensure the corresponding PVCs exist. If using `volumeClaimTemplates`, create PVCs as needed.
The controller re-runs this loop whenever a StatefulSet or its owned pods change. It is level-triggered, not edge-triggered. It reacts to the current state, not to individual events.
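The reconcile loop can be modeled in a few lines of shell. The `$PODS` variable and the helper functions are illustrative stand-ins for the controller's API calls, not real Kubernetes code; created pods are assumed to become Ready immediately:

```shell
# Toy model of the OrderedReady reconcile pass. $PODS stands in for cluster state.
PODS=""
pod_exists() { case " $PODS " in *" $1 "*) return 0 ;; *) return 1 ;; esac; }
pod_ready()  { pod_exists "$1"; }                   # assume created pods become Ready
create_pod() { PODS="$PODS $1"; echo "created pod-$1"; }

reconcile() {
  desired=$1
  for i in $(seq 0 $((desired - 1))); do
    pod_exists "$i" || { create_pod "$i"; return; }  # create at most one pod per pass
    pod_ready "$i" || return                          # block the next ordinal until Ready
  done
}

# Level-triggered: each pass inspects current state and advances one step.
reconcile 3
reconcile 3
reconcile 3
```

Three passes create pod-0, pod-1, and pod-2 in order, mirroring how the real controller converges on the desired state rather than reacting to individual events.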
## Common Pitfalls

### Forgetting the Headless Service

A StatefulSet without a headless Service is broken in practice. The `serviceName` field must reference a headless Service that exists; Kubernetes does not validate this at creation time. Without it, pods will not get their stable DNS names.
### PVC Storage Class

If your cluster does not have a default StorageClass, `volumeClaimTemplates` will create PVCs that stay in Pending state forever. Always verify a default StorageClass exists or specify one explicitly:

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
```

### Pod Stuck in Terminating
StatefulSet pods honor `terminationGracePeriodSeconds` strictly. If a pod’s shutdown hook takes
too long, the pod stays in Terminating state and blocks the next pod in the ordinal sequence
from starting. Set a reasonable grace period and ensure your application handles SIGTERM
promptly.
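A sketch of setting the grace period in the pod template — 30 seconds is Kubernetes’ default, and the value here is just an example to tune:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # example value; match your app's worst-case shutdown time
```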
### Scale-Down Does Not Delete PVCs (by Default)

When you scale a StatefulSet from 5 to 3, PVCs for pods 3 and 4 remain. Over time, orphaned
PVCs accumulate and consume storage. Either use `whenScaled: Delete` or build a cleanup
process.
## Real-World StatefulSet Patterns

### Leader Election with Pod Ordinals

Many operators use pod-0 as the default leader. Init containers check if their ordinal is 0 and configure accordingly. This avoids the complexity of distributed leader election for simple setups.
### Anti-Affinity for Spreading Replicas

In production, you want database replicas on different nodes:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: counter
              topologyKey: kubernetes.io/hostname
```

This prevents two pods from the same StatefulSet from landing on the same node.
### Graceful Shutdown Hooks

Stateful workloads often need a `preStop` hook to drain connections or flush writes:

```yaml
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "pg_ctl stop -m fast"]
```

This ensures the database shuts down cleanly before the container is killed.
## Connection to the Demo

The demo makes these concepts concrete:

- Identity. The Deployment’s pods get random names. The StatefulSet’s pods are always `counter-sts-0` and `counter-sts-1`.
- Storage. Deleting a Deployment pod resets the boot counter to 1. Deleting a StatefulSet pod preserves the counter because the PVC reattaches.
- DNS. The headless Service `counter-sts` creates per-pod DNS records that you can test with `wget` from a debug pod.
- Ordering. Scaling up from 2 to 4 replicas creates pods in order: pod-2 first, then pod-3. Scaling back down removes pod-3 first, then pod-2.
## Further Reading

- Kubernetes StatefulSet documentation
- StatefulSet Basics tutorial
- PVC Retention Policy KEP
- Headless Services