
StatefulSet: Deep Dive

This document explains why StatefulSets exist, how they guarantee ordering and identity, and when to choose them over Deployments or DaemonSets. It connects the mechanics you saw in the demo to the broader Kubernetes data model.


Deployments treat pods as interchangeable cattle. Every pod gets a random suffix. Storage is either shared or ephemeral. When a pod dies, its replacement has a new name and starts fresh.

This works for stateless HTTP servers. It breaks for anything that needs:

  • Stable network identity. A database replica must know its own address and the addresses of its peers.
  • Stable storage. Each replica needs its own persistent volume that survives restarts.
  • Ordered startup and shutdown. The primary must start before replicas. Replicas must drain before the primary shuts down.

StatefulSets provide all three guarantees.


In the demo, the Deployment creates pods with random suffixes:

counter-deploy-7b8f9c-abc12
counter-deploy-7b8f9c-xyz34

The StatefulSet creates pods with sequential, stable ordinals:

counter-sts-0
counter-sts-1

This naming follows the pattern <statefulset-name>-<ordinal>. The ordinal starts at 0 and increments. When counter-sts-0 is deleted, the replacement is also named counter-sts-0. The identity sticks.

From the demo manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  serviceName: counter-sts
  replicas: 2
  selector:
    matchLabels:
      app: counter
      variant: statefulset

The serviceName field is required. It points to a headless Service that governs DNS for the pods. More on that below.


By default, StatefulSet pods start in order. Pod-0 must be Running and Ready before Pod-1 begins. Pod-1 must be Ready before Pod-2, and so on. This is the OrderedReady pod management policy.

This matters for databases. PostgreSQL streaming replication requires the primary (pod-0) to be available before standby replicas can connect and begin replication.

Deletion proceeds in reverse order: the pod with the highest ordinal terminates first. This protects the primary (pod-0) from shutting down while replicas still depend on it.

If ordering is not required, you can opt out:

spec:
  podManagementPolicy: Parallel

With Parallel, all pods start and stop simultaneously. Use this for workloads like Cassandra or Elasticsearch where nodes are truly equal peers with no startup dependency.


A normal ClusterIP Service assigns a virtual IP. Clients connect to the VIP and kube-proxy load-balances across pods. The client never knows which pod it hit.

A headless Service sets clusterIP: None. It assigns no VIP. Instead, DNS returns the IP addresses of individual pods.

From the demo:

apiVersion: v1
kind: Service
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  clusterIP: None
  selector:
    app: counter
    variant: statefulset
  ports:
    - port: 80
      targetPort: 80

This headless Service creates DNS records for each pod:

counter-sts-0.counter-sts.statefulset-demo.svc.cluster.local
counter-sts-1.counter-sts.statefulset-demo.svc.cluster.local

The pattern is <pod-name>.<service-name>.<namespace>.svc.cluster.local.
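The FQDNs above can be derived mechanically from the pattern. A minimal shell sketch, using the names from the demo manifest (it only builds strings; nothing here talks to a cluster):

```shell
# Build the stable per-pod DNS names for the demo StatefulSet.
# sts, svc, ns, and replicas mirror the demo manifest; adjust for your own objects.
sts=counter-sts
svc=counter-sts
ns=statefulset-demo
replicas=2

i=0
while [ "$i" -lt "$replicas" ]; do
  # <pod-name>.<service-name>.<namespace>.svc.cluster.local
  echo "${sts}-${i}.${svc}.${ns}.svc.cluster.local"
  i=$((i + 1))
done
```

Because both the pod names and the Service name are stable, these strings can be computed once and baked into configuration.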

Database clients can connect to a specific replica by DNS name. A streaming replication configuration can hard-code that the primary lives at db-0.db-headless.prod.svc.cluster.local. If pod-0 restarts, it keeps the same DNS name; the pod's IP may change, but the DNS record is updated to point at the replacement.

A query to the headless Service name itself (counter-sts.statefulset-demo.svc.cluster.local) returns A records for all pod IPs. This gives clients a way to discover all replicas.

Kubernetes also creates SRV records for headless Services, one per named port. These include port information:

_http._tcp.counter-sts.statefulset-demo.svc.cluster.local

SRV records are useful for service discovery protocols that need to know both the hostname and the port.


The demo uses volumeClaimTemplates to automatically create one PVC per pod:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 64Mi

When the StatefulSet creates counter-sts-0, Kubernetes also creates a PVC named data-counter-sts-0. For counter-sts-1, it creates data-counter-sts-1. The naming follows <template-name>-<pod-name>.
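The naming rule is easy to verify mechanically. A throwaway sketch, with the template and StatefulSet names taken from the demo:

```shell
# PVC names follow <template-name>-<pod-name>.
template=data
sts=counter-sts

for ordinal in 0 1; do
  pod="${sts}-${ordinal}"
  pvc="${template}-${pod}"
  echo "$pod -> $pvc"
done
```

This determinism is what lets a replacement pod find and reattach to its old volume: the controller only has to reconstruct the same name.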

Compare this to the Deployment version, which uses emptyDir:

volumes:
  - name: data
    emptyDir: {}

The emptyDir volume lives and dies with the pod. When the pod is deleted, the data is gone. The PVC from volumeClaimTemplates persists independently of the pod lifecycle.

You could also create PVCs manually and reference them in the StatefulSet spec. But this defeats the purpose. You would need to pre-create exactly the right number of PVCs, name them correctly, and manage their lifecycle yourself. volumeClaimTemplates automates all of this.

When a StatefulSet pod is deleted, its PVC is not deleted. When the pod comes back (same ordinal), it reattaches to the same PVC. This is why the demo’s boot counter increments after pod deletion instead of resetting to 1.

When you scale down from 3 replicas to 2, the PVC for pod-2 is retained. If you scale back up, pod-2 reattaches to its original PVC with all its data intact.


Kubernetes 1.27 promoted persistentVolumeClaimRetentionPolicy to beta (on by default). It controls what happens to PVCs when pods are deleted or the StatefulSet is scaled down.

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete

whenDeleted controls PVC behavior when the StatefulSet itself is deleted.

  • Retain (default): PVCs survive StatefulSet deletion. You must clean them up manually. This is the safe default for databases.
  • Delete: PVCs are deleted when the StatefulSet is deleted.

whenScaled controls PVC behavior when the StatefulSet is scaled down.

  • Retain (default): PVCs survive scale-down. Scaling back up reattaches them.
  • Delete: PVCs for removed pods are deleted during scale-down. This is useful for caches or temporary workloads where you do not want orphaned PVCs accumulating.

A PostgreSQL cluster might use:

persistentVolumeClaimRetentionPolicy:
  whenDeleted: Retain # Never lose database data
  whenScaled: Retain # Keep data when scaling down temporarily

A Redis cache cluster might use:

persistentVolumeClaimRetentionPolicy:
  whenDeleted: Delete # Cache data is disposable
  whenScaled: Delete # No point keeping stale cache volumes

StatefulSets support two update strategies: RollingUpdate and OnDelete.

With RollingUpdate, pods are updated in reverse ordinal order: the pod with the highest ordinal updates first, then the next highest, and so on.

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0

The partition field is powerful. Only pods with an ordinal greater than or equal to the partition value are updated. For example, with partition: 2 and 5 replicas (ordinals 0-4), only pods 2, 3, and 4 receive the update. Pods 0 and 1 keep the old version.

This enables canary deployments for stateful workloads. Set partition: 4 to update only pod-4. Verify it works. Lower the partition to 3, then 2, and so on.
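The canary walk-down can be expressed as a small spec change. A sketch against the demo StatefulSet, where 2 replicas mean the highest ordinal is 1 (the values are illustrative):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1   # only counter-sts-1 receives the new pod template
```

Once the canary pod looks healthy, lower partition step by step; at partition: 0 the controller finishes the rollout across all pods.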

With OnDelete, pods are only updated when you manually delete them. The controller creates the replacement with the new spec.

spec:
  updateStrategy:
    type: OnDelete

This gives you full control over the update sequence. It is the safest option for critical databases where you want to manually verify each replica after an upgrade.

Starting in Kubernetes 1.24, RollingUpdate for StatefulSets supports maxUnavailable (alpha, behind the MaxUnavailableStatefulSet feature gate):

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2

By default, maxUnavailable is 1, meaning pods update one at a time. Increasing this speeds up rolling updates for large StatefulSets (like a 20-node Elasticsearch cluster).


The demo uses an init container for one-time setup:

initContainers:
  - name: setup
    image: nginx:1.25.3-alpine
    command: ["/bin/sh", "/scripts/start.sh"]
    volumeMounts:
      - name: data
        mountPath: /data
      - name: html
        mountPath: /usr/share/nginx/html
      - name: scripts
        mountPath: /scripts

Init containers run before the main container starts. Each must run to completion and exit 0; if a pod has several, they run one at a time, in order, and only after all succeed does the main container launch.

In real StatefulSet workloads, init containers commonly:

  • Seed initial data. Download a database snapshot on first boot.
  • Configure replication. Check the pod ordinal and set primary or replica configuration.
  • Wait for dependencies. Block until the primary pod is ready before starting a replica.

The pod ordinal is available via the hostname command. An init container can parse this to determine its role:

ORDINAL=$(hostname | rev | cut -d'-' -f1 | rev)
if [ "$ORDINAL" = "0" ]; then
echo "I am the primary"
else
echo "I am replica $ORDINAL"
fi
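On newer clusters there is a less fragile alternative to parsing the hostname. Since Kubernetes 1.28 (beta), StatefulSet pods carry an apps.kubernetes.io/pod-index label, which the downward API can expose directly. A sketch for the pod template (the variable name ORDINAL is arbitrary):

```yaml
env:
  - name: ORDINAL
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
```

This avoids string-splitting entirely and keeps working even if the StatefulSet name itself contains hyphens.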

When to Use StatefulSet vs Deployment vs DaemonSet

Use a Deployment when:

  • Pods are interchangeable (any pod can serve any request).
  • Storage is shared or ephemeral.
  • No startup ordering is needed.
  • Examples: web servers, API servers, workers processing from a shared queue.

Use a StatefulSet when:

  • Each pod needs a stable, unique identity.
  • Each pod needs its own dedicated persistent volume.
  • Startup and shutdown order matters.
  • Peers need to address each other by stable DNS names.
  • Examples: PostgreSQL, MySQL, MongoDB, Kafka, ZooKeeper, etcd, Redis Cluster, Elasticsearch.

Use a DaemonSet when:

  • Exactly one pod must run on every node (or a subset of nodes).
  • The workload is node-level infrastructure, not application-level.
  • Examples: log collectors (Fluentd), monitoring agents (Prometheus node-exporter), CNI plugins (Calico), storage drivers (CSI node plugins).
| Requirement                           | Deployment | StatefulSet | DaemonSet |
|---------------------------------------|------------|-------------|-----------|
| Random pod names OK                   | Yes        | No          | Yes       |
| Stable per-pod storage                | No         | Yes         | No        |
| Ordered startup/shutdown              | No         | Yes         | No        |
| One pod per node                      | No         | No          | Yes       |
| Stable DNS per pod                    | No         | Yes         | No        |
| Rolling updates with partitions       | No         | Yes         | No        |
| Scales horizontally by replica count  | Yes        | Yes         | No        |

How the StatefulSet Controller Works Internally


The StatefulSet controller runs inside kube-controller-manager. It watches for StatefulSet objects and their owned pods.

  1. Observe. The controller lists all pods owned by the StatefulSet.
  2. Sort. Pods are sorted by ordinal.
  3. Reconcile. The controller walks through each ordinal from 0 to N-1.
    • If the pod does not exist, create it.
    • If the pod exists but is not Ready, wait (in OrderedReady mode).
    • If the pod’s spec does not match the desired spec, update it (depending on the update strategy).
  4. Scale down. If there are more pods than desired replicas, delete pods in reverse ordinal order.
  5. PVC management. For each pod, ensure the corresponding PVCs exist. If using volumeClaimTemplates, create PVCs as needed.

The controller re-runs this loop whenever a StatefulSet or its owned pods change. It is level-triggered, not edge-triggered. It reacts to the current state, not to individual events.
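The level-triggered loop can be illustrated with a toy comparison of desired versus observed state. This is a sketch of the idea only, not the controller's actual code; the ordinals are invented for the example:

```shell
desired=3          # spec.replicas
observed="0 2"     # ordinals of pods that currently exist (pod-1 is missing)

i=0
while [ "$i" -lt "$desired" ]; do
  # Decide per ordinal from current state, not from whatever event occurred.
  case " $observed " in
    *" $i "*) echo "pod-$i: present" ;;
    *)        echo "pod-$i: create"  ;;
  esac
  i=$((i + 1))
done
```

However the cluster arrived at this state (a crash, an eviction, a missed event), the same pass produces the same decision. That is what level-triggered means.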


A StatefulSet needs its headless Service. The serviceName field must reference a headless Service that exists. Without one, the pods are still created, but they never get their stable DNS names.

If your cluster does not have a default StorageClass, volumeClaimTemplates will create PVCs that stay in Pending state forever. Always verify a default StorageClass exists or specify one explicitly:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

StatefulSet pods honor terminationGracePeriodSeconds strictly. If a pod’s shutdown hook takes too long, the pod stays in Terminating state and blocks the next pod in the ordinal sequence from starting. Set a reasonable grace period and ensure your application handles SIGTERM promptly.
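An explicit grace period is set on the pod template. A minimal sketch (the 30-second value is illustrative; size it to your application's worst-case shutdown time):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
```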

Scale-Down Does Not Delete PVCs (by Default)


When you scale a StatefulSet from 5 to 3, PVCs for pods 3 and 4 remain. Over time, orphaned PVCs accumulate and consume storage. Either use whenScaled: Delete or build a cleanup process.


Many operators use pod-0 as the default leader. Init containers check if their ordinal is 0 and configure accordingly. This avoids the complexity of distributed leader election for simple setups.

In production, you want database replicas on different nodes:

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: counter
              topologyKey: kubernetes.io/hostname

This prevents two pods from the same StatefulSet from landing on the same node.

Stateful workloads often need a preStop hook to drain connections or flush writes:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "pg_ctl stop -m fast"]

This ensures the database shuts down cleanly before the container is killed.


The demo makes these concepts concrete:

  1. Identity. The Deployment’s pods get random names. The StatefulSet’s pods are always counter-sts-0 and counter-sts-1.
  2. Storage. Deleting a Deployment pod resets the boot counter to 1. Deleting a StatefulSet pod preserves the counter because the PVC reattaches.
  3. DNS. The headless Service counter-sts creates per-pod DNS records that you can test with wget from a debug pod.
  4. Ordering. Scaling up from 2 to 4 replicas creates pods in order: pod-2 first, then pod-3. Scaling back down removes pod-3 first, then pod-2.