
StatefulSet: Deep Dive

This document explains why StatefulSets exist, how they guarantee ordering and identity, and when to choose them over Deployments or DaemonSets. It connects the mechanics you saw in the demo to the broader Kubernetes data model.


Deployments treat pods as interchangeable cattle. Every pod gets a random suffix. Storage is either shared or ephemeral. When a pod dies, its replacement has a new name and starts fresh.

This works for stateless HTTP servers. It breaks for anything that needs:

  • Stable network identity. A database replica must know its own address and the addresses of its peers.
  • Stable storage. Each replica needs its own persistent volume that survives restarts.
  • Ordered startup and shutdown. The primary must start before replicas. Replicas must drain before the primary shuts down.

StatefulSets provide all three guarantees.


In the demo, the Deployment creates pods with random suffixes:

counter-deploy-7b8f9c-abc12
counter-deploy-7b8f9c-xyz34

The StatefulSet creates pods with sequential, stable ordinals:

counter-sts-0
counter-sts-1

This naming follows the pattern <statefulset-name>-<ordinal>. The ordinal starts at 0 and increments. When counter-sts-0 is deleted, the replacement is also named counter-sts-0. The identity sticks.

From the demo manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  serviceName: counter-sts
  replicas: 2
  selector:
    matchLabels:
      app: counter
      variant: statefulset

The serviceName field is required. It points to a headless Service that governs DNS for the pods. More on that below.


By default, StatefulSet pods start in order. Pod-0 must be Running and Ready before Pod-1 begins. Pod-1 must be Ready before Pod-2, and so on. This is the OrderedReady pod management policy.

This matters for databases. PostgreSQL streaming replication requires the primary (pod-0) to be available before standby replicas can connect and begin replication.

Deletion proceeds in reverse order: the pod with the highest ordinal terminates first. This protects the primary (pod-0) from shutting down while replicas still depend on it.

If ordering is not required, you can opt out:

spec:
  podManagementPolicy: Parallel

With Parallel, all pods start and stop simultaneously. Use this for workloads like Cassandra or Elasticsearch where nodes are truly equal peers with no startup dependency.


A normal ClusterIP Service assigns a virtual IP. Clients connect to the VIP and kube-proxy load-balances across pods. The client never knows which pod it hit.

A headless Service sets clusterIP: None. It assigns no VIP. Instead, DNS returns the IP addresses of individual pods.

From the demo:

apiVersion: v1
kind: Service
metadata:
  name: counter-sts
  namespace: statefulset-demo
spec:
  clusterIP: None
  selector:
    app: counter
    variant: statefulset
  ports:
    - port: 80
      targetPort: 80

This headless Service creates DNS records for each pod:

counter-sts-0.counter-sts.statefulset-demo.svc.cluster.local
counter-sts-1.counter-sts.statefulset-demo.svc.cluster.local

The pattern is <pod-name>.<service-name>.<namespace>.svc.cluster.local.
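The FQDNs above can be derived mechanically from the pattern. A minimal shell sketch, using the names from the demo manifest (it only builds strings; nothing here talks to a cluster):

```shell
# Build the stable per-pod DNS names for the demo StatefulSet.
# sts, svc, ns, and replicas mirror the demo manifest; adjust for your own objects.
sts=counter-sts
svc=counter-sts
ns=statefulset-demo
replicas=2

i=0
while [ "$i" -lt "$replicas" ]; do
  # <pod-name>.<service-name>.<namespace>.svc.cluster.local
  echo "${sts}-${i}.${svc}.${ns}.svc.cluster.local"
  i=$((i + 1))
done
```

Because both the pod names and the Service name are stable, these strings can be computed once and baked into configuration.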

Database clients can connect to a specific replica by DNS name. A streaming replication configuration can hard-code that the primary lives at db-0.db-headless.prod.svc.cluster.local. If pod-0 restarts, it keeps the same DNS name; the pod's IP may change, but the DNS record is updated to point at the replacement.

A query to the headless Service name itself (counter-sts.statefulset-demo.svc.cluster.local) returns A records for all pod IPs. This gives clients a way to discover all replicas.

Kubernetes also creates SRV records for headless Services, one per named port. These include port information:

_http._tcp.counter-sts.statefulset-demo.svc.cluster.local

SRV records are useful for service discovery protocols that need to know both the hostname and the port.


The demo uses volumeClaimTemplates to automatically create one PVC per pod:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 64Mi

When the StatefulSet creates counter-sts-0, Kubernetes also creates a PVC named data-counter-sts-0. For counter-sts-1, it creates data-counter-sts-1. The naming follows <template-name>-<pod-name>.
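The naming rule is easy to verify mechanically. A throwaway sketch, with the template and StatefulSet names taken from the demo:

```shell
# PVC names follow <template-name>-<pod-name>.
template=data
sts=counter-sts

for ordinal in 0 1; do
  pod="${sts}-${ordinal}"
  pvc="${template}-${pod}"
  echo "$pod -> $pvc"
done
```

This determinism is what lets a replacement pod find and reattach to its old volume: the controller only has to reconstruct the same name.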

Compare this to the Deployment version, which uses emptyDir:

volumes:
  - name: data
    emptyDir: {}

The emptyDir volume lives and dies with the pod. When the pod is deleted, the data is gone. The PVC from volumeClaimTemplates persists independently of the pod lifecycle.

You could also create PVCs manually and reference them in the StatefulSet spec. But this defeats the purpose. You would need to pre-create exactly the right number of PVCs, name them correctly, and manage their lifecycle yourself. volumeClaimTemplates automates all of this.

When a StatefulSet pod is deleted, its PVC is not deleted. When the pod comes back (same ordinal), it reattaches to the same PVC. This is why the demo’s boot counter increments after pod deletion instead of resetting to 1.

When you scale down from 3 replicas to 2, the PVC for pod-2 is retained. If you scale back up, pod-2 reattaches to its original PVC with all its data intact.


Kubernetes 1.27 promoted persistentVolumeClaimRetentionPolicy to beta (on by default). It controls what happens to PVCs when pods are deleted or the StatefulSet is scaled down.

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete

whenDeleted controls PVC behavior when the StatefulSet itself is deleted.

  • Retain (default): PVCs survive StatefulSet deletion. You must clean them up manually. This is the safe default for databases.
  • Delete: PVCs are deleted when the StatefulSet is deleted.

whenScaled controls PVC behavior when the StatefulSet is scaled down.

  • Retain (default): PVCs survive scale-down. Scaling back up reattaches them.
  • Delete: PVCs for removed pods are deleted during scale-down. This is useful for caches or temporary workloads where you do not want orphaned PVCs accumulating.

A PostgreSQL cluster might use:

persistentVolumeClaimRetentionPolicy:
  whenDeleted: Retain # Never lose database data
  whenScaled: Retain # Keep data when scaling down temporarily

A Redis cache cluster might use:

persistentVolumeClaimRetentionPolicy:
  whenDeleted: Delete # Cache data is disposable
  whenScaled: Delete # No point keeping stale cache volumes

StatefulSets support two update strategies: RollingUpdate and OnDelete.

With RollingUpdate, pods are updated in reverse ordinal order: the pod with the highest ordinal updates first, then the next highest, and so on.

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0

The partition field is powerful. Only pods with an ordinal greater than or equal to the partition value are updated. For example, with partition: 2 and 5 replicas (ordinals 0-4), only pods 2, 3, and 4 receive the update. Pods 0 and 1 keep the old version.

This enables canary deployments for stateful workloads. Set partition: 4 to update only pod-4. Verify it works. Lower the partition to 3, then 2, and so on.
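The canary walk-down can be expressed as a small spec change. A sketch against the demo StatefulSet, where 2 replicas mean the highest ordinal is 1 (the values are illustrative):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1   # only counter-sts-1 receives the new pod template
```

Once the canary pod looks healthy, lower partition step by step; at partition: 0 the controller finishes the rollout across all pods.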

With OnDelete, pods are only updated when you manually delete them. The controller creates the replacement with the new spec.

spec:
  updateStrategy:
    type: OnDelete

This gives you full control over the update sequence. It is the safest option for critical databases where you want to manually verify each replica after an upgrade.

Starting in Kubernetes 1.24, RollingUpdate for StatefulSets supports maxUnavailable (alpha, behind the MaxUnavailableStatefulSet feature gate):

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2

By default, maxUnavailable is 1, meaning pods update one at a time. Increasing this speeds up rolling updates for large StatefulSets (like a 20-node Elasticsearch cluster).


The demo uses an init container for one-time setup:

initContainers:
  - name: setup
    image: nginx:1.25.3-alpine
    command: ["/bin/sh", "/scripts/start.sh"]
    volumeMounts:
      - name: data
        mountPath: /data
      - name: html
        mountPath: /usr/share/nginx/html
      - name: scripts
        mountPath: /scripts

Init containers run before the main container starts. Each must run to completion and exit 0; if a pod has several, they run one at a time, in order, and only after all succeed does the main container launch.

In real StatefulSet workloads, init containers commonly:

  • Seed initial data. Download a database snapshot on first boot.
  • Configure replication. Check the pod ordinal and set primary or replica configuration.
  • Wait for dependencies. Block until the primary pod is ready before starting a replica.

The pod ordinal is available via the hostname command. An init container can parse this to determine its role:

ORDINAL=$(hostname | rev | cut -d'-' -f1 | rev)
if [ "$ORDINAL" = "0" ]; then
echo "I am the primary"
else
echo "I am replica $ORDINAL"
fi
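On newer clusters there is a less fragile alternative to parsing the hostname. Since Kubernetes 1.28 (beta), StatefulSet pods carry an apps.kubernetes.io/pod-index label, which the downward API can expose directly. A sketch for the pod template (the variable name ORDINAL is arbitrary):

```yaml
env:
  - name: ORDINAL
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
```

This avoids string-splitting entirely and keeps working even if the StatefulSet name itself contains hyphens.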

When to Use StatefulSet vs Deployment vs DaemonSet

Use a Deployment when:

  • Pods are interchangeable (any pod can serve any request).
  • Storage is shared or ephemeral.
  • No startup ordering is needed.
  • Examples: web servers, API servers, workers processing from a shared queue.

Use a StatefulSet when:

  • Each pod needs a stable, unique identity.
  • Each pod needs its own dedicated persistent volume.
  • Startup and shutdown order matters.
  • Peers need to address each other by stable DNS names.
  • Examples: PostgreSQL, MySQL, MongoDB, Kafka, ZooKeeper, etcd, Redis Cluster, Elasticsearch.

Use a DaemonSet when:

  • Exactly one pod must run on every node (or a subset of nodes).
  • The workload is node-level infrastructure, not application-level.
  • Examples: log collectors (Fluentd), monitoring agents (Prometheus node-exporter), CNI plugins (Calico), storage drivers (CSI node plugins).
| Requirement                           | Deployment | StatefulSet | DaemonSet |
|---------------------------------------|------------|-------------|-----------|
| Random pod names OK                   | Yes        | No          | Yes       |
| Stable per-pod storage                | No         | Yes         | No        |
| Ordered startup/shutdown              | No         | Yes         | No        |
| One pod per node                      | No         | No          | Yes       |
| Stable DNS per pod                    | No         | Yes         | No        |
| Rolling updates with partitions       | No         | Yes         | No        |
| Scales horizontally by replica count  | Yes        | Yes         | No        |

How the StatefulSet Controller Works Internally


The StatefulSet controller runs inside kube-controller-manager. It watches for StatefulSet objects and their owned pods.

  1. Observe. The controller lists all pods owned by the StatefulSet.
  2. Sort. Pods are sorted by ordinal.
  3. Reconcile. The controller walks through each ordinal from 0 to N-1.
    • If the pod does not exist, create it.
    • If the pod exists but is not Ready, wait (in OrderedReady mode).
    • If the pod’s spec does not match the desired spec, update it (depending on the update strategy).
  4. Scale down. If there are more pods than desired replicas, delete pods in reverse ordinal order.
  5. PVC management. For each pod, ensure the corresponding PVCs exist. If using volumeClaimTemplates, create PVCs as needed.

The controller re-runs this loop whenever a StatefulSet or its owned pods change. It is level-triggered, not edge-triggered. It reacts to the current state, not to individual events.
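The level-triggered loop can be illustrated with a toy comparison of desired versus observed state. This is a sketch of the idea only, not the controller's actual code; the ordinals are invented for the example:

```shell
desired=3          # spec.replicas
observed="0 2"     # ordinals of pods that currently exist (pod-1 is missing)

i=0
while [ "$i" -lt "$desired" ]; do
  # Decide per ordinal from current state, not from whatever event occurred.
  case " $observed " in
    *" $i "*) echo "pod-$i: present" ;;
    *)        echo "pod-$i: create"  ;;
  esac
  i=$((i + 1))
done
```

However the cluster arrived at this state (a crash, an eviction, a missed event), the same pass produces the same decision. That is what level-triggered means.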


A StatefulSet needs its headless Service. The serviceName field must reference a headless Service that exists. Without one, the pods are still created, but they never get their stable DNS names.

If your cluster does not have a default StorageClass, volumeClaimTemplates will create PVCs that stay in Pending state forever. Always verify a default StorageClass exists or specify one explicitly:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

StatefulSet pods honor terminationGracePeriodSeconds strictly. If a pod’s shutdown hook takes too long, the pod stays in Terminating state and blocks the next pod in the ordinal sequence from starting. Set a reasonable grace period and ensure your application handles SIGTERM promptly.
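An explicit grace period is set on the pod template. A minimal sketch (the 30-second value is illustrative; size it to your application's worst-case shutdown time):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
```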

Scale-Down Does Not Delete PVCs (by Default)


When you scale a StatefulSet from 5 to 3, PVCs for pods 3 and 4 remain. Over time, orphaned PVCs accumulate and consume storage. Either use whenScaled: Delete or build a cleanup process.


Many operators use pod-0 as the default leader. Init containers check if their ordinal is 0 and configure accordingly. This avoids the complexity of distributed leader election for simple setups.

In production, you want database replicas on different nodes:

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: counter
              topologyKey: kubernetes.io/hostname

This prevents two pods from the same StatefulSet from landing on the same node.

Stateful workloads often need a preStop hook to drain connections or flush writes:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "pg_ctl stop -m fast"]

This ensures the database shuts down cleanly before the container is killed.


The demo makes these concepts concrete:

  1. Identity. The Deployment’s pods get random names. The StatefulSet’s pods are always counter-sts-0 and counter-sts-1.
  2. Storage. Deleting a Deployment pod resets the boot counter to 1. Deleting a StatefulSet pod preserves the counter because the PVC reattaches.
  3. DNS. The headless Service counter-sts creates per-pod DNS records that you can test with wget from a debug pod.
  4. Ordering. Scaling up from 2 to 4 replicas creates pods in order: pod-2 first, then pod-3. Scaling back down removes pod-3 first, then pod-2.