
PersistentVolumes & StorageClasses: Deep Dive

This document explains how the Kubernetes storage layer works, from PV/PVC binding mechanics through CSI drivers, volume snapshots, and production storage patterns. It covers the “why” behind access modes, reclaim policies, topology-aware provisioning, and the trade-offs you face when choosing storage for real workloads.

Kubernetes separates storage into three layers. Each has a distinct role.

StorageClass defines how storage is provisioned. It specifies the provisioner, parameters (IOPS, replication, filesystem type), and reclaim policy. Think of it as a storage template controlled by cluster administrators.

PersistentVolume (PV) is a piece of storage in the cluster. It exists independently of any pod. PVs can be created manually (static provisioning) or automatically by a StorageClass (dynamic provisioning).

PersistentVolumeClaim (PVC) is a request for storage by a pod. It specifies capacity, access mode, and optionally a StorageClass. Kubernetes binds the PVC to a matching PV.

```
StorageClass (how to provision)
        |
        v
PersistentVolume (a piece of storage)
        |
        v  (bound)
PersistentVolumeClaim (a request for storage)
        |
        v  (mounted)
Pod
```

This separation exists so developers request storage without knowing the backend, and administrators configure backends without knowing the applications. The PVC is the contract between the two.

When a PVC is created, the control plane looks for a PV that satisfies all requirements. Binding considers several factors.

The PV’s capacity must be greater than or equal to the PVC’s request. A PVC requesting 256Mi can bind to a 256Mi or 1Gi PV, but not a 128Mi PV.

From this demo’s dynamic PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc
  namespace: storage-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  storageClassName: standard
```

The PV must support the access mode requested by the PVC. A PV offering only ReadWriteOnce cannot satisfy a PVC requesting ReadWriteMany.

A PVC can use a label selector to target a specific PV. This is how static provisioning works in this demo:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
  labels:
    type: local
spec:
  capacity:
    storage: 128Mi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/manual-pv-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: manual-pvc
  namespace: storage-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Mi
  selector:
    matchLabels:
      type: local
```

The PVC uses selector.matchLabels to find only PVs labeled type: local. Without the selector, Kubernetes could bind to any available PV matching capacity and access mode.

If the PVC specifies a storageClassName, it only binds to PVs with the same class. Omitting storageClassName uses the cluster’s default StorageClass (if one exists). Setting storageClassName: "" explicitly opts out of dynamic provisioning.
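The opt-out case can be sketched as a PVC that will only ever bind to a pre-created, classless PV (the claim name and size here are illustrative):

```yaml
# Hypothetical PVC that opts out of dynamic provisioning.
# storageClassName: "" means only pre-created PVs with no class qualify;
# no provisioner will ever create a volume for this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-only-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```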

A PV binds to exactly one PVC. Once bound, the PV is reserved until the PVC is deleted. This is a 1:1 relationship.

Static provisioning requires pre-creating PVs. Dynamic provisioning automates this. When a PVC references a StorageClass, the provisioner creates the PV automatically.

A provisioner watches for unbound PVCs. When it sees one requesting its StorageClass, it calls the storage API to create a volume, then creates a PV in Kubernetes. Minikube’s standard class uses the k8s.io/minikube-hostpath provisioner. Production provisioners include ebs.csi.aws.com, pd.csi.storage.gke.io, and disk.csi.azure.com.

A typical StorageClass:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

The parameters field is passed directly to the provisioner. Different provisioners accept different parameters. These are storage-backend-specific, not Kubernetes-level concepts.

CSI (Container Storage Interface) is the standard plugin mechanism for storage in Kubernetes.

Before CSI, storage drivers were compiled into Kubernetes. Adding a new storage system meant modifying Kubernetes source code and upgrading the cluster. CSI defines a standard gRPC interface so storage vendors develop, release, and update drivers independently.

A CSI driver has two components. The controller plugin (Deployment or StatefulSet) handles volume lifecycle: create, delete, snapshot, expand. It talks to the storage backend’s API. The node plugin (DaemonSet on every node) handles mount, unmount, and format operations.

When a PVC triggers provisioning:

  1. Controller plugin receives a CreateVolume RPC and calls the storage API.
  2. Kubernetes creates a PV representing the new volume.
  3. When a pod is scheduled, the node plugin receives NodeStageVolume (format and stage) then NodePublishVolume (bind-mount into the pod).
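The cluster learns about an installed driver through a `CSIDriver` object. A minimal sketch (the field values are assumptions about a typical block-storage driver, not any vendor's actual manifest):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true   # controller must attach the volume before node staging
  podInfoOnMount: false  # driver does not need pod metadata at mount time
  volumeLifecycleModes:
    - Persistent
```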
| Driver | Backend | Use Case |
|---|---|---|
| ebs.csi.aws.com | AWS EBS | Block storage on AWS |
| efs.csi.aws.com | AWS EFS | Shared NFS on AWS |
| pd.csi.storage.gke.io | GCE PD | Block storage on GKE |
| disk.csi.azure.com | Azure Disk | Block storage on Azure |
| rook-ceph.csi.ceph.com | Ceph (via Rook) | Self-hosted distributed storage |

Access modes describe how a volume can be mounted by nodes. They do not enforce filesystem-level permissions. This distinction causes frequent confusion.

ReadWriteOnce (RWO): the volume can be mounted read-write by a single node. Multiple pods on the same node can all mount it; pods on different nodes cannot. This is the most common mode, mapping to block storage devices (EBS, GCE PD, Azure Disk) that attach to one instance at a time.

From this demo, the writer pod mounts a PVC read-write:

```yaml
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: dynamic-pvc
```

The reader pod mounts a different PVC with readOnly: true on the volumeMount:

```yaml
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: manual-pvc
```

The access mode on the PVC controls node-level attachment. The readOnly flag on the volumeMount controls container-level visibility. They are independent.
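The container-level side of that flag lives under `volumeMounts`. A sketch (the container name, image, and mount path are illustrative):

```yaml
containers:
  - name: reader
    image: busybox
    volumeMounts:
      - name: data
        mountPath: /data
        readOnly: true   # container-level: writes fail even though the PVC is RWO
```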

ReadOnlyMany (ROX): the volume can be mounted read-only by many nodes simultaneously. Useful for shared configuration, static assets, or pre-built datasets.

ReadWriteMany (RWX): the volume can be mounted read-write by many nodes simultaneously. Requires NFS, CephFS, Amazon EFS, or similar. Block storage does not support RWX.

RWX with concurrent writes requires the application to handle file locking. The storage system provides concurrent access, not concurrent safety. Two pods writing to the same file simultaneously will corrupt it without proper locking.

ReadWriteOncePod (RWOP), beta as of Kubernetes 1.27: only one pod cluster-wide can mount the volume read-write. Stricter than RWO (which allows multiple pods on the same node). Useful for databases that assume exclusive write access.
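Requesting this mode is just a different `accessModes` value on the claim. A sketch (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exclusive-db-pvc
spec:
  accessModes:
    - ReadWriteOncePod   # only one pod cluster-wide may mount this read-write
  resources:
    requests:
      storage: 10Gi
```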

The reclaim policy determines what happens to the PV when its PVC is deleted.

Retain: the PV keeps its data and moves to the Released state. An administrator must manually clean up. This demo’s static PV uses Retain:

```yaml
spec:
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/manual-pv-data
```

Use Retain for production databases. Deleting a PVC accidentally will not destroy your data.

Delete: the PV and underlying storage are both deleted with the PVC. This is the default for dynamically provisioned volumes. Use it for caches, temporary pipelines, and reproducible data.

Recycle (deprecated): ran rm -rf /volume/* and made the PV available for reuse. Deprecated because it was too simplistic and insecure. Use dynamic provisioning instead.

StorageClasses can allow resizing with allowVolumeExpansion: true:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true
```

To expand: edit the PVC and increase spec.resources.requests.storage. The CSI driver handles the rest. Some drivers support online expansion. Others require a pod restart.
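The expansion itself is just a spec change on the existing claim. A sketch, assuming a claim named my-data-pvc that started at 10Gi (names and sizes are illustrative):

```yaml
# Edit the existing PVC in place and raise the request.
# The CSI driver resizes the backing volume to match.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: expandable
  resources:
    requests:
      storage: 20Gi   # was 10Gi; only increases are allowed
```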

You can only expand, never shrink. This is a deliberate safety measure against data loss.

Snapshots capture volume state at a point in time. They need a CSI driver with snapshot support and a VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-ebs-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Delete
```

Create a snapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-snapshot
spec:
  volumeSnapshotClassName: csi-ebs-snapclass
  source:
    persistentVolumeClaimName: my-data-pvc
```

Restore by creating a PVC with dataSource:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  dataSource:
    name: my-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

The provisioner creates a new volume pre-populated with the snapshot’s data. This is the foundation for database backup/restore, disaster recovery, and environment cloning.
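The same dataSource mechanism can also clone directly from a live PVC, skipping the snapshot step, if the CSI driver supports volume cloning. A sketch (claim names and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  dataSource:
    name: my-data-pvc        # source PVC; must be in the same namespace
    kind: PersistentVolumeClaim
```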

In multi-zone clusters, where a volume is created matters. An EBS volume in us-east-1a cannot be attached to a node in us-east-1b.

The volumeBindingMode on StorageClass controls this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
```

Immediate (default): PV is created as soon as the PVC is created. The provisioner picks a zone. If the pod lands in a different zone, it cannot mount the volume.

WaitForFirstConsumer: PV is not created until a pod using the PVC is scheduled. The provisioner creates the volume in the pod’s zone, guaranteeing accessibility.

For multi-zone clusters, WaitForFirstConsumer should be the default. Immediate only makes sense for single-zone clusters or zone-independent backends like NFS.
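Provisioning can additionally be restricted to specific zones with allowedTopologies. A sketch (the class name and zone values are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a
          - us-east-1b
```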

hostPath mounts a host directory directly into the pod. It is simple and works everywhere. This demo uses it:

```yaml
spec:
  hostPath:
    path: /tmp/manual-pv-data
```

But it is dangerous in production. If the pod moves to a different node, it gets a different directory. There is no capacity enforcement. Security is a concern since pods can access any file on the node. Use hostPath only for single-node development like minikube.

Local volumes are similar to hostPath but managed as proper PVs. They are topology-aware (the scheduler knows which node has the storage), support capacity tracking, and work with the standard PVC lifecycle. They require a mandatory nodeAffinity:

```yaml
spec:
  local:
    path: /mnt/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1
```

Use local volumes when you need local SSD performance (databases, caches) with proper lifecycle management. The trade-off: your pod is pinned to a specific node.
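Local PVs have no dynamic provisioner, so they are typically paired with a StorageClass that defers binding until the pod is scheduled. A common sketch (the class name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner  # static PVs only; nothing is auto-created
volumeBindingMode: WaitForFirstConsumer    # lets the scheduler pick the node first
```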

For databases: use RWO or RWOP access, Retain reclaim, provisioned IOPS, WaitForFirstConsumer, and allowVolumeExpansion. StatefulSets give each replica its own PVC:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
```

Each replica gets its own PVC (pgdata-postgres-0, pgdata-postgres-1, etc.). Deleting a StatefulSet does not delete its PVCs, protecting against accidental data loss.

For shared files: use RWX access via NFS, CephFS, or EFS. Latency is higher than block storage, but this works well for media files, uploads, and static assets. All replicas mount the same PVC.
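The shared claim itself can be sketched as follows (the claim name and the EFS-backed class name are assumptions):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes:
    - ReadWriteMany   # every replica, on any node, mounts the same volume
  storageClassName: efs-sc   # assumed name for an EFS-backed StorageClass
  resources:
    requests:
      storage: 100Gi
```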

For scratch space: often emptyDir is enough. For PVC-backed ephemeral storage, use generic ephemeral volumes:

```yaml
volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi
```

Created with the pod, deleted with the pod. Useful for build caches, CI/CD temp files, and ML training scratch space.

Data that outlives pods is the core value of PersistentVolumes. This demo shows it: the writer pod appends data, gets deleted, and a new pod reads the same data from the same PVC. Data survives because it lives on the PV, not in the container filesystem.

| Workload | Access Mode | Reclaim | Binding Mode | Volume Type |
|---|---|---|---|---|
| Single-instance DB | RWO/RWOP | Retain | WaitForFirstConsumer | Block (EBS, GCE PD) |
| Replicated DB | RWO | Retain | WaitForFirstConsumer | Block |
| Shared files | RWX | Retain | N/A | NFS, EFS, CephFS |
| Build cache | RWO | Delete | Immediate | Block or ephemeral |
| ML training data | ROX | Retain | N/A | NFS, S3 via CSI |
| Temp scratch | N/A | N/A | N/A | emptyDir |