
PersistentVolumes & StorageClasses: Deep Dive

This document explains how the Kubernetes storage layer works, from PV/PVC binding mechanics through CSI drivers, volume snapshots, and production storage patterns. It covers the “why” behind access modes, reclaim policies, topology-aware provisioning, and the trade-offs you face when choosing storage for real workloads.

Kubernetes separates storage into three layers. Each has a distinct role.

StorageClass defines how storage is provisioned. It specifies the provisioner, parameters (IOPS, replication, filesystem type), and reclaim policy. Think of it as a storage template controlled by cluster administrators.

PersistentVolume (PV) is a piece of storage in the cluster. It exists independently of any pod. PVs can be created manually (static provisioning) or automatically by a StorageClass (dynamic provisioning).

PersistentVolumeClaim (PVC) is a request for storage by a pod. It specifies capacity, access mode, and optionally a StorageClass. Kubernetes binds the PVC to a matching PV.

```
StorageClass (how to provision)
        |
        v
PersistentVolume (a piece of storage)
        |
        v  (bound)
PersistentVolumeClaim (a request for storage)
        |
        v  (mounted)
Pod
```

This separation exists so developers request storage without knowing the backend, and administrators configure backends without knowing the applications. The PVC is the contract between the two.

When a PVC is created, the control plane looks for a PV that satisfies all requirements. Binding considers several factors.

The PV’s capacity must be greater than or equal to the PVC’s request. A PVC requesting 256Mi can bind to a 256Mi or 1Gi PV, but not a 128Mi PV.

From this demo’s dynamic PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc
  namespace: storage-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  storageClassName: standard
```

The PV must support the access mode requested by the PVC. A PV offering only ReadWriteOnce cannot satisfy a PVC requesting ReadWriteMany.

A PVC can use a label selector to target a specific PV. This is how static provisioning works in this demo:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
  labels:
    type: local
spec:
  capacity:
    storage: 128Mi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/manual-pv-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: manual-pvc
  namespace: storage-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Mi
  selector:
    matchLabels:
      type: local
```

The PVC uses selector.matchLabels to find only PVs labeled type: local. Without the selector, Kubernetes could bind to any available PV matching capacity and access mode.

If the PVC specifies a storageClassName, it only binds to PVs with the same class. Omitting storageClassName uses the cluster’s default StorageClass (if one exists). Setting storageClassName: "" explicitly opts out of dynamic provisioning.
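The opt-out case can be sketched as a PVC that will only ever bind to a pre-created, classless PV (the claim name and size here are illustrative):

```yaml
# Hypothetical PVC that opts out of dynamic provisioning.
# storageClassName: "" means only pre-created PVs with no class qualify;
# no provisioner will ever create a volume for this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-only-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```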

A PV binds to exactly one PVC. Once bound, the PV is reserved until the PVC is deleted. This is a 1:1 relationship.

Static provisioning requires pre-creating PVs. Dynamic provisioning automates this. When a PVC references a StorageClass, the provisioner creates the PV automatically.

A provisioner watches for unbound PVCs. When it sees one requesting its StorageClass, it calls the storage API to create a volume, then creates a PV in Kubernetes. Minikube’s standard class uses the k8s.io/minikube-hostpath provisioner. Production provisioners include ebs.csi.aws.com, pd.csi.storage.gke.io, and disk.csi.azure.com.

A typical StorageClass:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

The parameters field is passed directly to the provisioner. Different provisioners accept different parameters. These are storage-backend-specific, not Kubernetes-level concepts.

CSI (Container Storage Interface) is the standard plugin mechanism for storage in Kubernetes.

Before CSI, storage drivers were compiled into Kubernetes. Adding a new storage system meant modifying Kubernetes source code and upgrading the cluster. CSI defines a standard gRPC interface so storage vendors develop, release, and update drivers independently.

A CSI driver has two components. The controller plugin (Deployment or StatefulSet) handles volume lifecycle: create, delete, snapshot, expand. It talks to the storage backend’s API. The node plugin (DaemonSet on every node) handles mount, unmount, and format operations.

When a PVC triggers provisioning:

  1. Controller plugin receives a CreateVolume RPC and calls the storage API.
  2. Kubernetes creates a PV representing the new volume.
  3. When a pod is scheduled, the node plugin receives NodeStageVolume (format and stage) then NodePublishVolume (bind-mount into the pod).
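The cluster learns about an installed driver through a `CSIDriver` object. A minimal sketch (the field values are assumptions about a typical block-storage driver, not any vendor's actual manifest):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true   # controller must attach the volume before node staging
  podInfoOnMount: false  # driver does not need pod metadata at mount time
  volumeLifecycleModes:
    - Persistent
```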
| Driver | Backend | Use Case |
|---|---|---|
| ebs.csi.aws.com | AWS EBS | Block storage on AWS |
| efs.csi.aws.com | AWS EFS | Shared NFS on AWS |
| pd.csi.storage.gke.io | GCE PD | Block storage on GKE |
| disk.csi.azure.com | Azure Disk | Block storage on Azure |
| rook-ceph.csi.ceph.com | Ceph (via Rook) | Self-hosted distributed storage |

Access modes describe how a volume can be mounted by nodes. They do not enforce filesystem-level permissions. This distinction causes frequent confusion.

ReadWriteOnce (RWO): the volume can be mounted read-write by a single node. Multiple pods on the same node can all mount it; pods on different nodes cannot. This is the most common mode, mapping to block storage devices (EBS, GCE PD, Azure Disk) that attach to one instance at a time.

From this demo, the writer pod mounts a PVC read-write:

```yaml
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: dynamic-pvc
```

The reader pod mounts a different PVC with readOnly: true on the volumeMount:

```yaml
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: manual-pvc
```

The access mode on the PVC controls node-level attachment. The readOnly flag on the volumeMount controls container-level visibility. They are independent.
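The container-level side of that flag lives under `volumeMounts`. A sketch (the container name, image, and mount path are illustrative):

```yaml
containers:
  - name: reader
    image: busybox
    volumeMounts:
      - name: data
        mountPath: /data
        readOnly: true   # container-level: writes fail even though the PVC is RWO
```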

ReadOnlyMany (ROX): the volume can be mounted read-only by many nodes simultaneously. Useful for shared configuration, static assets, or pre-built datasets.

ReadWriteMany (RWX): the volume can be mounted read-write by many nodes simultaneously. Requires NFS, CephFS, Amazon EFS, or similar. Block storage does not support RWX.

RWX with concurrent writes requires the application to handle file locking. The storage system provides concurrent access, not concurrent safety. Two pods writing to the same file simultaneously will corrupt it without proper locking.

ReadWriteOncePod (RWOP), beta as of Kubernetes 1.27: only one pod cluster-wide can mount the volume read-write. Stricter than RWO (which allows multiple pods on the same node). Useful for databases that assume exclusive write access.
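Requesting this mode is just a different `accessModes` value on the claim. A sketch (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exclusive-db-pvc
spec:
  accessModes:
    - ReadWriteOncePod   # only one pod cluster-wide may mount this read-write
  resources:
    requests:
      storage: 10Gi
```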

The reclaim policy determines what happens to the PV when its PVC is deleted.

Retain: the PV keeps its data and moves to the Released state. An administrator must manually clean up. This demo’s static PV uses Retain:

```yaml
spec:
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/manual-pv-data
```

Use Retain for production databases. Deleting a PVC accidentally will not destroy your data.

Delete: the PV and underlying storage are both deleted with the PVC. This is the default for dynamically provisioned volumes. Use it for caches, temporary pipelines, and reproducible data.

Recycle (deprecated): ran rm -rf /volume/* and made the PV available for reuse. Deprecated because it was too simplistic and insecure. Use dynamic provisioning instead.

StorageClasses can allow resizing with allowVolumeExpansion: true:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true
```

To expand: edit the PVC and increase spec.resources.requests.storage. The CSI driver handles the rest. Some drivers support online expansion. Others require a pod restart.
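The expansion itself is just a spec change on the existing claim. A sketch, assuming a claim named my-data-pvc that started at 10Gi (names and sizes are illustrative):

```yaml
# Edit the existing PVC in place and raise the request.
# The CSI driver resizes the backing volume to match.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: expandable
  resources:
    requests:
      storage: 20Gi   # was 10Gi; only increases are allowed
```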

You can only expand, never shrink. This is a deliberate safety measure against data loss.

Snapshots capture volume state at a point in time. They need a CSI driver with snapshot support and a VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-ebs-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Delete
```

Create a snapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-snapshot
spec:
  volumeSnapshotClassName: csi-ebs-snapclass
  source:
    persistentVolumeClaimName: my-data-pvc
```

Restore by creating a PVC with dataSource:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  dataSource:
    name: my-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

The provisioner creates a new volume pre-populated with the snapshot’s data. This is the foundation for database backup/restore, disaster recovery, and environment cloning.
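The same dataSource mechanism can also clone directly from a live PVC, skipping the snapshot step, if the CSI driver supports volume cloning. A sketch (claim names and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
  dataSource:
    name: my-data-pvc        # source PVC; must be in the same namespace
    kind: PersistentVolumeClaim
```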

In multi-zone clusters, where a volume is created matters. An EBS volume in us-east-1a cannot be attached to a node in us-east-1b.

The volumeBindingMode on StorageClass controls this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
```

Immediate (default): PV is created as soon as the PVC is created. The provisioner picks a zone. If the pod lands in a different zone, it cannot mount the volume.

WaitForFirstConsumer: PV is not created until a pod using the PVC is scheduled. The provisioner creates the volume in the pod’s zone, guaranteeing accessibility.

For multi-zone clusters, WaitForFirstConsumer should be the default. Immediate only makes sense for single-zone clusters or zone-independent backends like NFS.
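Provisioning can additionally be restricted to specific zones with allowedTopologies. A sketch (the class name and zone values are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a
          - us-east-1b
```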

hostPath mounts a host directory directly into the pod. It is simple and works everywhere. This demo uses it:

```yaml
spec:
  hostPath:
    path: /tmp/manual-pv-data
```

But it is dangerous in production. If the pod moves to a different node, it gets a different directory. There is no capacity enforcement. Security is a concern since pods can access any file on the node. Use hostPath only for single-node development like minikube.

Local volumes are similar to hostPath but managed as proper PVs. They are topology-aware (the scheduler knows which node has the storage), support capacity tracking, and work with the standard PVC lifecycle. They require a mandatory nodeAffinity:

```yaml
spec:
  local:
    path: /mnt/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1
```

Use local volumes when you need local SSD performance (databases, caches) with proper lifecycle management. The trade-off: your pod is pinned to a specific node.
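Local PVs have no dynamic provisioner, so they are typically paired with a StorageClass that defers binding until the pod is scheduled. A common sketch (the class name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner  # static PVs only; nothing is auto-created
volumeBindingMode: WaitForFirstConsumer    # lets the scheduler pick the node first
```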

For databases: use RWO or RWOP access, Retain reclaim, provisioned IOPS, WaitForFirstConsumer, and allowVolumeExpansion. StatefulSets give each replica its own PVC:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
```

Each replica gets its own PVC (pgdata-postgres-0, pgdata-postgres-1, etc.). Deleting a StatefulSet does not delete its PVCs, protecting against accidental data loss.

For shared files: use RWX access via NFS, CephFS, or EFS. Latency is higher than block storage, but this works well for media files, uploads, and static assets. All replicas mount the same PVC.
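The shared claim itself can be sketched as follows (the claim name and the EFS-backed class name are assumptions):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes:
    - ReadWriteMany   # every replica, on any node, mounts the same volume
  storageClassName: efs-sc   # assumed name for an EFS-backed StorageClass
  resources:
    requests:
      storage: 100Gi
```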

For scratch space: often emptyDir is enough. For PVC-backed ephemeral storage, use generic ephemeral volumes:

```yaml
volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi
```

Created with the pod, deleted with the pod. Useful for build caches, CI/CD temp files, and ML training scratch space.

Data that outlives pods is the core value of PersistentVolumes. This demo shows it: the writer pod appends data, gets deleted, and a new pod reads the same data from the same PVC. Data survives because it lives on the PV, not in the container filesystem.

| Workload | Access Mode | Reclaim | Binding Mode | Volume Type |
|---|---|---|---|---|
| Single-instance DB | RWO/RWOP | Retain | WaitForFirstConsumer | Block (EBS, GCE PD) |
| Replicated DB | RWO | Retain | WaitForFirstConsumer | Block |
| Shared files | RWX | Retain | N/A | NFS, EFS, CephFS |
| Build cache | RWO | Delete | Immediate | Block or ephemeral |
| ML training data | ROX | Retain | N/A | NFS, S3 via CSI |
| Temp scratch | N/A | N/A | N/A | emptyDir |