
Velero Backup and Restore: Deep Dive

This document explains why Velero exists, how backup and restore workflows operate, and what trade-offs you face when implementing disaster recovery for Kubernetes clusters. It covers storage backends, volume snapshots, resource filtering, and production considerations for multi-cluster recovery scenarios.

Kubernetes clusters contain more than just container images and environment variables. ConfigMaps, Secrets, PersistentVolumeClaims, CRDs, and their instances represent cluster state. When disaster strikes (accidental deletion, cluster failure, regional outage), you need to recreate all of this.

Every Kubernetes cluster stores its state in etcd. Taking etcd snapshots gives you a low-level backup of the entire cluster. But etcd snapshots have serious limitations:

  • They back up everything or nothing. You cannot restore a single namespace or application.
  • Restoring from etcd replaces the entire cluster state. Any changes since the snapshot are lost.
  • They do not capture PersistentVolume data. PVs live outside etcd as actual storage volumes. An etcd snapshot contains the PV object metadata but not the files on disk.
  • Cross-cluster restore is complex. You cannot easily restore an etcd snapshot from cluster A into cluster B.
  • They require etcd-level access, which application teams rarely have.

Velero operates at the Kubernetes API level, not the etcd level. It serializes Kubernetes resources to JSON and stores them in object storage (S3, GCS, Azure Blob). This gives you:

  • Namespace-level granularity. Backup and restore individual namespaces or applications.
  • Label-based filtering. Back up only resources with specific labels.
  • Cross-cluster portability. Restore backups from production into a new disaster recovery cluster.
  • Volume backup. Integrate with volume snapshots (CSI) or file-level backup (Restic/Kopia).
  • Self-service restore. Developers can restore their own namespaces without cluster admin access.

Velero complements etcd snapshots. Use etcd snapshots for full cluster state recovery. Use Velero for application-level backup, namespace migration, and disaster recovery.

Velero has two main components: the Velero server and the Velero CLI.

The Velero server runs as a Deployment in the velero namespace. It consists of several controllers that watch for backup and restore custom resources.

```
Velero CLI                         Velero Server (Pod)            Object Storage
     |                                    |                              |
     | velero backup create               |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     |                  Create Backup CR  |                              |
     |                  <-----(watches)   |                              |
     |                                    |                              |
     |                                    | Query Kubernetes API         |
     |                                    | (GET all resources)          |
     |                                    |                              |
     |                                    | Serialize to JSON            |
     |                                    |                              |
     |                                    | Upload backup tarball        |
     |                                    |----------------------------->|
     |                                    |                              |
     | Check backup status                |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     | Backup Complete                    |                              |
     |<-----------------------------------|                              |
```

The backup controller watches for Backup custom resources. When you run velero backup create, the CLI creates a Backup CR. The server sees it, queries the Kubernetes API for all resources matching the backup scope, serializes them to JSON, compresses them into a tarball, and uploads to the configured BackupStorageLocation.

Restore is the reverse process:

```
Velero CLI                         Velero Server                  Object Storage
     |                                    |                              |
     | velero restore create              |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     |                  Create Restore CR |                              |
     |                  <-----(watches)   |                              |
     |                                    |                              |
     |                                    | Download backup tarball      |
     |                                    |<-----------------------------|
     |                                    |                              |
     |                                    | Extract JSON manifests       |
     |                                    |                              |
     |                                    | Apply resources to cluster   |
     |                                    | (kubectl apply equivalent)   |
     |                                    |                              |
```
The restore controller downloads the backup tarball, extracts each resource definition, and applies it to the cluster via the Kubernetes API. Resources are restored in a specific order (namespaces first, then other resources) to avoid dependency issues.

Velero uses plugins to interface with different storage backends and volume snapshot providers. Plugins are separate binaries that run inside the Velero server pod. The server communicates with them over gRPC.

From the demo’s install command:

```shell
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0
```

The --plugins flag tells Velero to download and install the AWS plugin. This plugin handles both S3 storage (via the BackupStorageLocation interface) and EBS volume snapshots (via the VolumeSnapshotter interface).

Other available plugins:

  • velero-plugin-for-gcp: Google Cloud Storage and GCE Persistent Disk snapshots
  • velero-plugin-for-microsoft-azure: Azure Blob Storage and Azure Disk snapshots
  • velero-plugin-for-csi: Generic CSI volume snapshots (works with any CSI driver)

You can write custom plugins to integrate Velero with proprietary storage systems or add custom backup logic.

Velero stores backup data in object storage via BackupStorageLocations (BSL). A BSL defines where backups go.

From the demo’s install command:

```shell
--bucket velero \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.velero-demo.svc:9000
```

This creates a BackupStorageLocation pointing to MinIO running inside the cluster. In production, you would point to external S3:

```shell
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-backup-bucket \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero
```

The credentials file contains AWS access keys in a specific format:

```ini
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

This file becomes a Kubernetes Secret that Velero uses to authenticate to S3.
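By default, velero install stores the file in a Secret named cloud-credentials in the velero namespace, under a key named cloud. A sketch of the resulting object (the Secret name and key are Velero's defaults; verify against your installation):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials   # default name created by velero install
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=AKIAIOSFODNN7EXAMPLE
    aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```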

MinIO provides an S3-compatible API. This lets you test Velero workflows without an AWS account. From the demo’s MinIO deployment:

```yaml
# From manifests/minio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: velero-demo
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args:
            - server
            - /data
            - --console-address
            - ":9001"
          env:
            - name: MINIO_ROOT_USER
              value: "minio"
            - name: MINIO_ROOT_PASSWORD
              value: "minio123"
          ports:
            - name: api
              containerPort: 9000
            - name: console
              containerPort: 9001
```

Port 9000 serves the S3-compatible API. Port 9001 serves the web console. Velero talks to port 9000.
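The s3Url in the install command (http://minio.velero-demo.svc:9000) implies a ClusterIP Service in front of the Deployment. A minimal sketch, assuming the MinIO pods carry an app: minio label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: velero-demo
spec:
  selector:
    app: minio          # assumed pod label; match your Deployment's template
  ports:
    - name: api
      port: 9000        # S3-compatible API; Velero's s3Url points here
    - name: console
      port: 9001        # web console
```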

MinIO stores data in a PersistentVolume:

```yaml
# From manifests/minio.yaml (container spec)
volumeMounts:
  - name: data
    mountPath: /data
# (pod spec)
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: minio-pvc
```

In this demo setup, MinIO and its data live in the same cluster you are backing up. This is fine for learning but defeats the purpose in production. If the cluster fails, you lose both your workload and your backups. Always use external object storage for real disaster recovery.

Production setups often use multiple BackupStorageLocations for geographic redundancy:

```shell
velero backup-location create us-east \
  --provider aws \
  --bucket velero-us-east-1 \
  --config region=us-east-1

velero backup-location create us-west \
  --provider aws \
  --bucket velero-us-west-2 \
  --config region=us-west-2
```

You can specify which location to use for each backup:

```shell
velero backup create app-backup --storage-location us-west
```

Or configure scheduled backups to use different locations for different retention tiers (daily to us-east, weekly to us-west).
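Expressed declaratively, such a tiered setup is two Schedule resources whose templates set storageLocation (the cron expressions and retention values below are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-us-east
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 2 AM
  template:
    includedNamespaces:
      - production
    storageLocation: us-east   # BSL created above
    ttl: 720h                  # 30-day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-us-west
  namespace: velero
spec:
  schedule: "0 3 * * 0"        # weekly on Sunday
  template:
    includedNamespaces:
      - production
    storageLocation: us-west
    ttl: 2160h                 # 90-day retention
```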

A Backup is a custom resource that triggers the backup process:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: demo-backup
  namespace: velero
spec:
  includedNamespaces:
    - velero-demo
  storageLocation: default
  ttl: 720h # 30 days
```

When you run velero backup create demo-backup --include-namespaces velero-demo, the CLI creates this CR for you.

The backup tarball contains JSON manifests for all resources in the namespace. From the demo’s sample app:

```yaml
# From manifests/sample-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: velero-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: velero-demo
spec:
  replicas: 2
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: velero-demo
```

All three resources are serialized and stored in the backup. When you restore, all three come back.

A Restore is also a custom resource:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-20250411123045
  namespace: velero
spec:
  backupName: demo-backup
  includedNamespaces:
    - velero-demo
```

You can restore to a different namespace using namespace mappings:

```shell
velero restore create --from-backup demo-backup \
  --namespace-mappings velero-demo:new-namespace
```

This takes the backup from velero-demo and recreates all resources in new-namespace instead.
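Under the hood, the CLI flag maps to the Restore spec's namespaceMapping field. A sketch of the equivalent CR (the resource name is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-remapped   # illustrative name
  namespace: velero
spec:
  backupName: demo-backup
  namespaceMapping:
    velero-demo: new-namespace
```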

Schedules automate backup creation on a cron schedule:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - production
    ttl: 720h
```

The demo shows creating a schedule via the CLI:

```shell
velero schedule create daily-backup \
  --schedule="0 */6 * * *" \
  --include-namespaces velero-demo
```

This creates a backup every 6 hours. The TTL (time to live) controls retention. Backups older than the TTL are automatically deleted from object storage.

Hooks let you run commands inside containers before or after a backup. This is critical for databases that need consistent snapshots.

Pre-backup hook (flush database to disk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "PGPASSWORD=$POSTGRES_PASSWORD pg_dump -U postgres -d mydb > /tmp/backup.sql"]'
    pre.hook.backup.velero.io/timeout: "3m"
```

Velero runs this command before backing up the pod. The command dumps the database to a file inside the container. Velero then backs up the pod's volumes; as long as the dump is written to a path on a volume that gets backed up (a PVC mount, for example, rather than ephemeral container storage), the dump file is captured with it.

Post-backup hook (clean up):

```yaml
post.hook.backup.velero.io/command: '["/bin/rm", "/tmp/backup.sql"]'
```

Without hooks, backing up a running database might capture inconsistent state (half-written transactions, dirty buffers).

Velero supports two approaches for backing up persistent volumes.

Volume snapshots use the CSI driver’s snapshot capability. For AWS EBS, this means creating an EBS snapshot. For GCE Persistent Disks, a PD snapshot.

Enable with:

```shell
velero install \
  --use-volume-snapshots=true \
  --snapshot-location-config region=us-east-1
```

When you back up a PVC, Velero triggers a VolumeSnapshot via the CSI driver. The snapshot is stored in the cloud provider’s snapshot system (not in the S3 bucket). The backup tarball contains a reference to the snapshot ID.

On restore, Velero creates a new PVC with dataSource pointing to the snapshot. The CSI driver creates a volume pre-populated with the snapshot data.
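Conceptually, the restored claim looks like a standard CSI snapshot-restore PVC. A sketch with hypothetical names and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data            # illustrative
  namespace: velero-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: demo-backup-snapshot   # hypothetical VolumeSnapshot created during restore
```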

Pros:

  • Fast. Snapshots are block-level copies.
  • Native to the storage system.
  • Incremental. Most cloud providers only store changed blocks.

Cons:

  • Cloud provider-specific. EBS snapshots only work in AWS.
  • Cannot easily move to a different cloud.
  • Some CSI drivers do not support snapshots.

Restic and Kopia are backup tools that copy files from a volume into object storage. Velero integrates them as an alternative to CSI snapshots.

Enable with:

```shell
velero install \
  --use-volume-snapshots=false \
  --uploader-type=kopia
```

When you back up a PVC, Velero deploys a helper pod on the same node as the PVC. The helper mounts the PVC and uploads its contents to S3 via Kopia. The data goes into the same S3 bucket as the Kubernetes manifests.

On restore, Velero deploys another helper pod to download the data from S3 and write it into the new PVC.

Pros:

  • Cloud-agnostic. Works anywhere.
  • Backs up to the same S3 bucket as manifests (simpler management).
  • Works with any storage backend (hostPath, NFS, Ceph).

Cons:

  • Slower than snapshots. Files are copied byte-by-byte.
  • Higher CPU and network usage during backup and restore.
  • Requires Velero to schedule helper pods on the same nodes as the volumes.

The demo disables volume snapshots with --use-volume-snapshots=false because minikube’s hostPath provisioner does not support CSI snapshots.

Velero lets you control exactly what gets backed up.

```shell
# Back up specific namespaces
velero backup create prod-backup --include-namespaces production,staging

# Back up everything except specific namespaces
velero backup create all-except-default --exclude-namespaces default,kube-system
```

From the demo:

```shell
velero backup create demo-backup --include-namespaces velero-demo
```

This backs up only resources in the velero-demo namespace.

```shell
velero backup create app-only --selector app=sample-app
```

From the demo’s sample app:

```yaml
# From manifests/sample-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: velero-demo
  labels:
    app: sample-app
```

The ConfigMap, Deployment, and Service all have app: sample-app. The backup includes all three.

This is useful for multi-tenant clusters where different teams share namespaces. Each team labels their resources with team: frontend or team: backend, and they can back up only their own workloads.
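The same selector can be set directly on a Backup CR via spec.labelSelector, which takes a standard Kubernetes label selector (the backup name below is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: team-frontend-backup   # illustrative
  namespace: velero
spec:
  labelSelector:
    matchLabels:
      team: frontend
```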

```shell
# Back up everything except ConfigMaps
velero backup create no-configmaps \
  --include-namespaces velero-demo \
  --exclude-resources configmaps

# Back up only Deployments and Services
velero backup create minimal \
  --include-namespaces velero-demo \
  --include-resources deployments,services
```

Useful for compliance scenarios where Secrets must not leave the cluster. You can exclude Secrets from backups and restore them separately via a secure channel.

By default, Velero backs up only namespaced resources. Cluster-scoped resources (ClusterRoles, PersistentVolumes, CRDs) require explicit inclusion:

```shell
velero backup create full-cluster \
  --include-cluster-resources=true
```

For namespace-level backups, you typically want --include-cluster-resources=false to avoid conflicts when restoring into a different cluster.

| Aspect | Velero | etcd Snapshot |
| --- | --- | --- |
| Granularity | Namespace or label-based | Entire cluster |
| Cross-cluster restore | Easy | Complex |
| PV data | Optional (via snapshots or Restic) | Not included |
| Restore speed | Slow (API calls) | Fast (direct etcd restore) |
| Access required | Kubernetes API | etcd access (typically admin-only) |

Use etcd snapshots for full cluster disaster recovery. Use Velero for application-level backup and migration.

Kasten K10 is a commercial Kubernetes backup solution. It provides:

  • Integrated UI for backup and restore
  • Application-aware backup (automatic hook generation for databases)
  • Multi-cluster disaster recovery orchestration
  • Advanced policy management (compliance, SLA tracking)

Velero is open source, lightweight, and flexible. K10 is a comprehensive platform with enterprise features and support. If you need backup for dozens of clusters with compliance requirements, K10 may be worth the cost. For most use cases, Velero is sufficient.

For critical databases, consider application-specific backup tools alongside Velero:

  • PostgreSQL: pg_dump, pg_basebackup, WAL archiving
  • MySQL: mysqldump, Percona XtraBackup
  • MongoDB: mongodump, Ops Manager backups

Velero provides cluster-level recovery. Application-specific tools provide point-in-time recovery and transaction-level granularity. Use both.

Configure automated backups with appropriate TTLs:

```shell
# Daily full backup, 30-day retention
velero schedule create daily \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h

# Weekly backup, 90-day retention
velero schedule create weekly \
  --schedule="0 3 * * 0" \
  --include-namespaces production \
  --ttl 2160h
```

Monitor backup success via Prometheus metrics or the Velero CLI:

```shell
velero backup get
velero backup describe daily-20250411020000
```

Set up alerts when backups fail or take too long.
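With the Prometheus Operator, such alerts can be sketched as a PrometheusRule. The metric names below (velero_backup_failure_total, velero_backup_last_successful_timestamp) are Velero's standard server metrics, but verify them against your Velero version; the thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A Velero backup failed within the last hour"
        - alert: VeleroBackupStale
          expr: time() - velero_backup_last_successful_timestamp > 86400
          labels:
            severity: warning
          annotations:
            summary: "No successful Velero backup in 24 hours"
```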

Cross-Cluster Restore for Disaster Recovery


Test your disaster recovery plan by restoring into a separate cluster:

  1. Stand up a new cluster in a different region or cloud.
  2. Install Velero with the same BackupStorageLocation configuration.
  3. Restore from the latest backup.
```shell
# In the DR cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-backup-bucket \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero

velero restore create dr-restore --from-backup daily-20250411020000
```

Common issues:

  • StorageClass mismatch. The DR cluster may not have the same StorageClasses. Use restore mappings to translate.
  • LoadBalancer IP conflicts. Services with type: LoadBalancer get new IPs in the DR cluster. Update DNS.
  • PVC provisioning delays. Large PVs take time to restore from snapshots.
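For the StorageClass mismatch specifically, Velero's restore item actions can translate storage classes via a plugin ConfigMap in the velero namespace. A sketch (the class names are examples; adapt them to your clusters):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp2: gp3   # map the source cluster's class (key) to the DR cluster's class (value)
```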

Object storage should be encrypted at rest (S3 server-side encryption, GCS encryption). For additional security, enable Velero client-side encryption (in development as of 2025).

Grant namespace-scoped backup permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: velero-namespace-backup
  namespace: team-a
rules:
  - apiGroups: ["velero.io"]
    resources: ["backups", "restores"]
    verbs: ["create", "get", "list"]
```

Users can back up and restore their own namespaces without cluster-admin access.
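A matching RoleBinding grants the Role to a team's users (the group name here is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: velero-namespace-backup
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers   # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: velero-namespace-backup
  apiGroup: rbac.authorization.k8s.io
```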

Large clusters generate large backups. A cluster with 50,000 resources can produce multi-GB tarballs. This impacts:

  • Upload time to S3 (may take 10+ minutes)
  • Download time during restore
  • S3 storage costs

Mitigate with:

  • Exclude unnecessary resources (logs, temporary workloads)
  • Use separate schedules for critical and non-critical namespaces
  • Implement backup retention policies to delete old backups

Problem: You back up a namespace, delete everything, restore, and the PVCs are pending.

Cause: PersistentVolumes are cluster-scoped. Your backup included PVCs but not the underlying PVs.

Solution: Either include cluster resources (--include-cluster-resources=true) or use CSI/Restic volume backup to capture PV data, not just metadata.

Problem: Restored PVCs bind to the wrong PVs.

Cause: PV names are globally unique. If you restore into the same cluster, name collisions can occur.

Solution: Delete PVs before restoring, or use namespace mappings to isolate the restored resources.

Problem: Restore fails with “the server could not find the requested resource” errors.

Cause: Custom resources (CRs) are restored before their CustomResourceDefinitions (CRDs).

Solution: Velero restores CRDs first by default. If you excluded cluster resources, restore CRDs manually before restoring CRs:

```shell
velero restore create crds-only \
  --from-backup demo-backup \
  --include-cluster-resources=true \
  --include-resources customresourcedefinitions

velero restore create app-restore \
  --from-backup demo-backup \
  --include-namespaces velero-demo
```

Problem: Restore fails with “namespace already exists” errors.

Cause: You are restoring into a cluster that already has the namespace.

Solution: Use namespace mappings or delete the existing namespace first. Velero does not overwrite existing resources by default. You can opt in with --existing-resource-policy=update:

```shell
velero restore create --from-backup demo-backup \
  --existing-resource-policy=update
```

This updates existing resources instead of skipping them. Use cautiously, as it can overwrite live configuration.
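In CR form, the same behavior corresponds to the Restore spec's existingResourcePolicy field (available in Velero 1.9 and later; the resource name below is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-overwrite    # illustrative
  namespace: velero
spec:
  backupName: demo-backup
  existingResourcePolicy: update
```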

Problem: The Velero pod is CrashLooping.

Cause: Invalid BackupStorageLocation configuration (wrong S3 endpoint, bad credentials).

Solution: Check logs:

```shell
kubectl logs -n velero deployment/velero
```

Common errors:

  • NoSuchBucket: The S3 bucket does not exist.
  • InvalidAccessKeyId: Credentials are wrong.
  • RequestTimeout: Network connectivity issue to S3.

Fix the configuration and restart the Velero pod.

Problem: Backups with file-level backup never complete.

Cause: The Restic/Kopia helper pod cannot mount the PVC (wrong node, security context issues).

Solution: Check the helper pod logs:

```shell
kubectl logs -n velero <restic-pod-name>
```

Ensure the PVC’s access mode allows the helper pod to mount it. If using ReadWriteOnce, the helper must run on the same node as the original pod. If that node is down, the backup cannot proceed.