
Velero Backup and Restore: Deep Dive

This document explains why Velero exists, how backup and restore workflows operate, and what trade-offs you face when implementing disaster recovery for Kubernetes clusters. It covers storage backends, volume snapshots, resource filtering, and production considerations for multi-cluster recovery scenarios.

Kubernetes clusters contain more than just container images and environment variables. ConfigMaps, Secrets, PersistentVolumeClaims, CRDs, and their instances represent cluster state. When disaster strikes (accidental deletion, cluster failure, regional outage), you need to recreate all of this.

Every Kubernetes cluster stores its state in etcd. Taking etcd snapshots gives you a low-level backup of the entire cluster. But etcd snapshots have serious limitations:

  • They back up everything or nothing. You cannot restore a single namespace or application.
  • Restoring from etcd replaces the entire cluster state. Any changes since the snapshot are lost.
  • They do not capture PersistentVolume data. PVs live outside etcd as actual storage volumes. An etcd snapshot contains the PV object metadata but not the files on disk.
  • Cross-cluster restore is complex. You cannot easily restore an etcd snapshot from cluster A into cluster B.
  • They require etcd-level access, which application teams rarely have.

Velero operates at the Kubernetes API level, not the etcd level. It serializes Kubernetes resources to JSON and stores them in object storage (S3, GCS, Azure Blob). This gives you:

  • Namespace-level granularity. Backup and restore individual namespaces or applications.
  • Label-based filtering. Back up only resources with specific labels.
  • Cross-cluster portability. Restore backups from production into a new disaster recovery cluster.
  • Volume backup. Integrate with volume snapshots (CSI) or file-level backup (Restic/Kopia).
  • Self-service restore. Developers can restore their own namespaces without cluster admin access.

Velero complements etcd snapshots. Use etcd snapshots for full cluster state recovery. Use Velero for application-level backup, namespace migration, and disaster recovery.

Velero has two main components: the Velero server and the Velero CLI.

The Velero server runs as a Deployment in the velero namespace. It consists of several controllers that watch for backup and restore custom resources.

```
Velero CLI                         Velero Server (Pod)            Object Storage
     |                                    |                              |
     | velero backup create               |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     |                  Create Backup CR  |                              |
     |                  <-----(watches)   |                              |
     |                                    |                              |
     |                                    | Query Kubernetes API         |
     |                                    | (GET all resources)          |
     |                                    |                              |
     |                                    | Serialize to JSON            |
     |                                    |                              |
     |                                    | Upload backup tarball        |
     |                                    |----------------------------->|
     |                                    |                              |
     | Check backup status                |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     | Backup Complete                    |                              |
     |<-----------------------------------|                              |
```

The backup controller watches for Backup custom resources. When you run velero backup create, the CLI creates a Backup CR. The server sees it, queries the Kubernetes API for all resources matching the backup scope, serializes them to JSON, compresses them into a tarball, and uploads to the configured BackupStorageLocation.

Restore is the reverse process:

```
Velero CLI                         Velero Server                  Object Storage
     |                                    |                              |
     | velero restore create              |                              |
     |----------------------------------->|                              |
     |                                    |                              |
     |                  Create Restore CR |                              |
     |                  <-----(watches)   |                              |
     |                                    |                              |
     |                                    | Download backup tarball      |
     |                                    |<-----------------------------|
     |                                    |                              |
     |                                    | Extract JSON manifests       |
     |                                    |                              |
     |                                    | Apply resources to cluster   |
     |                                    | (kubectl apply equivalent)   |
     |                                    |                              |
```
The restore controller downloads the backup tarball, extracts each resource definition, and applies it to the cluster via the Kubernetes API. Resources are restored in a specific order (namespaces first, then other resources) to avoid dependency issues.

Velero uses plugins to interface with different storage backends and volume snapshot providers. Plugins are separate binaries that run inside the Velero server pod. The server communicates with them over gRPC.

From the demo’s install command:

```shell
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0
```

The --plugins flag tells Velero to download and install the AWS plugin. This plugin handles both S3 storage (via the BackupStorageLocation interface) and EBS volume snapshots (via the VolumeSnapshotter interface).

Other available plugins:

  • velero-plugin-for-gcp: Google Cloud Storage and GCE Persistent Disk snapshots
  • velero-plugin-for-microsoft-azure: Azure Blob Storage and Azure Disk snapshots
  • velero-plugin-for-csi: Generic CSI volume snapshots (works with any CSI driver)

You can write custom plugins to integrate Velero with proprietary storage systems or add custom backup logic.

Velero stores backup data in object storage via BackupStorageLocations (BSL). A BSL defines where backups go.

From the demo’s install command:

```shell
--bucket velero \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.velero-demo.svc:9000
```

This creates a BackupStorageLocation pointing to MinIO running inside the cluster. In production, you would point to external S3:

```shell
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-backup-bucket \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero
```

The credentials file contains AWS access keys in a specific format:

```ini
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

This file becomes a Kubernetes Secret that Velero uses to authenticate to S3.
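By default, velero install stores the file in a Secret named cloud-credentials in the velero namespace, under a key named cloud. A sketch of the resulting object (the Secret name and key are Velero's defaults; verify against your installation):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials   # default name created by velero install
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=AKIAIOSFODNN7EXAMPLE
    aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```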

MinIO provides an S3-compatible API. This lets you test Velero workflows without an AWS account. From the demo’s MinIO deployment:

```yaml
# From manifests/minio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: velero-demo
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args:
            - server
            - /data
            - --console-address
            - ":9001"
          env:
            - name: MINIO_ROOT_USER
              value: "minio"
            - name: MINIO_ROOT_PASSWORD
              value: "minio123"
          ports:
            - name: api
              containerPort: 9000
            - name: console
              containerPort: 9001
```

Port 9000 serves the S3-compatible API. Port 9001 serves the web console. Velero talks to port 9000.
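The s3Url in the install command (http://minio.velero-demo.svc:9000) implies a ClusterIP Service in front of the Deployment. A minimal sketch, assuming the MinIO pods carry an app: minio label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: velero-demo
spec:
  selector:
    app: minio          # assumed pod label; match your Deployment's template
  ports:
    - name: api
      port: 9000        # S3-compatible API; Velero's s3Url points here
    - name: console
      port: 9001        # web console
```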

MinIO stores data in a PersistentVolume:

```yaml
# From manifests/minio.yaml (container spec)
volumeMounts:
  - name: data
    mountPath: /data
# (pod spec)
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: minio-pvc
```

In this demo setup, MinIO and its data live in the same cluster you are backing up. This is fine for learning but defeats the purpose in production. If the cluster fails, you lose both your workload and your backups. Always use external object storage for real disaster recovery.

Production setups often use multiple BackupStorageLocations for geographic redundancy:

```shell
velero backup-location create us-east \
  --provider aws \
  --bucket velero-us-east-1 \
  --config region=us-east-1

velero backup-location create us-west \
  --provider aws \
  --bucket velero-us-west-2 \
  --config region=us-west-2
```

You can specify which location to use for each backup:

```shell
velero backup create app-backup --storage-location us-west
```

Or configure scheduled backups to use different locations for different retention tiers (daily to us-east, weekly to us-west).
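Expressed declaratively, such a tiered setup is two Schedule resources whose templates set storageLocation (the cron expressions and retention values below are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-us-east
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 2 AM
  template:
    includedNamespaces:
      - production
    storageLocation: us-east   # BSL created above
    ttl: 720h                  # 30-day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-us-west
  namespace: velero
spec:
  schedule: "0 3 * * 0"        # weekly on Sunday
  template:
    includedNamespaces:
      - production
    storageLocation: us-west
    ttl: 2160h                 # 90-day retention
```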

A Backup is a custom resource that triggers the backup process:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: demo-backup
  namespace: velero
spec:
  includedNamespaces:
    - velero-demo
  storageLocation: default
  ttl: 720h # 30 days
```

When you run velero backup create demo-backup --include-namespaces velero-demo, the CLI creates this CR for you.

The backup tarball contains JSON manifests for all resources in the namespace. From the demo’s sample app:

```yaml
# From manifests/sample-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: velero-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: velero-demo
spec:
  replicas: 2
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: velero-demo
```

All three resources are serialized and stored in the backup. When you restore, all three come back.

A Restore is also a custom resource:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-20250411123045
  namespace: velero
spec:
  backupName: demo-backup
  includedNamespaces:
    - velero-demo
```

You can restore to a different namespace using namespace mappings:

```shell
velero restore create --from-backup demo-backup \
  --namespace-mappings velero-demo:new-namespace
```

This takes the backup from velero-demo and recreates all resources in new-namespace instead.
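Under the hood, the CLI flag maps to the Restore spec's namespaceMapping field. A sketch of the equivalent CR (the resource name is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-remapped   # illustrative name
  namespace: velero
spec:
  backupName: demo-backup
  namespaceMapping:
    velero-demo: new-namespace
```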

Schedules automate backup creation on a cron schedule:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - production
    ttl: 720h
```

The demo shows creating a schedule via the CLI:

```shell
velero schedule create daily-backup \
  --schedule="0 */6 * * *" \
  --include-namespaces velero-demo
```

This creates a backup every 6 hours. The TTL (time to live) controls retention. Backups older than the TTL are automatically deleted from object storage.

Hooks let you run commands inside containers before or after a backup. This is critical for databases that need consistent snapshots.

Pre-backup hook (flush database to disk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "PGPASSWORD=$POSTGRES_PASSWORD pg_dump -U postgres -d mydb > /tmp/backup.sql"]'
    pre.hook.backup.velero.io/timeout: "3m"
```

Velero runs this command before backing up the pod. The command dumps the database to a file inside the container. Velero then backs up the pod's volumes; as long as the dump is written to a path on a volume that gets backed up (a PVC mount, for example, rather than ephemeral container storage), the dump file is captured with it.

Post-backup hook (clean up):

```yaml
post.hook.backup.velero.io/command: '["/bin/rm", "/tmp/backup.sql"]'
```

Without hooks, backing up a running database might capture inconsistent state (half-written transactions, dirty buffers).

Velero supports two approaches for backing up persistent volumes.

Volume snapshots use the CSI driver’s snapshot capability. For AWS EBS, this means creating an EBS snapshot. For GCE Persistent Disks, a PD snapshot.

Enable with:

```shell
velero install \
  --use-volume-snapshots=true \
  --snapshot-location-config region=us-east-1
```

When you back up a PVC, Velero triggers a VolumeSnapshot via the CSI driver. The snapshot is stored in the cloud provider’s snapshot system (not in the S3 bucket). The backup tarball contains a reference to the snapshot ID.

On restore, Velero creates a new PVC with dataSource pointing to the snapshot. The CSI driver creates a volume pre-populated with the snapshot data.
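Conceptually, the restored claim looks like a standard CSI snapshot-restore PVC. A sketch with hypothetical names and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data            # illustrative
  namespace: velero-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: demo-backup-snapshot   # hypothetical VolumeSnapshot created during restore
```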

Pros:

  • Fast. Snapshots are block-level copies.
  • Native to the storage system.
  • Incremental. Most cloud providers only store changed blocks.

Cons:

  • Cloud provider-specific. EBS snapshots only work in AWS.
  • Cannot easily move to a different cloud.
  • Some CSI drivers do not support snapshots.

Restic and Kopia are backup tools that copy files from a volume into object storage. Velero integrates them as an alternative to CSI snapshots.

Enable with:

```shell
velero install \
  --use-volume-snapshots=false \
  --uploader-type=kopia
```

When you back up a PVC, Velero deploys a helper pod on the same node as the PVC. The helper mounts the PVC and uploads its contents to S3 via Kopia. The data goes into the same S3 bucket as the Kubernetes manifests.

On restore, Velero deploys another helper pod to download the data from S3 and write it into the new PVC.

Pros:

  • Cloud-agnostic. Works anywhere.
  • Backs up to the same S3 bucket as manifests (simpler management).
  • Works with any storage backend (hostPath, NFS, Ceph).

Cons:

  • Slower than snapshots. Files are copied byte-by-byte.
  • Higher CPU and network usage during backup and restore.
  • Requires Velero to schedule helper pods on the same nodes as the volumes.

The demo disables volume snapshots with --use-volume-snapshots=false because minikube’s hostPath provisioner does not support CSI snapshots.

Velero lets you control exactly what gets backed up.

```shell
# Back up specific namespaces
velero backup create prod-backup --include-namespaces production,staging

# Back up everything except specific namespaces
velero backup create all-except-default --exclude-namespaces default,kube-system
```

From the demo:

```shell
velero backup create demo-backup --include-namespaces velero-demo
```

This backs up only resources in the velero-demo namespace.

```shell
velero backup create app-only --selector app=sample-app
```

From the demo’s sample app:

```yaml
# From manifests/sample-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: velero-demo
  labels:
    app: sample-app
```

The ConfigMap, Deployment, and Service all have app: sample-app. The backup includes all three.

This is useful for multi-tenant clusters where different teams share namespaces. Each team labels their resources with team: frontend or team: backend, and they can back up only their own workloads.
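The same selector can be set directly on a Backup CR via spec.labelSelector, which takes a standard Kubernetes label selector (the backup name below is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: team-frontend-backup   # illustrative
  namespace: velero
spec:
  labelSelector:
    matchLabels:
      team: frontend
```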

```shell
# Back up everything except ConfigMaps
velero backup create no-configmaps \
  --include-namespaces velero-demo \
  --exclude-resources configmaps

# Back up only Deployments and Services
velero backup create minimal \
  --include-namespaces velero-demo \
  --include-resources deployments,services
```

Useful for compliance scenarios where Secrets must not leave the cluster. You can exclude Secrets from backups and restore them separately via a secure channel.

By default, Velero backs up only namespaced resources. Cluster-scoped resources (ClusterRoles, PersistentVolumes, CRDs) require explicit inclusion:

```shell
velero backup create full-cluster \
  --include-cluster-resources=true
```

For namespace-level backups, you typically want --include-cluster-resources=false to avoid conflicts when restoring into a different cluster.

| Aspect | Velero | etcd Snapshot |
| --- | --- | --- |
| Granularity | Namespace or label-based | Entire cluster |
| Cross-cluster restore | Easy | Complex |
| PV data | Optional (via snapshots or Restic) | Not included |
| Restore speed | Slow (API calls) | Fast (direct etcd restore) |
| Access required | Kubernetes API | etcd access (typically admin-only) |

Use etcd snapshots for full cluster disaster recovery. Use Velero for application-level backup and migration.

Kasten K10 is a commercial Kubernetes backup solution. It provides:

  • Integrated UI for backup and restore
  • Application-aware backup (automatic hook generation for databases)
  • Multi-cluster disaster recovery orchestration
  • Advanced policy management (compliance, SLA tracking)

Velero is open source, lightweight, and flexible. K10 is a comprehensive platform with enterprise features and support. If you need backup for dozens of clusters with compliance requirements, K10 may be worth the cost. For most use cases, Velero is sufficient.

For critical databases, consider application-specific backup tools alongside Velero:

  • PostgreSQL: pg_dump, pg_basebackup, WAL archiving
  • MySQL: mysqldump, Percona XtraBackup
  • MongoDB: mongodump, Ops Manager backups

Velero provides cluster-level recovery. Application-specific tools provide point-in-time recovery and transaction-level granularity. Use both.

Configure automated backups with appropriate TTLs:

```shell
# Daily full backup, 30-day retention
velero schedule create daily \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h

# Weekly backup, 90-day retention
velero schedule create weekly \
  --schedule="0 3 * * 0" \
  --include-namespaces production \
  --ttl 2160h
```

Monitor backup success via Prometheus metrics or the Velero CLI:

```shell
velero backup get
velero backup describe daily-20250411020000
```

Set up alerts when backups fail or take too long.
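With the Prometheus Operator, such alerts can be sketched as a PrometheusRule. The metric names below (velero_backup_failure_total, velero_backup_last_successful_timestamp) are Velero's standard server metrics, but verify them against your Velero version; the thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A Velero backup failed within the last hour"
        - alert: VeleroBackupStale
          expr: time() - velero_backup_last_successful_timestamp > 86400
          labels:
            severity: warning
          annotations:
            summary: "No successful Velero backup in 24 hours"
```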

Cross-Cluster Restore for Disaster Recovery


Test your disaster recovery plan by restoring into a separate cluster:

  1. Stand up a new cluster in a different region or cloud.
  2. Install Velero with the same BackupStorageLocation configuration.
  3. Restore from the latest backup.
```shell
# In the DR cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-backup-bucket \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero

velero restore create dr-restore --from-backup daily-20250411020000
```

Common issues:

  • StorageClass mismatch. The DR cluster may not have the same StorageClasses. Use restore mappings to translate.
  • LoadBalancer IP conflicts. Services with type: LoadBalancer get new IPs in the DR cluster. Update DNS.
  • PVC provisioning delays. Large PVs take time to restore from snapshots.
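For the StorageClass mismatch specifically, Velero's restore item actions can translate storage classes via a plugin ConfigMap in the velero namespace. A sketch (the class names are examples; adapt them to your clusters):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp2: gp3   # map the source cluster's class (key) to the DR cluster's class (value)
```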

Object storage should be encrypted at rest (S3 server-side encryption, GCS encryption). For additional security, enable Velero client-side encryption (in development as of 2025).

Grant namespace-scoped backup permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: velero-namespace-backup
  namespace: team-a
rules:
  - apiGroups: ["velero.io"]
    resources: ["backups", "restores"]
    verbs: ["create", "get", "list"]
```

Users can back up and restore their own namespaces without cluster-admin access.
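A matching RoleBinding grants the Role to a team's users (the group name here is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: velero-namespace-backup
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers   # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: velero-namespace-backup
  apiGroup: rbac.authorization.k8s.io
```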

Large clusters generate large backups. A cluster with 50,000 resources can produce multi-GB tarballs. This impacts:

  • Upload time to S3 (may take 10+ minutes)
  • Download time during restore
  • S3 storage costs

Mitigate with:

  • Exclude unnecessary resources (logs, temporary workloads)
  • Use separate schedules for critical and non-critical namespaces
  • Implement backup retention policies to delete old backups

Problem: You back up a namespace, delete everything, restore, and the PVCs are pending.

Cause: PersistentVolumes are cluster-scoped. Your backup included PVCs but not the underlying PVs.

Solution: Either include cluster resources (--include-cluster-resources=true) or use CSI/Restic volume backup to capture PV data, not just metadata.

Problem: Restored PVCs bind to the wrong PVs.

Cause: PV names are globally unique. If you restore into the same cluster, name collisions can occur.

Solution: Delete PVs before restoring, or use namespace mappings to isolate the restored resources.

Problem: Restore fails with “the server could not find the requested resource” errors.

Cause: Custom resources (CRs) are restored before their CustomResourceDefinitions (CRDs).

Solution: Velero restores CRDs first by default. If you excluded cluster resources, restore CRDs manually before restoring CRs:

```shell
velero restore create crds-only \
  --from-backup demo-backup \
  --include-cluster-resources=true \
  --include-resources customresourcedefinitions

velero restore create app-restore \
  --from-backup demo-backup \
  --include-namespaces velero-demo
```

Problem: Restore fails with “namespace already exists” errors.

Cause: You are restoring into a cluster that already has the namespace.

Solution: Use namespace mappings or delete the existing namespace first. Velero does not overwrite existing resources by default. You can opt in with --existing-resource-policy=update:

```shell
velero restore create --from-backup demo-backup \
  --existing-resource-policy=update
```

This updates existing resources instead of skipping them. Use cautiously, as it can overwrite live configuration.
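In CR form, the same behavior corresponds to the Restore spec's existingResourcePolicy field (available in Velero 1.9 and later; the resource name below is illustrative):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-backup-overwrite    # illustrative
  namespace: velero
spec:
  backupName: demo-backup
  existingResourcePolicy: update
```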

Problem: The Velero pod is CrashLooping.

Cause: Invalid BackupStorageLocation configuration (wrong S3 endpoint, bad credentials).

Solution: Check logs:

```shell
kubectl logs -n velero deployment/velero
```

Common errors:

  • NoSuchBucket: The S3 bucket does not exist.
  • InvalidAccessKeyId: Credentials are wrong.
  • RequestTimeout: Network connectivity issue to S3.

Fix the configuration and restart the Velero pod.

Problem: Backups with file-level backup never complete.

Cause: The Restic/Kopia helper pod cannot mount the PVC (wrong node, security context issues).

Solution: Check the helper pod logs:

```shell
kubectl logs -n velero <restic-pod-name>
```

Ensure the PVC’s access mode allows the helper pod to mount it. If using ReadWriteOnce, the helper must run on the same node as the original pod. If that node is down, the backup cannot proceed.