Pod Security: Deep Dive

This document explains Linux capabilities, seccomp profiles, AppArmor, SELinux contexts, Pod Security Admission, Pod Security Standards, and the security settings that harden containers against privilege escalation and breakout.

By default, a container’s main process runs as root (UID 0) unless the image or the pod spec sets a different user. This is dangerous for several reasons:

  1. If an attacker exploits a vulnerability in the application, they gain root inside the container.
  2. Root inside the container can exploit kernel vulnerabilities to escape to the host.
  3. Root can modify the container filesystem, replace binaries, and install malware.
  4. With certain capabilities, root can access host resources (network, devices, processes).

The demo shows the difference. The insecure pod runs as root:

apiVersion: v1
kind: Pod
metadata:
  name: insecure-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        runAsUser: 0
        privileged: false

Running id in this pod returns uid=0(root). The application has far more privileges than it needs.

Linux capabilities break the monolithic “root” privilege into discrete units. Instead of being all-or-nothing root, a process can have specific capabilities.

When a container starts, the container runtime grants a default set of capabilities:

Capability              What It Allows
CAP_CHOWN               Change file ownership
CAP_DAC_OVERRIDE        Bypass file permission checks
CAP_FSETID              Keep setuid/setgid bits when a file is modified
CAP_FOWNER              Bypass ownership checks
CAP_MKNOD               Create special files
CAP_NET_RAW             Use raw sockets (ping, packet sniffing)
CAP_SETGID              Change process group ID
CAP_SETUID              Change process user ID
CAP_SETFCAP             Set file capabilities
CAP_SETPCAP             Modify process capability sets
CAP_NET_BIND_SERVICE    Bind to ports below 1024
CAP_SYS_CHROOT          Use chroot
CAP_KILL                Send signals to other processes
CAP_AUDIT_WRITE         Write to kernel audit log
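To see which of these capabilities a process actually holds, you can decode the hexadecimal CapEff mask that Linux reports in /proc/<pid>/status. The following is a minimal sketch; the function name decode_caps is hypothetical, and the bit table covers only the 14 default capabilities listed above (bit numbers taken from linux/capability.h).

```python
# Capability bit numbers from <linux/capability.h>, limited to the 14
# capabilities in the default table above.
DEFAULT_CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    27: "CAP_MKNOD",
    29: "CAP_AUDIT_WRITE",
    31: "CAP_SETFCAP",
}


def decode_caps(mask_hex: str) -> list:
    """Return names for the set bits in a CapEff-style hex mask."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits outside the default table are reported by number only.
            names.append(DEFAULT_CAP_BITS.get(bit, f"bit {bit} (not in the default table)"))
        bit += 1
    return names


# Mask that corresponds to exactly the 14 default capabilities.
default_mask = sum(1 << bit for bit in DEFAULT_CAP_BITS)
default_mask_hex = f"{default_mask:016x}"
print(default_mask_hex)
print(decode_caps(default_mask_hex))
```

Inside a container you would pass the value from `grep CapEff /proc/self/status` instead of the computed mask. A pod that drops ALL capabilities should report an all-zero mask for a non-root process.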

These capabilities are not granted by default but are particularly dangerous if added:

Capability             Risk
CAP_SYS_ADMIN          Near-root. Mount filesystems, configure namespaces, and perform many other administrative operations. The most dangerous single capability.
CAP_SYS_PTRACE         Trace and inspect any process. Can read secrets from other processes’ memory.
CAP_NET_ADMIN          Modify network configuration, routing tables, firewall rules.
CAP_SYS_RAWIO          Direct I/O access to hardware. Can read/write raw disk.
CAP_SYS_MODULE         Load and unload kernel modules. Full kernel code execution.
CAP_SYS_BOOT           Reboot the system.
CAP_DAC_READ_SEARCH    Read any file regardless of permissions.
CAP_SYS_TIME           Set the system clock. Can break TLS, Kerberos, and time-based security.

The secure pod in the demo drops everything:

securityContext:
  capabilities:
    drop:
      - ALL

This removes all 14 default capabilities. The process can only do what an unprivileged user can do. If the application needs a specific capability (like binding to port 80), add it back individually:

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE

The principle: drop ALL, add back only what is needed. Never add capabilities speculatively.
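The effect of drop and add can be modeled as simple set arithmetic. This is a simplified sketch, not the runtime's actual implementation: effective_caps is a hypothetical helper, it starts from the 14 defaults listed earlier, and it deliberately does not model add: [ALL].

```python
DEFAULT_CAPS = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW",
    "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
    "SYS_CHROOT", "KILL", "AUDIT_WRITE",
})


def effective_caps(drop=(), add=()):
    """Simplified model: start from the defaults, apply drop, then add."""
    if "ALL" in drop:
        caps = set()
    else:
        caps = set(DEFAULT_CAPS) - set(drop)
    # Capabilities listed under add are granted on top of what remains.
    return caps | set(add)


print(sorted(effective_caps()))                                        # the 14 defaults
print(sorted(effective_caps(drop=["ALL"])))                            # nothing left
print(sorted(effective_caps(drop=["ALL"], add=["NET_BIND_SERVICE"])))  # one capability
```

The last call mirrors the manifest above: everything is removed, then exactly one capability is granted back.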

Seccomp (Secure Computing) restricts which system calls a process can make. A system call is how a user-space process asks the kernel to do something (read a file, open a network connection, allocate memory).

The RuntimeDefault seccomp profile blocks approximately 44 of the 300+ available system calls; the exact list depends on the container runtime. It blocks dangerous calls such as reboot, mount, and clock_settime (and ptrace on kernels older than 4.8) while allowing common operations.

securityContext:
  seccompProfile:
    type: RuntimeDefault

The demo’s secure pod uses this:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

This is set at the pod level, applying to all containers. Container-level seccompProfile overrides the pod-level setting.

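Seccomp can also be disabled explicitly:

seccompProfile:
  type: Unconfined

Unconfined applies no restrictions: all system calls are allowed. Kubernetes uses it when no profile is specified, unless the kubelet is started with --seccomp-default (stable since v1.27), which makes RuntimeDefault the default.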

For stricter security, create custom seccomp profiles that only allow the exact system calls your application needs:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "mprotect", "munmap", "brk", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

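The default action is ERRNO (return an error). Only explicitly listed syscalls are allowed. This is extremely restrictive and requires knowing exactly which system calls your application makes. The list above is illustrative: a working profile also has to allow calls such as execve, which the runtime needs to start the container process.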
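The allowlist logic can be illustrated by looking up the action a profile assigns to a given system call. This is a sketch of the decision rule only; the kernel enforces the real filter as a BPF program, the helper action_for is hypothetical, and argument-level conditions that real profiles may contain are ignored.

```python
import json

# The custom profile from above, embedded as a string for the example.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "mprotect",
                "munmap", "brk", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
"""


def action_for(profile: dict, syscall: str) -> str:
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
for name in ("read", "mount", "execve"):
    print(name, "->", action_for(profile, name))
```

Anything not named in the profile, including mount and execve here, falls through to the default action and fails with an error.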

Tools like strace can record the system calls an application makes during testing, and the oci-seccomp-bpf-hook can generate a profile from the calls it observes.
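As a rough illustration of the recording approach, the sketch below turns strace-style output into an allowlist profile. The helper profile_from_strace and the sample lines are made up for this example; real traces need to cover every code path of the application, and the result should be reviewed before use.

```python
import json
import re

# Matches the syscall name at the start of an strace line, with an optional
# leading PID column (as produced when following child processes).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def profile_from_strace(lines):
    """Collect the syscall names seen in the trace into an allowlist profile."""
    names = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            names.add(match.group(1))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical, abbreviated trace used only to exercise the parser.
sample_trace = [
    'execve("/bin/sleep", ["sleep", "infinity"], 0x7ffd... /* 5 vars */) = 0',
    'brk(NULL) = 0x55d0c8a4b000',
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '[pid  4242] read(3, "...", 832) = 832',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '+++ exited with 0 +++',
]

generated = profile_from_strace(sample_trace)
print(json.dumps(generated, indent=2))
```

Signal and exit lines are skipped because they do not start with a syscall name followed by an opening parenthesis.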

Custom profiles are loaded from the node filesystem:

seccompProfile:
  type: Localhost
  localhostProfile: profiles/my-app.json

The file must exist at /var/lib/kubelet/seccomp/profiles/my-app.json on the node. The Security Profiles Operator can manage these profiles as Kubernetes resources.

AppArmor is a Linux Security Module that confines programs to a limited set of resources. It is path-based: rules specify which files and directories a process can access.

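Before Kubernetes v1.30, AppArmor profiles are specified as annotations keyed by container name (app in this example). From v1.30 on, the securityContext.appArmorProfile field replaces the annotations, which look like this:

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: runtime/default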

Profile types:

  • runtime/default: Default container profile provided by the runtime
  • localhost/<profile-name>: Custom profile loaded on the node
  • unconfined: No restrictions

AppArmor is available on Debian/Ubuntu-based systems. RHEL/Fedora systems use SELinux instead. They serve the same purpose but use different mechanisms.

SELinux (Security-Enhanced Linux) provides mandatory access control. Every process and file has a security label (context). The kernel checks these labels on every access.

securityContext:
  seLinuxOptions:
    level: "s0:c123,c456" # MCS label
    role: "system_r"
    type: "container_t"
    user: "system_u"

In OpenShift, SELinux is enforced by default. Containers run with the container_t type, which restricts:

  • Filesystem access to the container’s own files
  • Network access to assigned ports
  • Inter-process communication to the container’s processes

The Multi-Category Security (MCS) label (s0:c123,c456) isolates containers from each other. Each container gets a unique MCS label. Container A cannot read files with container B’s MCS label, even if the file permissions would normally allow it.
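The category check can be approximated as a subset test: a process may access a file only if the process's categories include all of the file's categories. The sketch below is a simplified model with hypothetical helper names. It ignores category ranges (such as c0.c1023), sensitivity ordering, and SELinux type enforcement, all of which the kernel also evaluates.

```python
def parse_mcs(label: str):
    """Split a label like 's0:c123,c456' into (sensitivity, category numbers)."""
    sensitivity, _, cats = label.partition(":")
    categories = frozenset(int(c[1:]) for c in cats.split(",") if c)
    return sensitivity, categories


def mcs_allows(process_label: str, file_label: str) -> bool:
    """Simplified MCS check: same sensitivity, process categories cover the file's."""
    p_sens, p_cats = parse_mcs(process_label)
    f_sens, f_cats = parse_mcs(file_label)
    return p_sens == f_sens and p_cats >= f_cats


container_a = "s0:c123,c456"
container_b = "s0:c789,c1011"  # hypothetical label for a second container

print(mcs_allows(container_a, container_a))  # own files
print(mcs_allows(container_a, container_b))  # another container's files
print(mcs_allows(container_a, "s0"))         # file without categories
```

Under this model, container A can reach its own files and files without categories, but not files labeled for container B.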

Pod Security Admission (PSA) is the built-in admission controller that enforces Pod Security Standards. It became stable in Kubernetes v1.25, the same release that removed the deprecated PodSecurityPolicy (PSP).

PSA is configured per namespace using labels:

apiVersion: v1
kind: Namespace
metadata:
  name: security-demo
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

Mode       Behavior
enforce    Violations are rejected. Pod is not created.
warn       Violations trigger a warning in the API response. Pod is created.
audit      Violations are logged to the audit log. Pod is created.

The demo uses enforce: baseline together with warn: restricted and audit: restricted. This means:

  • Pods violating baseline standards are blocked (privileged containers, hostNetwork, etc.)
  • Pods violating restricted standards get a warning but are still created
  • Violations are also written to the audit log
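The interaction of the three modes can be sketched as a small decision function. This is an illustrative model, not the admission controller's code: evaluate and violates are hypothetical helpers, pod_level stands for the strictest standard a pod satisfies, and each mode is reported independently.

```python
LEVELS = ["privileged", "baseline", "restricted"]  # least to most strict
PREFIX = "pod-security.kubernetes.io/"


def violates(pod_level: str, policy_level: str) -> bool:
    """A pod violates a policy if it only satisfies a less strict level."""
    return LEVELS.index(pod_level) < LEVELS.index(policy_level)


def evaluate(namespace_labels: dict, pod_level: str) -> dict:
    """Report the outcome of each PSA mode for a pod in the namespace."""
    outcome = {}
    for mode in ("enforce", "warn", "audit"):
        # A mode without a label falls back to the privileged level.
        policy = namespace_labels.get(PREFIX + mode, "privileged")
        outcome[mode] = violates(pod_level, policy)
    return {
        "admitted": not outcome["enforce"],
        "warning": outcome["warn"],
        "audit_entry": outcome["audit"],
    }


demo_namespace = {
    PREFIX + "enforce": "baseline",
    PREFIX + "warn": "restricted",
    PREFIX + "audit": "restricted",
}

print(evaluate(demo_namespace, "privileged"))  # rejected
print(evaluate(demo_namespace, "baseline"))    # admitted with warning and audit entry
print(evaluate(demo_namespace, "restricted"))  # admitted cleanly
```

A pod that meets baseline but not restricted is created, while the user sees a warning and the audit log records the violation.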

You can pin PSA to a specific Kubernetes version:

labels:
  pod-security.kubernetes.io/enforce: baseline
  pod-security.kubernetes.io/enforce-version: v1.28

This is important for clusters that upgrade frequently. Without pinning, the policy definitions change with each Kubernetes version, potentially breaking workloads after an upgrade.

The Privileged level applies no restrictions and allows everything. Use it for system-level workloads like CNI plugins, storage drivers, and monitoring agents that need host access.

The Baseline level blocks known privilege escalations while remaining broadly compatible. It prevents:

  • Privileged containers (privileged: true)
  • Host namespaces (hostNetwork, hostPID, hostIPC)
  • Host ports (specific ranges)
  • Dangerous volume types (hostPath)
  • Adding dangerous capabilities (SYS_ADMIN, etc.)
  • Certain seccomp profiles
  • Certain SELinux types

The demo’s privileged pod is blocked by baseline:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true # <-- Blocked by baseline

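The Restricted level is the strictest. In addition to everything in baseline, it requires:

  • Non-root user (runAsNonRoot: true)
  • All capabilities dropped (only NET_BIND_SERVICE may be added back)
  • allowPrivilegeEscalation: false
  • Seccomp profile set (RuntimeDefault or Localhost)
  • Volumes limited to a small set of types such as configMap, secret, emptyDir, and persistentVolumeClaim

A read-only root filesystem is not part of the restricted standard, but it is a recommended hardening step.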

The demo’s secure pod satisfies all restricted requirements:

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  namespace: security-demo
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}
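A quick way to reason about a manifest like this is to check it against the restricted requirements programmatically. The sketch below is a partial, hypothetical checker: restricted_violations is not part of Kubernetes, it takes the pod as a plain dict, covers only four of the restricted requirements (non-root user, no privilege escalation, all capabilities dropped, seccomp profile set), and skips init containers and volume types.

```python
def restricted_violations(pod: dict) -> list:
    """Check a pod dict against a subset of the restricted requirements."""
    spec = pod["spec"]
    pod_sc = spec.get("securityContext", {})
    problems = []
    for container in spec["containers"]:
        sc = container.get("securityContext", {})
        name = container["name"]
        # runAsNonRoot and seccompProfile may be inherited from the pod level.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        profile = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if profile.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems


secure_pod = {
    "spec": {
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1000,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [{
            "name": "app",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "readOnlyRootFilesystem": True,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    }
}

insecure_pod = {
    "spec": {
        "containers": [{
            "name": "app",
            "securityContext": {"runAsUser": 0, "privileged": False},
        }],
    }
}

print(restricted_violations(secure_pod))
for problem in restricted_violations(insecure_pod):
    print(problem)
```

The secure pod produces an empty list, while the insecure pod from the start of the document fails all four checks. In a real cluster, the warn and audit modes of Pod Security Admission report the authoritative result.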

The allowPrivilegeEscalation field controls the Linux no_new_privs flag. When set to false:

  1. The no_new_privs flag is set on the process.
  2. Setuid and setgid binaries are neutralized. A binary with the setuid bit cannot gain elevated privileges.
  3. Execve cannot grant more privileges than the parent process has.

This is important because even if you run as a non-root user, a setuid binary like sudo or ping could escalate back to root. allowPrivilegeEscalation: false prevents this.

Dropping capabilities with drop: [ALL] does not change allowPrivilegeEscalation. When the field is unset it defaults to true, and it is always true for containers that are privileged or have CAP_SYS_ADMIN. Set it to false explicitly on every container.
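Whether the flag took effect can be verified from inside the container, because Linux exposes it as the NoNewPrivs field in /proc/<pid>/status. The sketch below parses that field; the helper name and the abbreviated sample text are made up for illustration.

```python
def no_new_privs(status_text: str):
    """Return True/False for the NoNewPrivs field, or None if it is absent."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    return None


# Abbreviated, hypothetical excerpt of /proc/self/status in a hardened container.
sample_status = "Name:\tsleep\nUid:\t1000\t1000\t1000\t1000\nNoNewPrivs:\t1\nSeccomp:\t2\n"

print(no_new_privs(sample_status))

# Inside a running container you would read the real file instead:
#   with open("/proc/self/status") as f:
#       print(no_new_privs(f.read()))
```

A value of 1 means execve can no longer grant additional privileges, so setuid binaries run with the caller's privileges.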

Setting readOnlyRootFilesystem: true makes the container’s root filesystem read-only:

securityContext:
  readOnlyRootFilesystem: true

This prevents:

  • Attackers from writing malware to the filesystem
  • Applications from modifying their own binaries
  • Unintended filesystem modifications

But many applications need to write temporary files, logs, or caches. The pattern is to mount writable volumes for specific paths:

containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp
      - name: cache
        mountPath: /var/cache
      - name: run
        mountPath: /var/run
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}

The demo shows this pattern with /tmp mounted as an emptyDir:

      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}

Common paths that need to be writable:

  • /tmp (temporary files)
  • /var/run (PID files, sockets)
  • /var/cache (application caches)
  • /var/log (if logging to files instead of stdout)
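When planning which paths need an emptyDir, it can help to check application paths against the list of writable mounts. The sketch below models only the mount layout; is_writable is a hypothetical helper, and it ignores file permissions and read-only volume mounts.

```python
from pathlib import PurePosixPath


def is_writable(path: str, writable_mounts: list, read_only_root: bool = True) -> bool:
    """Return True if the path lies on a writable mount (or the root is writable)."""
    if not read_only_root:
        return True
    target = PurePosixPath(path)
    # A path is covered if a mount point is the path itself or one of its parents.
    candidates = (target, *target.parents)
    return any(PurePosixPath(mount) in candidates for mount in writable_mounts)


mounts = ["/tmp", "/var/cache", "/var/run"]

print(is_writable("/tmp/upload-123.part", mounts))   # on the /tmp emptyDir
print(is_writable("/var/log/app.log", mounts))       # not mounted, read-only
print(is_writable("/tmpfiles/data", mounts))         # similar prefix, different directory
```

Comparing whole path components avoids treating /tmpfiles as if it were inside /tmp.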

PodSecurityPolicy (PSP) was removed in Kubernetes v1.25. Migrating to Pod Security Admission starts with understanding how the two differ:

Feature         PSP                                    PSA
Granularity     Per-policy with RBAC binding           Per-namespace with labels
Mutation        Can modify pod spec (add defaults)     No mutation, only validation
Custom rules    Flexible, arbitrary rules              Three fixed levels only
Scope           Cluster-wide with namespace binding    Per-namespace

A graduated rollout limits disruption:

  1. Audit first: Set PSA to audit mode on all namespaces. Review the audit log for violations.
  2. Warn second: Switch to warn mode. Users see warnings but pods are not blocked.
  3. Enforce last: After confirming no critical violations, switch to enforce.
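The three stages translate into namespace labels that grow over time. The sketch below generates the label set for each stage; labels_for_stage is a hypothetical helper, and keeping the earlier modes active in later stages is an assumption of this example rather than a requirement.

```python
PREFIX = "pod-security.kubernetes.io/"

# Modes that are active in each rollout stage (earlier modes stay on).
STAGE_MODES = {
    "audit": ["audit"],
    "warn": ["audit", "warn"],
    "enforce": ["audit", "warn", "enforce"],
}


def labels_for_stage(stage: str, level: str) -> dict:
    """Return the PSA namespace labels for a rollout stage and target level."""
    return {PREFIX + mode: level for mode in STAGE_MODES[stage]}


for stage in ("audit", "warn", "enforce"):
    print(stage, labels_for_stage(stage, "baseline"))
```

The resulting dict can be applied to a namespace with any client or rendered into a manifest. Once one level is enforced, the same sequence can be repeated for the next stricter level.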

The demo namespace demonstrates this graduated approach:

labels:
  pod-security.kubernetes.io/enforce: baseline # Block the worst
  pod-security.kubernetes.io/warn: restricted # Warn on non-ideal
  pod-security.kubernetes.io/audit: restricted # Log everything

PSA has three fixed levels. It cannot:

  • Define custom rules (PSP could allow specific host paths)
  • Mutate pods (PSP could inject default seccomp profiles)
  • Allow fine-grained per-workload exceptions

For custom rules, use external policy engines:

  • OPA/Gatekeeper: General-purpose policy engine with Rego language
  • Kyverno: Kubernetes-native policy engine with YAML policies
  • Kubewarden: Policy engine using WebAssembly policies

Security settings can be specified at two levels:

Pod level (spec.securityContext):

  • runAsUser, runAsGroup, runAsNonRoot
  • fsGroup, fsGroupChangePolicy
  • seccompProfile
  • supplementalGroups
  • sysctls

Container level (spec.containers[*].securityContext):

  • runAsUser, runAsGroup, runAsNonRoot
  • readOnlyRootFilesystem
  • allowPrivilegeEscalation
  • capabilities
  • privileged
  • seccompProfile
  • seLinuxOptions

Container-level settings override pod-level settings. In the demo, seccompProfile is set at the pod level (applies to all containers), while capabilities and readOnlyRootFilesystem are set per-container.
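The override rule can be pictured as a dictionary merge in which container values win. This is a minimal sketch: effective_context is a hypothetical helper, and it treats only the fields that appear in both lists above as inheritable.

```python
# Fields that appear at both the pod level and the container level above.
SHARED_FIELDS = ("runAsUser", "runAsGroup", "runAsNonRoot", "seccompProfile")


def effective_context(pod_sc: dict, container_sc: dict) -> dict:
    """Merge security contexts: inherit shared pod fields, container values win."""
    merged = {field: pod_sc[field] for field in SHARED_FIELDS if field in pod_sc}
    merged.update(container_sc)
    return merged


pod_level = {
    "runAsUser": 1000,
    "fsGroup": 1000,  # pod-only field, not inherited by the container context
    "seccompProfile": {"type": "RuntimeDefault"},
}
container_level = {
    "runAsUser": 2000,  # hypothetical override for one container
    "readOnlyRootFilesystem": True,
}

print(effective_context(pod_level, container_level))
print(effective_context(pod_level, {}))
```

The first container runs as UID 2000 with the pod's seccomp profile. A container without its own settings inherits UID 1000 and the same profile.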

For every production workload:

  1. runAsNonRoot: true at pod level
  2. runAsUser: <non-zero> at pod level
  3. allowPrivilegeEscalation: false on every container
  4. readOnlyRootFilesystem: true with writable emptyDir mounts
  5. capabilities.drop: [ALL] on every container
  6. seccompProfile.type: RuntimeDefault at pod level
  7. Resource limits set (prevents DoS via resource exhaustion)
  8. No hostNetwork, hostPID, hostIPC
  9. No hostPath volumes (use PVC instead)
  10. Images scanned for vulnerabilities
  11. Images from trusted registries only
  12. Namespace enforces at least baseline PSS