Pod Security: Deep Dive
This document explains Linux capabilities, seccomp profiles, AppArmor, SELinux contexts, Pod Security Admission, Pod Security Standards, and the security settings that harden containers against privilege escalation and breakout.
Why Running as Root Is Dangerous
By default, a container's main process runs as root (UID 0) unless the image or the pod spec sets a different user. This is dangerous for several reasons:
- If an attacker exploits a vulnerability in the application, they gain root inside the container.
- Root inside the container can exploit kernel vulnerabilities to escape to the host.
- Root can modify the container filesystem, replace binaries, and install malware.
- With certain capabilities, root can access host resources (network, devices, processes).
The demo shows the difference. The insecure pod runs as root:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: insecure-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        runAsUser: 0
        privileged: false
```
Running id in this pod returns uid=0(root). The application has far more privileges than it needs.
Linux Capabilities
Linux capabilities break the monolithic “root” privilege into discrete units. Instead of being all-or-nothing root, a process can have specific capabilities.
Default Capabilities in Docker/Containerd
When a container starts, the container runtime grants a default set of capabilities:
| Capability | What It Allows |
|---|---|
| CAP_CHOWN | Change file ownership |
| CAP_DAC_OVERRIDE | Bypass file permission checks |
| CAP_FSETID | Keep setuid/setgid bits when a file is modified |
| CAP_FOWNER | Bypass permission checks that require file ownership |
| CAP_MKNOD | Create special files |
| CAP_NET_RAW | Use raw sockets (ping, packet sniffing) |
| CAP_SETGID | Change process group ID |
| CAP_SETUID | Change process user ID |
| CAP_SETFCAP | Set file capabilities |
| CAP_SETPCAP | Modify the process's capability sets |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 |
| CAP_SYS_CHROOT | Use chroot |
| CAP_KILL | Send signals to processes regardless of owner |
| CAP_AUDIT_WRITE | Write to kernel audit log |
Dangerous Capabilities
These capabilities are not granted by default but are particularly dangerous if added:
| Capability | Risk |
|---|---|
| CAP_SYS_ADMIN | Near-root. Mount filesystems, configure namespaces, trace processes. The most dangerous single capability. |
| CAP_SYS_PTRACE | Trace and inspect any process. Can read secrets from other processes’ memory. |
| CAP_NET_ADMIN | Modify network configuration, routing tables, firewall rules. |
| CAP_SYS_RAWIO | Direct I/O access to hardware. Can read/write raw disk. |
| CAP_SYS_MODULE | Load and unload kernel modules. Full kernel code execution. |
| CAP_SYS_BOOT | Reboot the system. |
| CAP_DAC_READ_SEARCH | Read any file regardless of permissions. |
| CAP_SYS_TIME | Set the system clock. Can break TLS, Kerberos, and time-based security. |
Dropping All Capabilities
The secure pod in the demo drops everything:
```yaml
securityContext:
  capabilities:
    drop:
      - ALL
```
This removes all 14 default capabilities. The process can only do what an unprivileged user can do. If the application needs a specific capability (like binding to port 80), add it back individually:
```yaml
securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
```
The principle: drop ALL, add back only what is needed. Never add capabilities speculatively.
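One way to confirm which capabilities a process actually holds is to decode the CapEff bitmask from /proc/self/status. The following is a minimal sketch, not part of the demo: the function name decode_caps is hypothetical, the table covers only capability bits 0 through 31 as numbered in the kernel's linux/capability.h, and the sample mask is the value commonly reported for a root process holding the 14 defaults listed earlier.

```python
# Capability names indexed by bit number (0-31), following linux/capability.h.
# Capabilities above bit 31 are left out of this sketch.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Return the names of the capabilities whose bits are set in hex_mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


# Assumed sample values: the default set, and the mask after drop: [ALL].
default_caps = decode_caps("00000000a80425fb")
dropped_all = decode_caps("0000000000000000")
```

The same decoding applies to the CapBnd line of /proc/self/status, which shows the bounding set rather than the effective set.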
Seccomp Profiles
Seccomp (Secure Computing) restricts which system calls a process can make. A system call is how a user-space process asks the kernel to do something (read a file, open a network connection, allocate memory).
RuntimeDefault Profile
The RuntimeDefault seccomp profile blocks approximately 44 of the 300+ available system calls. It blocks dangerous calls like reboot, mount, ptrace, and clock_settime while allowing common operations.
```yaml
securityContext:
  seccompProfile:
    type: RuntimeDefault
```
The demo’s secure pod uses this:
```yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
```
This is set at the pod level, applying to all containers. Container-level seccompProfile overrides the pod-level setting.
Unconfined Profile
```yaml
seccompProfile:
  type: Unconfined
```
No restrictions. All system calls are allowed. In Kubernetes this is the effective default when no profile is specified, unless the kubelet runs with the --seccomp-default flag, which makes RuntimeDefault the default instead.
Custom Profiles
For stricter security, create custom seccomp profiles that only allow the exact system calls your application needs:
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "mprotect", "munmap", "brk", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
The default action is ERRNO (return an error). Only explicitly listed syscalls are allowed. This is extremely restrictive and requires knowing exactly which system calls your application makes.
Tools like strace or the oci-seccomp-bpf-hook can record the system calls an application makes during testing. The hook writes a profile automatically; strace output has to be converted into one.
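Turning a recorded list of syscall names into an allowlist profile of the shape shown above can be sketched as follows. This is an illustration under assumptions: build_allowlist_profile and the observed list are hypothetical, and a real recording would contain far more syscalls than this sample.

```python
import json


def build_allowlist_profile(syscalls, arch="SCMP_ARCH_X86_64"):
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    names = sorted(set(syscalls))  # de-duplicate and keep the output stable
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical recording; duplicates are expected in a raw trace.
observed = ["read", "write", "close", "read", "exit_group"]
profile = build_allowlist_profile(observed)
profile_json = json.dumps(profile, indent=2)
```

A profile generated this way is only as complete as the test run that produced the recording, so code paths not exercised during testing will fail with an error in production.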
Localhost Profiles
Custom profiles are loaded from the node filesystem:
```yaml
seccompProfile:
  type: Localhost
  localhostProfile: profiles/my-app.json
```
The file must exist at /var/lib/kubelet/seccomp/profiles/my-app.json on the node. The Security Profiles Operator can manage these profiles as Kubernetes resources.
AppArmor Profiles
AppArmor is a Linux Security Module that confines programs to a limited set of resources. It is path-based: rules specify which files and directories a process can access.
AppArmor profiles are specified as annotations whose key ends in the container name (app in the example). From Kubernetes v1.30, the securityContext.appArmorProfile field replaces the annotation.
```yaml
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: runtime/default
```
Profile types:
- runtime/default: Default container profile provided by the runtime
- localhost/<profile-name>: Custom profile loaded on the node
- unconfined: No restrictions
AppArmor is available on Debian/Ubuntu-based systems. RHEL/Fedora systems use SELinux instead. They serve the same purpose but use different mechanisms.
SELinux Contexts
SELinux (Security-Enhanced Linux) provides mandatory access control. Every process and file has a security label (context). The kernel checks these labels on every access.
```yaml
securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"  # MCS label
    role: "system_r"
    type: "container_t"
    user: "system_u"
```
In OpenShift, SELinux is enforced by default. Containers run with the container_t type, which restricts:
- Filesystem access to the container’s own files
- Network access to assigned ports
- Inter-process communication to the container’s processes
The Multi-Category Security (MCS) label (s0:c123,c456) isolates containers from each other. Each container gets a unique MCS label. Container A cannot read files with container B’s MCS label, even if the file permissions would normally allow it.
Pod Security Admission
Pod Security Admission (PSA) is the built-in admission controller that enforces Pod Security Standards. It replaced the deprecated PodSecurityPolicy (PSP) in Kubernetes v1.25.
Configuration via Namespace Labels
PSA is configured per namespace using labels:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: security-demo
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```
Three Modes
| Mode | Behavior |
|---|---|
| enforce | Violations are rejected. Pod is not created. |
| warn | Violations trigger a warning in the API response. Pod is created. |
| audit | Violations are logged to the audit log. Pod is created. |
The demo uses enforce: baseline, warn: restricted, and audit: restricted. This means:
- Pods violating baseline standards are blocked (privileged containers, hostNetwork, etc.)
- Pods violating restricted standards get a warning but are still created
- Violations of restricted standards are also written to the audit log
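How the three modes combine for a single pod can be sketched as a small lookup. This is a simplified model, not the admission controller itself: psa_outcome is a hypothetical name, pod_level stands for the strictest standard the pod satisfies, and an unset mode is treated as privileged (the assumed cluster default).

```python
# Levels ordered from least to most strict.
LEVELS = {"privileged": 0, "baseline": 1, "restricted": 2}
PREFIX = "pod-security.kubernetes.io/"


def psa_outcome(ns_labels: dict, pod_level: str) -> dict:
    """Report which modes a pod violates, given the strictest level it satisfies."""

    def violates(mode: str) -> bool:
        required = ns_labels.get(PREFIX + mode, "privileged")
        return LEVELS[pod_level] < LEVELS[required]

    return {
        "rejected": violates("enforce"),
        "warned": violates("warn"),
        "audited": violates("audit"),
    }


# The demo namespace labels.
demo_labels = {
    PREFIX + "enforce": "baseline",
    PREFIX + "warn": "restricted",
    PREFIX + "audit": "restricted",
}
baseline_pod = psa_outcome(demo_labels, "baseline")
privileged_pod = psa_outcome(demo_labels, "privileged")
```

With the demo labels, a pod that meets baseline but not restricted is admitted with a warning and an audit entry, while a pod that only meets privileged is rejected.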
Version Pinning
You can pin PSA to a specific Kubernetes version:
```yaml
labels:
  pod-security.kubernetes.io/enforce: baseline
  pod-security.kubernetes.io/enforce-version: v1.28
```
This is important for clusters that upgrade frequently. Without pinning, each mode follows the latest policy definitions, which can change between Kubernetes versions and break workloads after an upgrade.
Pod Security Standards (PSS) Levels
Privileged
No restrictions. Allows everything. Use for system-level workloads like CNI plugins, storage drivers, and monitoring agents that need host access.
Baseline
Blocks known privilege escalations while remaining broadly compatible. Prevents:
- Privileged containers (privileged: true)
- Host namespaces (hostNetwork, hostPID, hostIPC)
- Host ports (disallowed, or limited to a known list)
- Dangerous volume types (hostPath)
- Adding dangerous capabilities (SYS_ADMIN, etc.)
- An explicitly Unconfined seccomp profile
- Custom SELinux types outside the standard container types
The demo’s privileged pod is blocked by baseline:
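```yaml
apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true  # <-- Blocked by baseline
```
Restricted
The strictest level. Requires everything in baseline, plus:
- Non-root user (runAsNonRoot: true)
- All capabilities dropped (only NET_BIND_SERVICE may be added back)
- allowPrivilegeEscalation: false
- Seccomp profile set (RuntimeDefault or Localhost)
- Volume types limited to configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, and secret
Restricted does not require a read-only root filesystem, but readOnlyRootFilesystem: true is a worthwhile addition.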
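The container-level requirements can be sketched as a check over a pod spec expressed as a dictionary. This is a rough illustration only: restricted_violations is a hypothetical helper, it covers just runAsNonRoot, allowPrivilegeEscalation, dropped capabilities, and the seccomp profile, and it ignores init containers, ephemeral containers, and volume types that the real admission controller also evaluates.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """List the (partial) restricted-level requirements a pod spec fails."""
    pod_sc = pod_spec.get("securityContext", {})
    problems = []
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name", "?")

        # Container-level values take precedence over pod-level ones.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")

        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")

        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities must drop ALL")

        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems


# Trimmed-down versions of the demo's secure and insecure pods.
secure_pod = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
insecure_pod = {
    "containers": [{"name": "app", "securityContext": {"runAsUser": 0, "privileged": False}}],
}
```

Running the check on insecure_pod returns one message per failed requirement, which is roughly what a warn: restricted namespace reports when such a pod is created.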
The demo’s secure pod satisfies all restricted requirements:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  namespace: security-demo
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}
```
allowPrivilegeEscalation Mechanics
The allowPrivilegeEscalation field controls the Linux no_new_privs flag. When set to false:
- The no_new_privs flag is set on the process.
- Setuid and setgid binaries are neutralized. A binary with the setuid bit cannot gain elevated privileges.
- Execve cannot grant more privileges than the parent process has.
This is important because even if you run as a non-root user, a setuid binary like sudo or ping could escalate back to root. allowPrivilegeEscalation: false prevents this.
Dropping capabilities with drop: [ALL] does not change allowPrivilegeEscalation. The field is treated as true when unset, and it is always true for containers that are privileged or have CAP_SYS_ADMIN, so set it to false explicitly.
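Whether the flag actually took effect can be checked from inside the container by reading the NoNewPrivs line of /proc/self/status. A minimal sketch, assuming a kernel recent enough to expose that field; the function name and the sample text are illustrative, not taken from the demo.

```python
def no_new_privs_enabled(status_text: str) -> bool:
    """Parse /proc/<pid>/status content and report whether no_new_privs is set."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    raise KeyError("NoNewPrivs field not found in status text")


# Assumed excerpt of /proc/self/status from a hardened container.
sample = "Name:\tsleep\nUid:\t1000\t1000\t1000\t1000\nNoNewPrivs:\t1\nSeccomp:\t2\n"
hardened = no_new_privs_enabled(sample)
```

Inside a running container the input would come from reading /proc/self/status, for example with open("/proc/self/status").read().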
readOnlyRootFilesystem Patterns
Setting readOnlyRootFilesystem: true makes the container’s root filesystem read-only:
```yaml
securityContext:
  readOnlyRootFilesystem: true
```
This prevents:
- Attackers from writing malware to the filesystem
- Applications from modifying their own binaries
- Unintended filesystem modifications
But many applications need to write temporary files, logs, or caches. The pattern is to mount writable volumes for specific paths:
```yaml
containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp
      - name: cache
        mountPath: /var/cache
      - name: run
        mountPath: /var/run
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
```
The demo shows this pattern with /tmp mounted as an emptyDir:
```yaml
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}
```
Common paths that need to be writable:
- /tmp (temporary files)
- /var/run (PID files, sockets)
- /var/cache (application caches)
- /var/log (if logging to files instead of stdout)
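When manifests are generated programmatically, the matching volumeMounts and volumes entries can be derived from a list of writable paths. The helper below is a hypothetical sketch; the naming scheme (path separators replaced with dashes) is an assumption chosen for illustration, not a Kubernetes convention.

```python
def writable_mounts(paths):
    """Return (volumeMounts, volumes) lists giving each path its own emptyDir."""
    mounts, volumes = [], []
    for path in paths:
        # "/var/run" -> "var-run"; assumed naming scheme for this sketch.
        name = path.strip("/").replace("/", "-")
        mounts.append({"name": name, "mountPath": path})
        volumes.append({"name": name, "emptyDir": {}})
    return mounts, volumes


mounts, volumes = writable_mounts(["/tmp", "/var/run", "/var/cache"])
```

Deriving both lists from one input keeps the volume names in sync, which is the usual source of errors when the two sections are edited by hand.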
Migrating from PodSecurityPolicy
PodSecurityPolicy (PSP) was removed in Kubernetes v1.25. Migration to Pod Security Admission involves:
Key Differences
| Feature | PSP | PSA |
|---|---|---|
| Granularity | Per-policy with RBAC binding | Per-namespace with labels |
| Mutation | Can modify pod spec (add defaults) | No mutation, only validation |
| Custom rules | Flexible, arbitrary rules | Three fixed levels only |
| Scope | Cluster-wide with namespace binding | Per-namespace |
Migration Strategy
- Audit first: Set PSA to audit mode on all namespaces. Review the audit log for violations.
- Warn second: Switch to warn mode. Users see warnings but pods are not blocked.
- Enforce last: After confirming no critical violations, switch to enforce.
The demo namespace demonstrates this graduated approach:
```yaml
labels:
  pod-security.kubernetes.io/enforce: baseline   # Block the worst
  pod-security.kubernetes.io/warn: restricted    # Warn on non-ideal
  pod-security.kubernetes.io/audit: restricted   # Log everything
```
What PSA Cannot Replace
PSA has three fixed levels. It cannot:
- Define custom rules (PSP could allow specific host paths)
- Mutate pods (PSP could inject default seccomp profiles)
- Allow fine-grained per-workload exceptions
For custom rules, use external policy engines:
- OPA/Gatekeeper: General-purpose policy engine with Rego language
- Kyverno: Kubernetes-native policy engine with YAML policies
- Kubewarden: Policy engine using WebAssembly policies
Security Context Hierarchy
Security settings can be specified at two levels:
Pod level (spec.securityContext):
- runAsUser, runAsGroup, runAsNonRoot
- fsGroup, fsGroupChangePolicy
- seccompProfile
- supplementalGroups
- sysctls
Container level (spec.containers[*].securityContext):
- runAsUser, runAsGroup, runAsNonRoot
- readOnlyRootFilesystem
- allowPrivilegeEscalation
- capabilities
- privileged
- seccompProfile
- seLinuxOptions
Container-level settings override pod-level settings. In the demo, seccompProfile is set at the pod level (applies to all containers), while capabilities and readOnlyRootFilesystem are set per-container.
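The override rule can be modelled as a dictionary merge in which container values win and pod-only fields are not inherited. This is a simplified sketch: effective_context is a hypothetical name, POD_ONLY lists just the pod-only fields named above, and the runAsUser value of 2000 is an invented override, not something the demo sets.

```python
# Fields that exist only at pod level and therefore are not merged into containers.
POD_ONLY = {"fsGroup", "fsGroupChangePolicy", "supplementalGroups", "sysctls"}


def effective_context(pod_sc: dict, container_sc: dict) -> dict:
    """Merge pod-level and container-level settings; container values take precedence."""
    merged = {key: value for key, value in pod_sc.items() if key not in POD_ONLY}
    merged.update(container_sc)
    return merged


pod_sc = {
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "fsGroup": 1000,
    "seccompProfile": {"type": "RuntimeDefault"},
}
# Hypothetical container that overrides the pod-level user.
container_sc = {"allowPrivilegeEscalation": False, "runAsUser": 2000}
effective = effective_context(pod_sc, container_sc)
```

Here the container runs as UID 2000 while still inheriting runAsNonRoot and the RuntimeDefault seccomp profile from the pod.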
Production Security Checklist
For every production workload:
- runAsNonRoot: true at pod level
- runAsUser: <non-zero> at pod level
- allowPrivilegeEscalation: false on every container
- readOnlyRootFilesystem: true with writable emptyDir mounts
- capabilities.drop: [ALL] on every container
- seccompProfile.type: RuntimeDefault at pod level
- Resource limits set (prevents DoS via resource exhaustion)
- No hostNetwork, hostPID, hostIPC
- No hostPath volumes (use PVC instead)
- Images scanned for vulnerabilities
- Images from trusted registries only
- Namespace enforces at least baseline PSS