Pod Security: Deep Dive

This document explains Linux capabilities, seccomp profiles, AppArmor, SELinux contexts, Pod Security Admission, Pod Security Standards, and the security settings that harden containers against privilege escalation and breakout.

Why Running as Root Is Dangerous

Unless the image sets a USER or the pod sets runAsUser, a container’s main process runs as root (UID 0). This is dangerous for several reasons:

If an attacker exploits a vulnerability in the application, they gain root inside the container.
Root inside the container can exploit kernel vulnerabilities to escape to the host.
Root can modify the container filesystem, replace binaries, and install malware.
With certain capabilities, root can access host resources (network, devices, processes).

The demo shows the difference. The insecure pod runs as root:

apiVersion: v1
kind: Pod
metadata:
  name: insecure-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        runAsUser: 0
        privileged: false

Running id in this pod returns uid=0(root). The application has far more privileges than it needs.

Linux Capabilities

Linux capabilities break the monolithic “root” privilege into discrete units. Instead of being all-or-nothing root, a process can have specific capabilities.

Default Capabilities in Docker/Containerd

When a container starts, the container runtime grants a default set of capabilities:

Capability	What It Allows
`CAP_CHOWN`	Change file ownership
`CAP_DAC_OVERRIDE`	Bypass file permission checks
`CAP_FSETID`	Keep setuid/setgid bits when a file is modified
`CAP_FOWNER`	Bypass ownership checks
`CAP_MKNOD`	Create special files
`CAP_NET_RAW`	Use raw sockets (ping, packet sniffing)
`CAP_SETGID`	Change process group ID
`CAP_SETUID`	Change process user ID
`CAP_SETFCAP`	Set file capabilities
`CAP_SETPCAP`	Adjust the process’s capability bounding and inheritable sets
`CAP_NET_BIND_SERVICE`	Bind to ports below 1024
`CAP_SYS_CHROOT`	Use chroot
`CAP_KILL`	Send signals to processes owned by other users
`CAP_AUDIT_WRITE`	Write to kernel audit log
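
Inside a running container, the capability sets of a process appear as hexadecimal bitmasks in /proc/<pid>/status (the CapEff field is the effective set). The following is a minimal sketch of decoding such a mask into names. The helper name decode_caps is hypothetical, and the sample mask is the value commonly seen for a root process under Docker's default set, so treat it as an assumption for other runtimes.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex):
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


# Assumed sample: CapEff as typically reported for root under Docker defaults.
DEFAULT_MASK = "00000000a80425fb"
default_caps = decode_caps(DEFAULT_MASK)
print(len(default_caps))
print(default_caps)
```

Decoding the sample mask yields the 14 capabilities listed in the table above. A mask of all zeros means the process holds no capabilities at all.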

Dangerous Capabilities

These capabilities are not granted by default but are particularly dangerous if added:

Capability	Risk
`CAP_SYS_ADMIN`	Near-root. Mount filesystems, configure namespaces, and perform many other administrative operations. The most dangerous single capability.
`CAP_SYS_PTRACE`	Trace and inspect any process. Can read secrets from other processes’ memory.
`CAP_NET_ADMIN`	Modify network configuration, routing tables, firewall rules.
`CAP_SYS_RAWIO`	Direct I/O access to hardware. Can read/write raw disk.
`CAP_SYS_MODULE`	Load and unload kernel modules. Full kernel code execution.
`CAP_SYS_BOOT`	Reboot the system.
`CAP_DAC_READ_SEARCH`	Read any file regardless of permissions.
`CAP_SYS_TIME`	Set the system clock. Can break TLS, Kerberos, and time-based security.

Dropping All Capabilities

The secure pod in the demo drops everything:

securityContext:
  capabilities:
    drop:
      - ALL

This removes all 14 default capabilities. The process can only do what an unprivileged user can do. If the application needs a specific capability (like binding to port 80), add it back individually:

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE

The principle: drop ALL, add back only what is needed. Never add capabilities speculatively.
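
The drop-then-add behavior can be modeled in a few lines. This is a simplified sketch, not the runtime's actual implementation: the function name effective_caps is hypothetical, the default set is taken from the table earlier in this document, and the special case of add: [ALL] is deliberately left out.

```python
# Default set as listed in this document (names without the CAP_ prefix,
# which is how Kubernetes manifests spell them).
DEFAULT_CAPS = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW",
    "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
    "SYS_CHROOT", "KILL", "AUDIT_WRITE",
})


def effective_caps(drop=(), add=(), defaults=DEFAULT_CAPS):
    """Simplified model: remove dropped capabilities, then apply additions."""
    if "ALL" in drop:
        caps = set()
    else:
        caps = set(defaults) - set(drop)
    return caps | set(add)


print(sorted(effective_caps(drop=["ALL"])))
print(sorted(effective_caps(drop=["ALL"], add=["NET_BIND_SERVICE"])))
```

With drop: [ALL] alone the resulting set is empty. Adding NET_BIND_SERVICE back leaves exactly one capability, which is the smallest set that still allows binding to port 80.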

Seccomp Profiles

Seccomp (Secure Computing) restricts which system calls a process can make. A system call is how a user-space process asks the kernel to do something (read a file, open a network connection, allocate memory).

RuntimeDefault Profile

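The RuntimeDefault seccomp profile is the container runtime’s built-in default. Docker’s default profile blocks approximately 44 of the 300+ available system calls, including dangerous calls like reboot, mount, and clock_settime (and ptrace on kernels older than 4.8), while allowing common operations.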

securityContext:
  seccompProfile:
    type: RuntimeDefault

The demo’s secure pod uses this:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

This is set at the pod level, applying to all containers. Container-level seccompProfile overrides the pod-level setting.

Unconfined Profile

seccompProfile:
  type: Unconfined

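No restrictions. All system calls are allowed. In Kubernetes this is the default when no profile is specified, unless the kubelet is started with --seccomp-default (stable since v1.27), which applies RuntimeDefault instead.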

Custom Profiles

For stricter security, create custom seccomp profiles that only allow the exact system calls your application needs:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "mprotect", "munmap", "brk", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The default action is ERRNO (return an error). Only explicitly listed syscalls are allowed. This is extremely restrictive and requires knowing exactly which system calls your application makes.

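Tools like strace can record the system calls an application makes during testing, and the oci-seccomp-bpf-hook can generate a profile automatically from such a recording. A profile built this way only covers the code paths exercised during the test run.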
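
A recorded trace can be turned into an allow-list profile with a small script. The sketch below assumes plain strace text output (optionally with the pid prefix that strace -f adds); the function name profile_from_strace and the sample trace lines are hypothetical. The result is a starting point only, since the container runtime itself may need further system calls to start the process.

```python
import json
import re

# Syscall name at the start of a line, with an optional "[pid 123] " or
# "123 " prefix as produced by strace -f.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def profile_from_strace(lines):
    """Build an allow-list seccomp profile from strace output lines."""
    names = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal and exit lines do not match and are skipped
            names.add(match.group(1))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical trace excerpt used only for illustration.
sample_trace = [
    'execve("/bin/sleep", ["sleep", "1"], 0x7ffc0a1b2c30 /* 10 vars */) = 0',
    "brk(NULL) = 0x55d0c8a3b000",
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '[pid 42] read(3, "x", 1) = 1',
    "--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---",
    "+++ exited with 0 +++",
]

profile = profile_from_strace(sample_trace)
print(json.dumps(profile, indent=2))
```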

Localhost Profiles

Custom profiles are loaded from the node filesystem:

seccompProfile:
  type: Localhost
  localhostProfile: profiles/my-app.json

With the default kubelet root directory, the file must exist at /var/lib/kubelet/seccomp/profiles/my-app.json on every node that may run the pod. The Security Profiles Operator can manage these profiles as Kubernetes resources.

AppArmor Profiles

AppArmor is a Linux Security Module that confines programs to a limited set of resources. It is path-based: rules specify which files and directories a process can access.

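AppArmor profiles have traditionally been specified as annotations keyed by container name. Kubernetes v1.30 added a securityContext.appArmorProfile field, and the annotations are deprecated in favor of it: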

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: runtime/default

Profile types:

runtime/default: Default container profile provided by the runtime
localhost/<profile-name>: Custom profile loaded on the node
unconfined: No restrictions

AppArmor is enabled by default on Debian/Ubuntu and SUSE systems, while RHEL/Fedora systems use SELinux instead. Both provide mandatory access control, but AppArmor rules are path-based whereas SELinux rules are label-based.
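
The annotation format encodes the container name in the key and the profile in the value. As a rough sketch of how a tool might read it, the hypothetical helper parse_apparmor_annotations below maps each container to a (profile type, profile name) pair, following the three profile types listed above.

```python
ANNOTATION_PREFIX = "container.apparmor.security.beta.kubernetes.io/"


def parse_apparmor_annotations(annotations):
    """Map container name -> (profile type, localhost profile name or None)."""
    result = {}
    for key, value in annotations.items():
        if not key.startswith(ANNOTATION_PREFIX):
            continue  # unrelated annotation
        container = key[len(ANNOTATION_PREFIX):]
        if value in ("runtime/default", "unconfined"):
            result[container] = (value, None)
        elif value.startswith("localhost/"):
            result[container] = ("localhost", value[len("localhost/"):])
        else:
            raise ValueError(f"unknown AppArmor profile {value!r}")
    return result


# Hypothetical annotations for a pod with two containers.
example = {
    "container.apparmor.security.beta.kubernetes.io/app": "runtime/default",
    "container.apparmor.security.beta.kubernetes.io/sidecar": "localhost/my-profile",
    "example.com/owner": "team-a",
}
parsed = parse_apparmor_annotations(example)
print(parsed)
```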

SELinux Contexts

SELinux (Security-Enhanced Linux) provides mandatory access control. Every process and file has a security label (context). The kernel checks these labels on every access.

securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"    # MCS label
    role: "system_r"
    type: "container_t"
    user: "system_u"

In OpenShift, SELinux is enforced by default. Containers run with the container_t type, which restricts:

Filesystem access to the container’s own files
Network access to assigned ports
Inter-process communication to the container’s processes

The Multi-Category Security (MCS) label (s0:c123,c456) isolates containers from each other. Each container gets a unique MCS label. Container A cannot read files with container B’s MCS label, even if the file permissions would normally allow it.
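
The isolation rule can be pictured as a set comparison: a process may access an object only if the process's categories include all of the object's categories. The sketch below is a simplified model of that dominance check, not the SELinux policy itself. The function names are hypothetical, sensitivities are compared for equality only, and category ranges such as c0.c1023 are not handled.

```python
def parse_level(level):
    """Split 's0:c123,c456' into ('s0', frozenset({123, 456}))."""
    sensitivity, _, cats = level.partition(":")
    categories = frozenset(int(c[1:]) for c in cats.split(",") if c)
    return sensitivity, categories


def dominates(subject_level, object_level):
    """True if the subject's categories are a superset of the object's."""
    s_sens, s_cats = parse_level(subject_level)
    o_sens, o_cats = parse_level(object_level)
    return s_sens == o_sens and s_cats >= o_cats


container_a = "s0:c123,c456"
container_b = "s0:c7,c89"
print(dominates(container_a, container_a))  # own files
print(dominates(container_a, container_b))  # another container's files
```

In this model container A can reach its own files but not files labeled for container B, regardless of the UNIX permission bits.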

Pod Security Admission

Pod Security Admission (PSA) is the built-in admission controller that enforces Pod Security Standards. It replaced the deprecated PodSecurityPolicy (PSP) in Kubernetes v1.25.

Configuration via Namespace Labels

PSA is configured per namespace using labels:

apiVersion: v1
kind: Namespace
metadata:
  name: security-demo
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

Three Modes

Mode	Behavior
`enforce`	Violations are rejected. Pod is not created.
`warn`	Violations trigger a warning in the API response. Pod is created.
`audit`	Violations are logged to the audit log. Pod is created.

The demo uses enforce: baseline together with warn: restricted and audit: restricted. This means:

Pods violating baseline standards are blocked (privileged containers, hostNetwork, etc.)
Pods violating restricted standards get a warning but are still created
Violations of the restricted standards are also recorded in the audit log
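
How the three modes combine can be sketched as a small decision function. This is an illustrative model rather than the admission controller's code: the name psa_outcome is hypothetical, and pod_level stands for the strictest level the pod satisfies, which the real controller derives by evaluating the pod spec. A mode with no label is treated as privileged, which is the PSA default.

```python
LEVELS = ["privileged", "baseline", "restricted"]  # least to most strict
PREFIX = "pod-security.kubernetes.io/"


def psa_outcome(namespace_labels, pod_level):
    """Return what each PSA mode does for a pod that satisfies pod_level."""
    def violates(mode):
        required = namespace_labels.get(PREFIX + mode, "privileged")
        return LEVELS.index(pod_level) < LEVELS.index(required)

    return {
        "admitted": not violates("enforce"),
        "warning": violates("warn"),
        "audit_annotation": violates("audit"),
    }


demo_labels = {
    PREFIX + "enforce": "baseline",
    PREFIX + "warn": "restricted",
    PREFIX + "audit": "restricted",
}

print(psa_outcome(demo_labels, "privileged"))  # e.g. a privileged container
print(psa_outcome(demo_labels, "baseline"))    # e.g. a plain root container
print(psa_outcome(demo_labels, "restricted"))  # e.g. a fully hardened pod
```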

Version Pinning

You can pin PSA to a specific Kubernetes version:

labels:
  pod-security.kubernetes.io/enforce: baseline
  pod-security.kubernetes.io/enforce-version: v1.28

This is important for clusters that upgrade frequently. Without pinning, the version defaults to latest, so the policy definitions can change when the cluster is upgraded, potentially breaking workloads.

Pod Security Standards (PSS) Levels

Privileged

No restrictions. Allows everything. Use for system-level workloads like CNI plugins, storage drivers, and monitoring agents that need host access.

Baseline

Blocks known privilege escalations while remaining broadly compatible. Prevents:

Privileged containers (privileged: true)
Host namespaces (hostNetwork, hostPID, hostIPC)
Host ports (disallowed, or restricted to a known list)
hostPath volumes
Adding capabilities beyond the runtime default set (SYS_ADMIN, etc.)
Explicitly setting the seccomp profile to Unconfined
Custom SELinux types, users, or roles

The demo’s privileged pod is blocked by baseline:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: security-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true       # <-- Blocked by baseline
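
A few of the baseline controls can be expressed as checks over a pod manifest loaded into a dictionary. The function below is a partial sketch under that assumption: the name baseline_violations is hypothetical, only four controls are covered, and init containers and ephemeral containers are ignored.

```python
def baseline_violations(pod):
    """List violations of a subset of the baseline controls for a pod dict."""
    spec = pod.get("spec", {})
    problems = []
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(field):
            problems.append(f"{field} is enabled")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name')} uses hostPath")
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}
        if sc.get("privileged"):
            problems.append(f"container {name} is privileged")
        for port in container.get("ports", []):
            if port.get("hostPort"):
                problems.append(f"container {name} uses hostPort {port['hostPort']}")
    return problems


privileged_pod = {
    "spec": {"containers": [{"name": "app", "securityContext": {"privileged": True}}]}
}
root_pod = {
    "spec": {"containers": [
        {"name": "app", "securityContext": {"runAsUser": 0, "privileged": False}}
    ]}
}

print(baseline_violations(privileged_pod))
print(baseline_violations(root_pod))
```

The second result is empty: running as root is not a baseline violation, which is why the demo's insecure pod is admitted under enforce: baseline.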

Restricted

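The strictest level. It includes every baseline control and additionally requires:

Non-root user (runAsNonRoot: true)
All capabilities dropped (only NET_BIND_SERVICE may be added back)
allowPrivilegeEscalation: false
Seccomp profile set (RuntimeDefault or Localhost)
Volume types limited to configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, and secret

The demo’s secure pod satisfies all restricted requirements and adds readOnlyRootFilesystem: true, which the restricted level does not require but is recommended hardening: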

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  namespace: security-demo
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}
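
The securityContext requirements of the restricted level lend themselves to a simple pre-submission check. The sketch below covers the non-root, privilege escalation, capability, and seccomp requirements for a pod manifest held in a dictionary. The function name restricted_violations is hypothetical, and the check is not a substitute for the admission controller.

```python
def restricted_violations(pod):
    """List violations of a subset of the restricted checks for a pod dict."""
    spec = pod.get("spec", {})
    pod_sc = spec.get("securityContext") or {}
    problems = []
    for container in spec.get("containers", []):
        sc = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")
        # A container-level value takes precedence over the pod-level value.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        caps = sc.get("capabilities") or {}
        if "ALL" not in caps.get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        if set(caps.get("add", [])) - {"NET_BIND_SERVICE"}:
            problems.append(f"{name}: only NET_BIND_SERVICE may be added")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems


# The demo's secure pod, reduced to the fields the check reads.
secure_pod = {
    "spec": {
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1000,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [{
            "name": "app",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "readOnlyRootFilesystem": True,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    }
}

print(restricted_violations(secure_pod))
```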

allowPrivilegeEscalation Mechanics

The allowPrivilegeEscalation field controls the Linux no_new_privs flag. When set to false:

The no_new_privs flag is set on the process.
Setuid and setgid binaries are neutralized. A binary with the setuid bit cannot gain elevated privileges.
An execve call cannot grant more privileges than the calling process already has.

This is important because even if you run as a non-root user, a setuid binary like sudo or ping could escalate back to root. allowPrivilegeEscalation: false prevents this.

Dropping capabilities does not change this field. When allowPrivilegeEscalation is unset, escalation remains allowed, and the setting is always effectively true for containers that are privileged or hold CAP_SYS_ADMIN. Set it to false explicitly on every container.
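
On reasonably recent kernels the no_new_privs state of a process is visible as the NoNewPrivs field in /proc/<pid>/status, next to the Seccomp field (0 disabled, 1 strict, 2 filter). The sketch below parses that format. The helper name read_status_field and the sample text are hypothetical, and a sample string is used so the example does not depend on running inside a container.

```python
def read_status_field(status_text, field):
    """Return the value of one field from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == field:
            return value.strip()
    return None


# Hypothetical excerpt of /proc/1/status from a hardened container.
sample_status = (
    "Name:\tsleep\n"
    "Uid:\t1000\t1000\t1000\t1000\n"
    "CapEff:\t0000000000000000\n"
    "NoNewPrivs:\t1\n"
    "Seccomp:\t2\n"
)

print(read_status_field(sample_status, "NoNewPrivs"))
print(read_status_field(sample_status, "Seccomp"))

# Inside a real container, the same function could be applied to the live file:
#   with open("/proc/self/status") as f:
#       print(read_status_field(f.read(), "NoNewPrivs"))
```

A value of 1 for NoNewPrivs indicates that allowPrivilegeEscalation: false took effect for that process.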

readOnlyRootFilesystem Patterns

Setting readOnlyRootFilesystem: true makes the container’s root filesystem read-only:

securityContext:
  readOnlyRootFilesystem: true

This prevents:

Attackers from writing malware to the filesystem
Applications from modifying their own binaries
Unintended filesystem modifications

But many applications need to write temporary files, logs, or caches. The pattern is to mount writable volumes for specific paths:

containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp
      - name: cache
        mountPath: /var/cache
      - name: run
        mountPath: /var/run
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}

The demo shows this pattern with /tmp mounted as an emptyDir:

volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}

Common paths that need to be writable:

/tmp (temporary files)
/var/run (PID files, sockets)
/var/cache (application caches)
/var/log (if logging to files instead of stdout)
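
When many workloads need the same treatment, the volumeMounts and volumes entries can be generated from a list of writable paths. The helper below is a sketch with a hypothetical name, writable_mounts. It derives each volume name from the path, which assumes the paths contain only lowercase letters, digits, hyphens, and slashes so that the result is a valid volume name.

```python
def writable_mounts(paths):
    """Build matching volumeMounts and emptyDir volumes for writable paths."""
    mounts, volumes = [], []
    for path in paths:
        # "/var/cache" becomes "var-cache"; assumes DNS-label-safe characters.
        name = path.strip("/").replace("/", "-")
        mounts.append({"name": name, "mountPath": path})
        volumes.append({"name": name, "emptyDir": {}})
    return mounts, volumes


mounts, volumes = writable_mounts(["/tmp", "/var/cache", "/var/run"])
print(mounts)
print(volumes)
```

The two lists can be merged into the container and pod spec respectively before the manifest is serialized.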

Migrating from PodSecurityPolicy

PodSecurityPolicy (PSP) was deprecated in Kubernetes v1.21 and removed in v1.25. Migrating to Pod Security Admission starts with understanding how the two differ, then rolling out PSA in stages.

Key Differences

Feature	PSP	PSA
Granularity	Per-policy with RBAC binding	Per-namespace with labels
Mutation	Can modify pod spec (add defaults)	No mutation, only validation
Custom rules	Flexible, arbitrary rules	Three fixed levels only
Scope	Cluster-wide with namespace binding	Per-namespace

Migration Strategy

Audit first: Set PSA to audit mode on all namespaces. Review the audit log for violations.
Warn second: Add warn mode alongside audit. Users see warnings but pods are not blocked.
Enforce last: After confirming no critical violations, switch to enforce.

The demo namespace demonstrates this graduated approach:

labels:
  pod-security.kubernetes.io/enforce: baseline    # Block the worst
  pod-security.kubernetes.io/warn: restricted     # Warn on non-ideal
  pod-security.kubernetes.io/audit: restricted    # Log everything

What PSA Cannot Replace

PSA has three fixed levels. It cannot:

Define custom rules (PSP could allow specific host paths)
Mutate pods (PSP could inject default seccomp profiles)
Allow fine-grained per-workload exceptions

For custom rules, use external policy engines:

OPA/Gatekeeper: General-purpose policy engine with Rego language
Kyverno: Kubernetes-native policy engine with YAML policies
Kubewarden: Policy engine using WebAssembly policies

Security Context Hierarchy

Security settings can be specified at two levels:

Pod level (spec.securityContext):

runAsUser, runAsGroup, runAsNonRoot
fsGroup, fsGroupChangePolicy
seccompProfile, seLinuxOptions
supplementalGroups
sysctls

Container level (spec.containers[*].securityContext):

runAsUser, runAsGroup, runAsNonRoot
readOnlyRootFilesystem
allowPrivilegeEscalation
capabilities
privileged
seccompProfile
seLinuxOptions

For fields that can be set at both levels, the container-level value overrides the pod-level value. In the demo, seccompProfile is set at the pod level (applies to all containers), while capabilities and readOnlyRootFilesystem are set per-container.
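
The override rule amounts to a shallow merge in which pod-level values act as defaults for the fields both levels share. The following sketch illustrates this. The function name effective_context is hypothetical, and the list of shared fields is limited to runAsUser, runAsGroup, runAsNonRoot, seccompProfile, and seLinuxOptions.

```python
# Fields that may appear in both the pod and the container securityContext.
SHARED_FIELDS = (
    "runAsUser", "runAsGroup", "runAsNonRoot", "seccompProfile", "seLinuxOptions",
)


def effective_context(pod_sc, container_sc):
    """Merge contexts: pod values are defaults, container values win."""
    effective = {f: pod_sc[f] for f in SHARED_FIELDS if f in pod_sc}
    effective.update(container_sc)
    return effective


pod_sc = {
    "runAsUser": 1000,
    "fsGroup": 1000,  # pod-only field, not inherited by the container context
    "seccompProfile": {"type": "RuntimeDefault"},
}
container_sc = {
    "runAsUser": 2000,
    "capabilities": {"drop": ["ALL"]},
}

merged = effective_context(pod_sc, container_sc)
print(merged)
```

Here the container runs as UID 2000 because its own runAsUser wins, while it inherits the pod's seccomp profile because it sets none of its own.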

Production Security Checklist

For every production workload:

runAsNonRoot: true at pod level
runAsUser: <non-zero> at pod level
allowPrivilegeEscalation: false on every container
readOnlyRootFilesystem: true with writable emptyDir mounts
capabilities.drop: [ALL] on every container
seccompProfile.type: RuntimeDefault at pod level
Resource limits set (prevents DoS via resource exhaustion)
No hostNetwork, hostPID, hostIPC
No hostPath volumes (use PVC instead)
Images scanned for vulnerabilities
Images from trusted registries only
Namespace enforces at least baseline PSS