Hello Security: Deep Dive

Why Secure Defaults Matter

By default, containers run with far more privileges than they need. The official nginx image runs as root, binds to port 80, and has write access to its entire filesystem. This is convenient for getting started but dangerous in production.

Before Kubernetes had Pod Security Standards, every team reinvented security baselines. Some teams ran everything as root because it was easier. Others locked down containers but missed important fields. Security became inconsistent across the organization.

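This demo shows five essential security settings that every production container should have. These are not optional hardening steps. They are the baseline. The Restricted Pod Security Standard requires four of them: running as non-root, disallowing privilege escalation, dropping all capabilities, and applying a seccomp profile. A read-only root filesystem is not part of the standard, but it belongs in the same baseline.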

The demo deployment passed the Restricted standard in testing. Two pods started in under 3 seconds. The service responded normally. The security settings did not break functionality. They just removed privileges the application never needed.

The Container Attack Surface

Understanding why these settings matter requires understanding how container escapes work.

Containers Are Not Virtual Machines

Containers share the host kernel. A container process is just a Linux process with namespaces and cgroups applied. If a container runs as root (UID 0) and user namespaces are not in use, it is the same root user as on the host. The only thing separating it from the host is the namespace isolation.

Namespace isolation is strong, but not perfect. Kernel vulnerabilities sometimes allow containers to break out. Many of these exploits require specific conditions:

Container runs as root
Container has dangerous capabilities (SYS_ADMIN, SYS_PTRACE)
Container can make dangerous syscalls (mount, ptrace, reboot)
Container can modify its own filesystem

The demo’s security settings eliminate these conditions. An attacker who compromises the nginx process gets a non-root shell in a read-only filesystem with no capabilities and limited syscalls. Privilege escalation becomes extremely difficult.
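Taken together, the settings discussed in the rest of this document might look like the following in a single Deployment. This is a sketch assembled from the fragments shown later, not a copy of manifests/deployment.yaml. The name and label hello-security are placeholders.

```yaml
# Sketch only: metadata names and labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-security
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-security
  template:
    metadata:
      labels:
        app: hello-security
    spec:
      securityContext:              # pod level: identity and seccomp
        runAsNonRoot: true
        runAsUser: 101
        runAsGroup: 101
        fsGroup: 101
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:1.25-alpine
          ports:
            - containerPort: 8080
          securityContext:          # container level: filesystem and privileges
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: cache
              mountPath: /var/cache/nginx
            - name: run
              mountPath: /var/run
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: cache
          emptyDir: {}
        - name: run
          emptyDir: {}
        - name: tmp
          emptyDir: {}
```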

How Each Setting Protects You

runAsNonRoot and runAsUser

# From manifests/deployment.yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101
    runAsGroup: 101
    fsGroup: 101

The runAsNonRoot field is a safety check. It tells the kubelet to refuse to start the container if it would run as UID 0. The API server still accepts the pod, but the container never starts. This prevents accidental deployments of root containers.

The runAsUser field explicitly sets the UID to 101. This is the nginx user in the nginxinc/nginx-unprivileged image.

Why UID 101 specifically? Container images define default users. The unprivileged nginx image creates a user named nginx with UID 101. We match that UID in the securityContext. If the values conflict (image expects UID 101 but we set UID 1000), the container starts as UID 1000 but may fail if it tries to read files owned by UID 101.
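As a rough illustration of the safety check, the fragment below sets runAsNonRoot without a runAsUser and points at an image whose default user is root. The pod name is a placeholder, and the error text in the comment is what the kubelet typically reports, not output captured from this demo.

```yaml
# Hypothetical pod used only to illustrate the check.
apiVersion: v1
kind: Pod
metadata:
  name: root-image-check
spec:
  securityContext:
    runAsNonRoot: true           # no runAsUser, so the image's default user applies
  containers:
    - name: nginx
      image: nginx:1.25-alpine   # official image, default user is root
# Typical result: the container stays in CreateContainerConfigError with a
# message along the lines of
# "container has runAsNonRoot and image will run as root".
```

Adding runAsUser: 101 would satisfy the check, but the official image may then fail for other reasons, which is why the demo uses the unprivileged variant.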

What this blocks:

Running as non-root limits the blast radius of code execution. Many container escapes rely on being root inside the container. The runc exploit CVE-2019-5736 allowed attackers to overwrite the host runc binary, but only if the process in the container ran as root. Containers running as a non-root user could not use this path.

Even without kernel exploits, root can do more damage. Root can read sensitive files (if they leak into the container via misconfigured volumes), modify application code, and install malware.

readOnlyRootFilesystem with Writable Volumes

# From manifests/deployment.yaml
containers:
  - name: nginx
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: cache
        mountPath: /var/cache/nginx
      - name: run
        mountPath: /var/run
      - name: tmp
        mountPath: /tmp
volumes:
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
  - name: tmp
    emptyDir: {}

A read-only root filesystem prevents all writes except to mounted volumes. The container image layers are immutable. An attacker cannot:

Install backdoors or malware
Modify the nginx binary
Overwrite configuration files
Create persistence mechanisms
Tamper with logs (if written to the filesystem instead of stdout)

But nginx needs to write temporary files and its PID file. We mount emptyDir volumes at three paths:

/var/cache/nginx: Proxy cache, fastcgi temp files, client body temp files
/var/run: nginx.pid file
/tmp: General temporary files

These volumes are ephemeral. Their contents persist across container restarts within the same pod, but they disappear when the pod is deleted. An attacker can write to them, but the changes do not survive the pod being replaced.

What this blocks:

Many persistence techniques rely on writing to the filesystem. An attacker might replace a binary with a trojaned version, create a cron job, or write a startup script that runs on reboot. A read-only root filesystem blocks these techniques everywhere except the mounted emptyDir paths.

The security audit flagged that our emptyDir volumes lack sizeLimit. Without a limit, an attacker who gains code execution could fill the volume and exhaust node disk space. In production, add limits:

volumes:
  - name: cache
    emptyDir:
      sizeLimit: 100Mi
  - name: run
    emptyDir:
      sizeLimit: 10Mi
  - name: tmp
    emptyDir:
      sizeLimit: 50Mi

These limits prevent a compromised container from impacting other workloads on the node.

allowPrivilegeEscalation

# From manifests/deployment.yaml
securityContext:
  allowPrivilegeEscalation: false

This field controls the Linux no_new_privs flag. When it is set to false, the flag is set on the container process, and the kernel then prevents that process and its children from gaining more privileges than they started with.

Specifically, this neutralizes setuid and setgid binaries. A setuid binary runs with the file owner’s privileges instead of the caller’s privileges. The classic example is /usr/bin/sudo, which is owned by root with the setuid bit set. When a normal user runs sudo, the process gains root privileges.

With allowPrivilegeEscalation: false, setuid and setgid bits are ignored. Even if an attacker finds a setuid binary in the container image (perhaps left over from the base image), they cannot use it to escalate to root.

What this blocks:

Setuid binaries are a common privilege escalation vector. Vulnerabilities in sudo, su, passwd, and other setuid programs have been used to gain root access for decades.

In containers, these binaries should not exist. But base images sometimes include them. allowPrivilegeEscalation: false ensures they cannot be abused.

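Kubernetes does not derive this field from your capability settings. When it is left unset, no_new_privs is not applied, so privilege escalation through setuid binaries stays possible even after dropping all capabilities. The field is also always treated as true for containers that are privileged or have CAP_SYS_ADMIN. Set it to false explicitly.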

capabilities.drop: [ALL]

# From manifests/deployment.yaml
securityContext:
  capabilities:
    drop:
      - ALL

Linux capabilities divide root’s powers into discrete units. Instead of all-or-nothing root, a process can have specific capabilities.

By default, container runtimes grant a set of capabilities. The exact set varies by runtime, but it typically includes:

Capability	What It Allows
`CAP_CHOWN`	Change file ownership
`CAP_NET_BIND_SERVICE`	Bind to ports below 1024
`CAP_SETUID` / `CAP_SETGID`	Change process UID/GID
`CAP_DAC_OVERRIDE`	Bypass file permission checks
`CAP_KILL`	Send signals to other processes
`CAP_NET_RAW`	Use raw sockets (ping, packet sniffing)

Dropping ALL removes even these defaults. The process has no special powers. It can only do what an unprivileged user can do.

When to add capabilities back:

Some applications need specific capabilities. A DNS server might need NET_BIND_SERVICE to listen on port 53. A monitoring tool might need NET_RAW for packet capture.

In those cases, drop all and add back only what is required:

capabilities:
  drop:
    - ALL
  add:
    - NET_BIND_SERVICE

Never add capabilities speculatively. Only add them when the application fails without them, and only add the minimum needed.

What this blocks:

Many capabilities enable dangerous operations. CAP_SYS_ADMIN is near-root power (mount filesystems, configure namespaces, and much more). CAP_SYS_PTRACE allows tracing other processes and reading their memory, which can leak secrets. CAP_NET_ADMIN allows modifying firewall rules and network routing.

The default capabilities are less dangerous but still powerful. CAP_NET_RAW enables packet sniffing. CAP_SETUID allows changing to any user. Removing them shrinks the attack surface.

seccompProfile: RuntimeDefault

# From manifests/deployment.yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Seccomp (secure computing mode) restricts which system calls a process can make. A system call is how user-space code asks the kernel to do something (open a file, allocate memory, create a network socket).

Linux has over 300 syscalls. Most applications use fewer than 100. The RuntimeDefault profile is the one shipped with the container runtime, so the exact list varies. Docker's default profile, for example, blocks around 44 syscalls, including:

reboot, swapon, swapoff: System-level operations
mount, umount, pivot_root: Filesystem manipulation
ptrace: Process tracing and debugging (blocked on kernels older than 4.8 in Docker's profile)
clock_settime: Changing the system clock
kexec_load: Loading a new kernel

These syscalls are rarely needed in containers, and several of them appear in container escape exploits.

Custom seccomp profiles:

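For even stricter security, create custom profiles that allow only the syscalls your application uses. Tools like strace can record syscalls during testing. Run nginx in the foreground under strace, exercise it with representative traffic, then stop it so that strace writes its summary:

strace -c -f -S name -o /tmp/nginx-syscalls.txt nginx -g 'daemon off;'
tail -n +3 /tmp/nginx-syscalls.txt | head -n -2 | awk '{print $NF}'

The second command prints the list of syscalls nginx made. Build a seccomp profile that allows only those calls.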

The Security Profiles Operator can manage custom profiles as Kubernetes resources, making them easier to deploy and update.
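Once a custom profile exists, a container can reference it with the Localhost type. A minimal sketch, assuming a profile file named profiles/nginx.json (a hypothetical name) has already been placed under the kubelet's seccomp directory on every node:

```yaml
securityContext:
  seccompProfile:
    type: Localhost
    # Path is relative to the kubelet's seccomp root
    # (by default /var/lib/kubelet/seccomp). The file name is an assumption.
    localhostProfile: profiles/nginx.json
```

If the file is missing on a node, the container will not start there, so distributing the profile is part of the rollout.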

What this blocks:

Many kernel exploits rely on obscure syscalls. The ptrace syscall allows inspecting and modifying other processes, which can leak secrets. The mount syscall is used in some container escape techniques. Blocking these syscalls closes attack vectors.

Why Use nginxinc/nginx-unprivileged?

The official nginx image has two problems for secure deployments:

Runs as root by default: The master process runs as UID 0. Worker processes run as the nginx user, but the master needs root to bind to port 80.
Listens on port 80: Ports below 1024 require the CAP_NET_BIND_SERVICE capability (or root).

The nginxinc/nginx-unprivileged image solves both:

All processes run as UID 101 (nginx user)
Listens on port 8080 instead of 80
Designed to work with readOnlyRootFilesystem: true
Configuration files are in writable directories by default

This makes it compatible with the Restricted Pod Security Standard out of the box.

Other common images have unprivileged variants: bitnami/redis, bitnami/postgresql, bitnami/mongodb. Use these in production.

Security Context Hierarchy

Security settings can be specified at two levels:

Pod level (spec.securityContext):

runAsUser, runAsGroup, runAsNonRoot
fsGroup, fsGroupChangePolicy
seccompProfile
supplementalGroups, sysctls

Container level (spec.containers[*].securityContext):

runAsUser, runAsGroup, runAsNonRoot
readOnlyRootFilesystem
allowPrivilegeEscalation
capabilities
privileged
seccompProfile, seLinuxOptions, appArmorProfile

Container-level settings override pod-level settings. In the demo, seccompProfile is set at the pod level (applies to all containers), while capabilities and readOnlyRootFilesystem are set per-container.

Best practice: Set user and group at the pod level. Different containers in the same pod usually run as the same user. Set filesystem and capability restrictions per-container, since different containers may have different needs.
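The override behavior can be sketched with a two-container pod. The log-shipper container and its image are hypothetical and only serve to show a container-level runAsUser taking precedence over the pod-level value.

```yaml
# Illustrative fragment: the log-shipper container and its image are placeholders.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101            # default for every container in the pod
    seccompProfile:
      type: RuntimeDefault    # applies to both containers
  containers:
    - name: nginx
      image: nginxinc/nginx-unprivileged:1.25-alpine
      securityContext:        # inherits runAsUser 101 from the pod level
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
    - name: log-shipper
      image: example.com/log-shipper:1.0   # hypothetical image
      securityContext:
        runAsUser: 2000       # overrides the pod-level 101 for this container only
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
```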

The fsGroup Field

# From manifests/deployment.yaml
spec:
  securityContext:
    fsGroup: 101

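The fsGroup field sets the group ownership of mounted volumes. For volume types that support ownership management, Kubernetes changes the group of the volume's files to fsGroup when the volume is mounted and adds fsGroup to the supplementary groups of every container process. Processes in the pod can then read and write those files through group permissions.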

In the demo, the emptyDir volumes inherit fsGroup 101. The nginx process runs as UID 101, GID 101, so it can write to /var/cache/nginx and /var/run.

Without fsGroup, a volume might be owned by root (UID 0, GID 0) and not be writable by other users. This matters most for persistent volumes. In that case the nginx process could fail with permission denied errors when trying to write its PID file.

Security Considerations

The security audit found this demo meets the Restricted Pod Security Standard. All required controls are in place. The demo is a reference implementation of secure container defaults.

The audit flagged three informational items:

1. emptyDir Volumes Lack sizeLimit

The emptyDir volumes have no size limits. An attacker who gains code execution could fill these volumes and exhaust node disk space.

In production, add limits:

volumes:
  - name: cache
    emptyDir:
      sizeLimit: 100Mi

The right limit depends on the application. Nginx’s proxy cache can grow large if caching is enabled. A simple static file server needs very little space.

2. Image Uses Tag Instead of Digest Pinning

The deployment uses nginxinc/nginx-unprivileged:1.25-alpine. This is a mutable tag. The image behind the tag can change.

For maximum reproducibility and security, pin to a digest:

image: nginxinc/nginx-unprivileged:1.25-alpine@sha256:abc123...

Digest pinning ensures the exact same image runs every time. Tags like 1.25-alpine can be overwritten (accidentally or maliciously).

The trade-off: digest pinning makes updates harder. You must explicitly update the digest to get security patches. Many teams prefer tags with automated image scanning and update tools (like Renovate or Dependabot).

3. No NetworkPolicy Present

The demo has no NetworkPolicy. The nginx pods can send traffic to any destination and receive traffic from any source.

NetworkPolicies are out of scope for this demo. They are orthogonal to container security settings. But in production, apply a default-deny NetworkPolicy and explicitly allow only required traffic.
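For orientation, a default-deny policy is usually written as below. This is a generic sketch rather than part of the demo. The policy name is a placeholder, and it only takes effect if the cluster's network plugin enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # placeholder name
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
  # until additional policies explicitly allow it.
```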

See demo 19 for NetworkPolicy patterns.

What Would Change in Production

This demo focuses on container-level security. Production deployments need additional layers:

NetworkPolicy: Restrict ingress and egress traffic
Image scanning: Use Trivy Operator (demo 49) to detect vulnerabilities
Runtime monitoring: Use Falco (demo 50) to detect anomalous behavior
Policy enforcement: Use Kyverno (demo 42) to enforce these settings cluster-wide
Pod Security Standards: Enforce Restricted standard at the namespace level (demo 24)
Secret management: Use Vault (demo 28) or External Secrets (demo 29) instead of environment variables
TLS: Terminate TLS at the Ingress or use cert-manager (demo 05) for pod-to-pod encryption
Resource limits with monitoring: The demo has minimal limits. Production needs right-sized limits based on actual usage.
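As one example from the list above, namespace-level enforcement of the Restricted standard is configured with labels read by the built-in Pod Security admission controller. A minimal sketch, with hello-security as a placeholder namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: hello-security   # placeholder name
  labels:
    # Reject pods that violate the Restricted standard.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface warnings to the client on apply.
    pod-security.kubernetes.io/warn: restricted
```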

Trade-offs and Alternatives

Minimal Images vs Feature-Rich Images

Minimal images (alpine, distroless) have fewer packages and smaller attack surfaces. But they make debugging harder. Alpine ships only a minimal BusyBox shell and toolset, and distroless images have no shell, no package manager, and no standard tools at all.

Feature-rich images (ubuntu, debian) make debugging easier but include thousands of binaries you do not need. Each binary is a potential attack vector.

A middle ground: use alpine or distroless in production, but keep a full-featured image available for kubectl debug.

Init Containers for Setup

Some applications need to write configuration files or set up directories at startup. With readOnlyRootFilesystem: true, this fails.

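Solution: use an init container that does the setup and writes to a shared volume. The main container runs as non-root with a read-only filesystem and reads the configuration from the volume. The init container should run as non-root too where possible: the Restricted standard applies to init containers as well, so a root init container would be rejected in a namespace that enforces it.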
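A sketch of the init container pattern is shown below. The busybox image, the render-config name, and the generated nginx configuration are assumptions for illustration. The writable /tmp, /var/run, and /var/cache/nginx mounts from the demo are left out for brevity and would still be needed.

```yaml
# Sketch: image, names, and config content are illustrative assumptions.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101
    fsGroup: 101
  initContainers:
    - name: render-config
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          printf 'server {\n  listen 8080;\n  location / { return 200 "ok"; }\n}\n' > /work/default.conf
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
      volumeMounts:
        - name: conf
          mountPath: /work          # init container writes here
  containers:
    - name: nginx
      image: nginxinc/nginx-unprivileged:1.25-alpine
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
      volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d
          readOnly: true            # main container only reads the result
  volumes:
    - name: conf
      emptyDir: {}
```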

Service Mesh for mTLS

This demo has no encryption. Traffic between pods is plaintext. A service mesh (Istio, Linkerd) can add mutual TLS without changing application code. See demo 41 for Istio.

Common Pitfalls

Pitfall 1: Forgetting Writable Volumes

Setting readOnlyRootFilesystem: true without mounting writable volumes causes crashes. The application tries to write a temp file and fails with “Read-only file system.”

Check the application’s documentation or run it with strace to see which directories it writes to. Mount emptyDir volumes at those paths.

Pitfall 2: Using Root-Only Images

Some images cannot run as non-root without modification. They bind to privileged ports, install packages at startup, or expect to write to root-owned directories.

Solutions:

Find an unprivileged variant (nginx-unprivileged, bitnami/redis, etc.)
Rebuild the image to run as non-root
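Use an init container to do privileged setup, then run the main container as non-root (this only works in namespaces that do not enforce the Restricted standard, which also covers init containers)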

Pitfall 3: Conflicting UIDs

If the image expects to run as UID 1000 but you set runAsUser: 101, file permissions may break. The process runs as UID 101 but files are owned by UID 1000.

Solution: match the UID in the securityContext to the UID the image expects. Check the Dockerfile for the USER directive.

Pitfall 4: Capabilities vs Ports

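Dropping all capabilities removes CAP_NET_BIND_SERVICE, which allows binding to ports below 1024. If your application listens on port 80, it will usually fail to start, unless the runtime has been configured to allow unprivileged processes to bind low ports.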

Solutions:

Change the application to listen on a high port (8080, 3000, etc.)
Add back CAP_NET_BIND_SERVICE (less secure, although it is the one capability the Restricted standard allows you to add)
Use a Kubernetes Service to map port 80 to the container’s high port
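The last option can be sketched as a Service that exposes port 80 and forwards to the container's port 8080. The name and selector label are placeholders and must match whatever labels the pods actually carry.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-security      # placeholder name
spec:
  selector:
    app: hello-security     # placeholder label, must match the pod labels
  ports:
    - name: http
      port: 80              # port clients connect to
      targetPort: 8080      # unprivileged port the container listens on
```

Clients keep using port 80, while the container never needs CAP_NET_BIND_SERVICE.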
