Hello Security: Deep Dive
Why Secure Defaults Matter
By default, containers run with far more privileges than they need. The official nginx image runs as root, binds to port 80, and has write access to its entire filesystem. This is convenient for getting started but dangerous in production.
Before Kubernetes had Pod Security Standards, every team reinvented security baselines. Some teams ran everything as root because it was easier. Others locked down containers but missed important fields. Security became inconsistent across the organization.
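This demo shows the five essential security settings that every production container should have. These are not optional hardening steps. They are the baseline. The Restricted Pod Security Standard requires four of them (non-root user, no privilege escalation, all capabilities dropped, and a seccomp profile); a read-only root filesystem is not part of the standard but belongs in the same baseline.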
The demo deployment passed the Restricted standard in testing. Two pods started in under 3 seconds. The service responded normally. The security settings did not break functionality. They just removed privileges the application never needed.
The Container Attack Surface
Understanding why these settings matter requires understanding how container escapes work.
Containers Are Not Virtual Machines
Containers share the host kernel. A container process is just a Linux process with namespaces and cgroups applied. Unless user namespaces remap it, root (UID 0) inside a container is the same root user as on the host. The only thing separating it from the host is the namespace isolation.
Namespace isolation is strong, but not perfect. Kernel vulnerabilities sometimes allow containers to break out. Many of these exploits require specific conditions:
- Container runs as root
- Container has dangerous capabilities (SYS_ADMIN, SYS_PTRACE)
- Container can make dangerous syscalls (mount, ptrace, reboot)
- Container can modify its own filesystem
The demo’s security settings eliminate these conditions. An attacker who compromises the nginx process gets a non-root shell in a read-only filesystem with no capabilities and limited syscalls. Privilege escalation becomes extremely difficult.
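Taken together, the settings that remove these conditions fit in a few lines of a pod template. The fragment below is a minimal sketch, not the demo's actual manifest: the container name and image tag are illustrative assumptions, and the writable volumes nginx needs are left out for brevity.

```yaml
# Sketch of a pod template fragment (spec.template.spec of a Deployment).
# Container name and image tag are assumptions for illustration.
securityContext:                        # pod level: applies to every container
  runAsNonRoot: true                    # refuse to start containers that would run as UID 0
  runAsUser: 101
  runAsGroup: 101
  seccompProfile:
    type: RuntimeDefault                # use the runtime's default syscall filter
containers:
  - name: nginx
    image: nginxinc/nginx-unprivileged:1.25-alpine
    securityContext:                    # container level
      allowPrivilegeEscalation: false   # sets the no_new_privs flag
      readOnlyRootFilesystem: true      # needs emptyDir mounts for writable paths
      capabilities:
        drop:
          - ALL
```

Each field maps to one of the conditions listed above: no root user, no capabilities, a filtered syscall set, and an immutable filesystem.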
How Each Setting Protects You
runAsNonRoot and runAsUser
```yaml
# From manifests/deployment.yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101
    runAsGroup: 101
    fsGroup: 101
```
The runAsNonRoot field is a safety check. It tells the kubelet to refuse to start the container if it would run as UID 0. This prevents accidental deployments of root containers.
The runAsUser field explicitly sets the UID to 101. This is the nginx user in the nginxinc/nginx-unprivileged image.
Why UID 101 specifically? Container images define default users. The unprivileged nginx image creates a user named nginx with UID 101. We match that UID in the securityContext. If the values conflict (image expects UID 101 but we set UID 1000), the container starts as UID 1000 but may fail if it tries to read files owned by UID 101.
What this blocks:
Running as non-root limits the blast radius of code execution. Many container escapes rely on being root inside the container. The exploit for runc CVE-2019-5736 allowed attackers to overwrite the host runc binary, but only if the process in the container ran as root. Non-root containers were not exposed to it.
Even without kernel exploits, root can do more damage. Root can read sensitive files (if they leak into the container via misconfigured volumes), modify application code, and install malware.
readOnlyRootFilesystem with Writable Volumes
```yaml
# From manifests/deployment.yaml
containers:
  - name: nginx
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: cache
        mountPath: /var/cache/nginx
      - name: run
        mountPath: /var/run
      - name: tmp
        mountPath: /tmp
volumes:
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
  - name: tmp
    emptyDir: {}
```
A read-only root filesystem prevents all writes except to mounted volumes. The container image layers are immutable. An attacker cannot:
- Install backdoors or malware
- Modify the nginx binary
- Overwrite configuration files
- Create persistence mechanisms
- Tamper with logs (if written to the filesystem instead of stdout)
But nginx needs to write temporary files and its PID file. We mount emptyDir volumes at three paths:
- /var/cache/nginx: Proxy cache, fastcgi temp files, client body temp files
- /var/run: nginx.pid file
- /tmp: General temporary files
These volumes are ephemeral. They disappear when the pod is deleted. An attacker can write to them, but the changes do not outlive the pod. Note that emptyDir contents do survive a container restart within the same pod.
What this blocks:
Many persistence techniques rely on writing to the filesystem. An attacker might replace a binary with a trojaned version, create a cron job, or write a startup script that runs on reboot. Read-only filesystems eliminate these techniques.
The security audit flagged that our emptyDir volumes lack sizeLimit. Without a limit, an attacker who gains code execution could fill the volume and exhaust node disk space. In production, add limits:
```yaml
volumes:
  - name: cache
    emptyDir:
      sizeLimit: 100Mi
  - name: run
    emptyDir:
      sizeLimit: 10Mi
  - name: tmp
    emptyDir:
      sizeLimit: 50Mi
```
These limits prevent a compromised container from impacting other workloads on the node.
allowPrivilegeEscalation
```yaml
# From manifests/deployment.yaml
securityContext:
  allowPrivilegeEscalation: false
```
This field controls the Linux no_new_privs flag. When set to false, the kernel prevents a process from gaining more privileges than its parent.
Specifically, this neutralizes setuid and setgid binaries. A setuid binary runs with the file owner’s privileges instead of the caller’s privileges. The classic example is /usr/bin/sudo, which is owned by root with the setuid bit set. When a normal user runs sudo, the process gains root privileges.
With allowPrivilegeEscalation: false, setuid and setgid bits are ignored. Even if an attacker finds a setuid binary in the container image (perhaps left over from the base image), they cannot use it to escalate to root.
What this blocks:
Setuid binaries are a common privilege escalation vector. Vulnerabilities in sudo, su, passwd, and other setuid programs have been used to gain root access for decades.
In containers, these binaries should not exist. But base images sometimes include them. allowPrivilegeEscalation: false ensures they cannot be abused.
Kubernetes does not set this to false for you when you drop all capabilities. If the field is omitted, privilege escalation is allowed, and the Restricted Pod Security Standard rejects the pod. Always set it explicitly.
capabilities.drop: [ALL]
```yaml
# From manifests/deployment.yaml
securityContext:
  capabilities:
    drop:
      - ALL
```
Linux capabilities divide root’s powers into discrete units. Instead of all-or-nothing root, a process can have specific capabilities.
By default, container runtimes grant a set of capabilities that typically includes:
| Capability | What It Allows |
|---|---|
| CAP_CHOWN | Change file ownership |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 |
| CAP_SETUID / CAP_SETGID | Change process UID/GID |
| CAP_DAC_OVERRIDE | Bypass file permission checks |
| CAP_KILL | Send signals to processes owned by other users |
| CAP_NET_RAW | Use raw sockets (ping, packet sniffing) |
Dropping ALL removes even these defaults. The process has no special powers. It can only do what an unprivileged user can do.
When to add capabilities back:
Some applications need specific capabilities. A DNS server might need NET_BIND_SERVICE to listen on port 53. A monitoring tool might need NET_RAW for packet capture.
In those cases, drop all and add back only what is required:
```yaml
capabilities:
  drop:
    - ALL
  add:
    - NET_BIND_SERVICE
```
Never add capabilities speculatively. Only add them when the application fails without them, and only add the minimum needed.
What this blocks:
Many capabilities enable dangerous operations. CAP_SYS_ADMIN is near-root power (mount filesystems, configure namespaces, trace processes). CAP_SYS_PTRACE allows reading other processes’ memory, which can leak secrets. CAP_NET_ADMIN allows modifying firewall rules and network routing.
The default capabilities are less dangerous but still powerful. CAP_NET_RAW enables packet sniffing. CAP_SETUID allows changing to any user. Removing them shrinks the attack surface.
seccompProfile: RuntimeDefault
```yaml
# From manifests/deployment.yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
```
Seccomp (secure computing mode) restricts which system calls a process can make. A system call is how user-space code asks the kernel to do something (open a file, allocate memory, create a network socket).
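Linux has over 300 syscalls. Most applications use fewer than 100. The RuntimeDefault profile is whatever the container runtime ships as its default. Docker’s default profile, for example, blocks approximately 44 dangerous syscalls, including:
- reboot, swapon, swapoff: System-level operations
- mount, umount, pivot_root: Filesystem manipulation
- ptrace: Process tracing and debugging
- clock_settime: Changing the system clock
- kexec_load: Loading a new kernel
These syscalls are rarely needed in containers and are common in container escape exploits. The exact list depends on the runtime and kernel version. Docker’s default profile, for instance, allows ptrace on kernels 4.8 and later, so dropped capabilities remain an important second layer.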
Custom seccomp profiles:
For even stricter security, create custom profiles that allow only the syscalls your application uses. Tools like strace can record syscalls during testing:
```shell
strace -c -f -S name nginx 2>&1 | tail -n +3 | head -n -2 | awk '{print $NF}'
```
This generates a list of syscalls nginx makes. Build a seccomp profile that allows only those calls.
The Security Profiles Operator can manage custom profiles as Kubernetes resources, making them easier to deploy and update.
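Once a custom profile exists on the node, a pod can reference it with the Localhost profile type. A minimal sketch, assuming a profile file named profiles/nginx.json (a hypothetical name) has been placed under the kubelet's seccomp directory on every node:

```yaml
# Pod-level securityContext referencing a custom seccomp profile.
# "profiles/nginx.json" is a hypothetical file name; the path is resolved
# relative to the kubelet's seccomp directory (by default /var/lib/kubelet/seccomp).
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/nginx.json
```

If the file is missing on a node, containers scheduled there fail to start, which is one reason to distribute profiles with a tool rather than by hand.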
What this blocks:
Many kernel exploits rely on obscure syscalls. The ptrace syscall allows inspecting and modifying other processes, which can leak secrets. The mount syscall is used in some container escape techniques. Blocking these syscalls closes attack vectors.
Why Use nginxinc/nginx-unprivileged?
The official nginx image has two problems for secure deployments:
- Runs as root by default: The master process runs as UID 0. Worker processes run as the nginx user, but the master needs root to bind to port 80.
- Listens on port 80: Ports below 1024 require the CAP_NET_BIND_SERVICE capability (or root).
The nginxinc/nginx-unprivileged image solves both:
- All processes run as UID 101 (nginx user)
- Listens on port 8080 instead of 80
- Designed to work with readOnlyRootFilesystem: true
- Configuration files are in writable directories by default
This makes it compatible with the Restricted Pod Security Standard out of the box.
Other common images have unprivileged variants: bitnami/redis, bitnami/postgresql, bitnami/mongodb. Use these in production.
Security Context Hierarchy
Security settings can be specified at two levels:
Pod level (spec.securityContext):
- runAsUser, runAsGroup, runAsNonRoot
- fsGroup, fsGroupChangePolicy
- seccompProfile
- supplementalGroups, sysctls
Container level (spec.containers[*].securityContext):
- runAsUser, runAsGroup, runAsNonRoot
- readOnlyRootFilesystem
- allowPrivilegeEscalation
- capabilities
- privileged
- seccompProfile, seLinuxOptions, appArmorProfile
Container-level settings override pod-level settings. In the demo, seccompProfile is set at the pod level (applies to all containers), while capabilities and readOnlyRootFilesystem are set per-container.
Best practice: Set user and group at the pod level. Different containers in the same pod usually run as the same user. Set filesystem and capability restrictions per-container, since different containers may have different needs.
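The override behavior is easiest to see with two containers in one pod. The sketch below is illustrative only: the sidecar container, its image, and UID 1000 are assumptions, not part of the demo.

```yaml
# Sketch: pod-level defaults with one container-level override.
# The sidecar, its image, and UID 1000 are hypothetical.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101              # default for every container in the pod
    runAsGroup: 101
  containers:
    - name: nginx
      image: nginxinc/nginx-unprivileged:1.25-alpine
      securityContext:          # no runAsUser here, so UID 101 is inherited
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
    - name: sidecar
      image: example.com/log-shipper:1.0
      securityContext:
        runAsUser: 1000         # overrides the pod-level value for this container only
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
```

The pod-level runAsNonRoot check still applies to both containers, because neither overrides it.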
The fsGroup Field
```yaml
# From manifests/deployment.yaml
spec:
  securityContext:
    fsGroup: 101
```
The fsGroup field sets the group ownership of mounted volumes. When a PersistentVolume is mounted, files are owned by fsGroup. Processes in the pod can read and write files if they belong to that group.
In the demo, the emptyDir volumes inherit fsGroup 101. The nginx process runs as UID 101, GID 101, so it can write to /var/cache/nginx and /var/run.
Without fsGroup, the volumes might be owned by root (UID 0, GID 0). The nginx process would fail with permission denied errors when trying to write its PID file.
Security Considerations
The security audit found this demo meets the Restricted Pod Security Standard. All required controls are in place. The demo is a reference implementation of secure container defaults.
The audit flagged three informational items:
1. emptyDir Volumes Lack sizeLimit
The emptyDir volumes have no size limits. An attacker who gains code execution could fill these volumes and exhaust node disk space.
In production, add limits:
```yaml
volumes:
  - name: cache
    emptyDir:
      sizeLimit: 100Mi
```
The right limit depends on the application. Nginx’s proxy cache can grow large if caching is enabled. A simple static file server needs very little space.
2. Image Uses Tag Instead of Digest Pinning
The deployment uses nginxinc/nginx-unprivileged:1.25-alpine. This is a mutable tag. The image behind the tag can change.
For maximum reproducibility and security, pin to a digest:
```yaml
image: nginxinc/nginx-unprivileged:1.25-alpine@sha256:abc123...
```
Digest pinning ensures the exact same image runs every time. Tags like 1.25-alpine can be overwritten (accidentally or maliciously).
The trade-off: digest pinning makes updates harder. You must explicitly update the digest to get security patches. Many teams prefer tags with automated image scanning and update tools (like Renovate or Dependabot).
3. No NetworkPolicy Present
The demo has no NetworkPolicy. The nginx pods can send traffic to any destination and receive traffic from any source.
NetworkPolicies are out of scope for this demo. They are orthogonal to container security settings. But in production, apply a default-deny NetworkPolicy and explicitly allow only required traffic.
See demo 19 for NetworkPolicy patterns.
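For orientation, a default-deny policy is a short manifest. This is a generic sketch rather than anything taken from the demo; the policy name and namespace are placeholder assumptions.

```yaml
# Default-deny policy: selects every pod in the namespace and allows no traffic.
# The name and namespace are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: hello-security
spec:
  podSelector: {}        # an empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Because egress is denied as well, DNS lookups fail until a separate policy allows them. The policy only takes effect on clusters whose network plugin implements NetworkPolicy.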
What Would Change in Production
This demo focuses on container-level security. Production deployments need additional layers:
- NetworkPolicy: Restrict ingress and egress traffic
- Image scanning: Use Trivy Operator (demo 49) to detect vulnerabilities
- Runtime monitoring: Use Falco (demo 50) to detect anomalous behavior
- Policy enforcement: Use Kyverno (demo 42) to enforce these settings cluster-wide
- Pod Security Standards: Enforce Restricted standard at the namespace level (demo 24)
- Secret management: Use Vault (demo 28) or External Secrets (demo 29) instead of environment variables
- TLS: Terminate TLS at the Ingress or use cert-manager (demo 05) for pod-to-pod encryption
- Resource limits with monitoring: The demo has minimal limits. Production needs right-sized limits based on actual usage.
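Of the items above, namespace-level enforcement of the Restricted standard is the cheapest to adopt, since Pod Security Admission is driven by namespace labels. A minimal sketch, with the namespace name as a placeholder assumption:

```yaml
# Namespace labels that turn on Pod Security Admission for the Restricted standard.
# The namespace name is a placeholder.
apiVersion: v1
kind: Namespace
metadata:
  name: hello-security
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # show warnings to the client
    pod-security.kubernetes.io/audit: restricted     # annotate audit log entries
```

Starting with only the warn and audit labels is a common way to find violations before switching on enforce.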
Trade-offs and Alternatives
Minimal Images vs Feature-Rich Images
Minimal images (alpine, distroless) have fewer packages and smaller attack surfaces. But they make debugging harder. Distroless images ship no shell, no package manager, and no standard tools, and alpine includes only a small BusyBox toolset.
Feature-rich images (ubuntu, debian) make debugging easier but include thousands of binaries you do not need. Each binary is a potential attack vector.
A middle ground: use alpine or distroless in production, but keep a full-featured image available for kubectl debug.
Init Containers for Setup
Some applications need to write configuration files or set up directories at startup. With readOnlyRootFilesystem: true, this fails.
Solution: use an init container that does the setup and writes to a shared volume. The main container runs as non-root with a read-only filesystem and reads the configuration from the volume. The init container can run as root if the setup requires it, but the Restricted standard applies to init containers too, so a root init container needs an exemption or a less strict namespace policy.
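The pattern can be sketched as follows. This is an illustrative fragment, not the demo's manifest: the init container, its image, the volume name, and the generated file are all assumptions. This variant keeps the init container non-root, which works when the setup only needs to write to the shared volume.

```yaml
# Sketch: an init container prepares configuration in a shared emptyDir.
# Names, images, and the generated file are hypothetical. The cache, run,
# and tmp volumes nginx also needs are omitted for brevity.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101
  initContainers:
    - name: render-config
      image: busybox:1.36
      command: ["sh", "-c", "echo 'server { listen 8080; }' > /work/default.conf"]
      volumeMounts:
        - name: conf
          mountPath: /work              # writable during setup
  containers:
    - name: nginx
      image: nginxinc/nginx-unprivileged:1.25-alpine
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d
          readOnly: true                # main container only reads the result
  volumes:
    - name: conf
      emptyDir: {}
```

Mounting the volume read-only in the main container keeps the generated configuration out of reach of a compromised nginx process.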
Service Mesh for mTLS
This demo has no encryption. Traffic between pods is plaintext. A service mesh (Istio, Linkerd) can add mutual TLS without changing application code. See demo 41 for Istio.
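As an indication of how little application change is involved, a mesh typically turns on mutual TLS through a policy resource. The sketch below uses Istio's PeerAuthentication resource. The namespace name is a placeholder assumption, and it presumes Istio is installed with sidecar injection enabled for that namespace.

```yaml
# Require mutual TLS for all workloads in one namespace (Istio).
# The namespace is a placeholder; the workloads themselves are unchanged.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: hello-security
spec:
  mtls:
    mode: STRICT         # plaintext connections to these workloads are refused
```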
Common Pitfalls
Pitfall 1: Forgetting Writable Volumes
Setting readOnlyRootFilesystem: true without mounting writable volumes causes crashes. The application tries to write a temp file and fails with “Read-only file system.”
Check the application’s documentation or run it with strace to see which directories it writes to. Mount emptyDir volumes at those paths.
Pitfall 2: Using Root-Only Images
Some images cannot run as non-root without modification. They bind to privileged ports, install packages at startup, or expect to write to root-owned directories.
Solutions:
- Find an unprivileged variant (nginx-unprivileged, bitnami/redis, etc.)
- Rebuild the image to run as non-root
- Use an init container to do privileged setup, then run the main container as non-root
Pitfall 3: Conflicting UIDs
If the image expects to run as UID 1000 but you set runAsUser: 101, file permissions may break. The process runs as UID 101 but files are owned by UID 1000.
Solution: match the UID in the securityContext to the UID the image expects. Check the Dockerfile for the USER directive.
Pitfall 4: Capabilities vs Ports
Dropping all capabilities removes CAP_NET_BIND_SERVICE, which allows binding to ports below 1024. If your application listens on port 80, it will fail.
Solutions:
- Change the application to listen on a high port (8080, 3000, etc.)
- Add back CAP_NET_BIND_SERVICE (less secure)
- Use a Kubernetes Service to map port 80 to the container’s high port
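The Service-based option needs no capability at all, because the mapping happens outside the container. A minimal sketch, assuming the pods carry an app: hello-security label (the Service name and label are hypothetical):

```yaml
# Service exposes port 80 and forwards to the container's unprivileged port.
# The Service name and selector label are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: hello-security
spec:
  selector:
    app: hello-security
  ports:
    - name: http
      port: 80           # port clients connect to
      targetPort: 8080   # port nginx listens on inside the container
```

Clients keep using port 80 while the process inside the container never binds a privileged port.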
Further Reading
- Kubernetes Pod Security Standards
- Linux Capabilities Manual
- Seccomp in Kubernetes
- CIS Kubernetes Benchmark
- NIST Application Container Security Guide
- runC CVE-2019-5736 Explained
- Nginx Unprivileged Image Documentation
See Also
- Pod Security Standards for namespace-level enforcement
- Kyverno for policy-based enforcement
- Trivy Operator for vulnerability scanning
- Falco for runtime threat detection