DaemonSet: Deep Dive
This document explains how DaemonSets schedule pods on every node, why their update strategies differ from Deployments, and when to use tolerations, node selectors, and host access. It connects the demo manifests to production patterns for logging, monitoring, and networking.
What DaemonSets Guarantee
A DaemonSet ensures that every node (or a selected subset) runs exactly one copy of a pod. When a new node joins the cluster, the DaemonSet controller automatically schedules a pod on it. When a node is removed, the pod is garbage collected.
This is fundamentally different from Deployments and StatefulSets, which scale by replica count regardless of node topology. A Deployment with 3 replicas might land all 3 pods on the same node. A DaemonSet always produces exactly one pod per qualifying node.
How DaemonSet Scheduling Works
The Old Way (Pre-1.12)
Before Kubernetes 1.12, the DaemonSet controller bypassed the default scheduler entirely. It
set the nodeName field directly on the pod spec, which forced the pod onto a specific node
without going through the scheduler.
This caused problems. The pod skipped all scheduler predicates (resource checks, affinity rules, taints). It could land on nodes that were already overcommitted.
The Current Way
Since Kubernetes 1.12, DaemonSet pods go through the default scheduler. The DaemonSet
controller creates pods with a nodeAffinity that targets a specific node:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchFields:
            - key: metadata.name
              operator: In
              values: ["node-1"]
```

The scheduler then evaluates this pod like any other. It checks resource availability, taints, and other constraints. If the node cannot accommodate the pod, the pod stays Pending.
This approach integrates DaemonSets with the scheduler’s priority and preemption system. A high-priority DaemonSet pod can preempt lower-priority pods on a node.
Tolerations and Taints
Taints are node-level markers that repel pods. Tolerations are pod-level declarations that allow a pod to run on tainted nodes.
The demo’s node-monitor uses a toleration to run on control-plane nodes:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
  namespace: daemonset-demo
spec:
  selector:
    matchLabels:
      app: node-monitor
  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: monitor
          image: busybox:1.36
          command: ["/bin/sh", "/scripts/monitor.sh"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts
            - name: host-log
              mountPath: /var/log/containers
              readOnly: true
```

How Taints and Tolerations Interact
Control-plane nodes typically carry this taint:
```
node-role.kubernetes.io/control-plane:NoSchedule
```

Without a matching toleration, no regular pod can be scheduled there. The node-monitor’s
toleration says: “I accept nodes tainted with node-role.kubernetes.io/control-plane,
regardless of the taint value.”
The operator: Exists means the toleration matches the key regardless of value. The
effect: NoSchedule means it only tolerates the NoSchedule effect, not NoExecute.
Common Taint Effects
| Effect | Behavior |
|---|---|
| NoSchedule | New pods are not scheduled. Existing pods stay. |
| PreferNoSchedule | Scheduler avoids the node but does not guarantee it. |
| NoExecute | New pods are not scheduled. Existing pods are evicted. |
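For NoExecute taints, a pod-level toleration can also bound how long the pod stays once the taint appears. A hedged sketch (the key here is one of the standard node-condition taints; tolerationSeconds is illustrative):

```yaml
# Sketch: tolerate a NoExecute taint for up to 5 minutes before eviction.
# Omitting tolerationSeconds means the pod tolerates the taint indefinitely.
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```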
DaemonSet-Specific Tolerations
The DaemonSet controller automatically adds several tolerations to pods it creates:
- node.kubernetes.io/not-ready:NoExecute (tolerate not-ready nodes)
- node.kubernetes.io/unreachable:NoExecute (tolerate unreachable nodes)
- node.kubernetes.io/disk-pressure:NoSchedule
- node.kubernetes.io/memory-pressure:NoSchedule
- node.kubernetes.io/pid-pressure:NoSchedule
- node.kubernetes.io/unschedulable:NoSchedule
These are added automatically because DaemonSets need to run on every node, even nodes under pressure. A monitoring agent is most useful precisely when a node is having problems.
Node Selection: nodeSelector vs Node Affinity
nodeSelector
The simplest way to restrict a DaemonSet to specific nodes. It uses label matching:
```yaml
spec:
  template:
    spec:
      nodeSelector:
        disk: ssd
```

Only nodes with the label disk=ssd get a pod. This is an AND operation. All specified labels
must match.
Node Affinity
A more expressive alternative. Node affinity supports In, NotIn, Exists, DoesNotExist,
Gt, and Lt operators:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values: ["linux"]
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["m5.large", "m5.xlarge"]
```

Node affinity also supports preferredDuringSchedulingIgnoredDuringExecution, which
expresses a preference without making it a hard requirement.
When to Use Which
Use nodeSelector for simple label matches. Use node affinity when you need OR logic
(multiple nodeSelectorTerms), negative matching (NotIn, DoesNotExist), or soft
preferences.
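To make the OR case concrete, here is a hedged sketch (the label keys disk and legacy are assumptions for illustration). Terms within nodeSelectorTerms are ORed; expressions within a single term are ANDed:

```yaml
# Sketch: a node qualifies if it is labeled disk=ssd, OR if it does not
# carry the "legacy" label at all. Neither condition is expressible with
# a plain nodeSelector.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disk
              operator: In
              values: ["ssd"]
        - matchExpressions:
            - key: legacy
              operator: DoesNotExist
```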
Update Strategies
Section titled “Update Strategies”RollingUpdate (Default)
The demo’s log-collector uses a rolling update:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: daemonset-demo
spec:
  selector:
    matchLabels:
      app: log-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: collector
          image: busybox:1.36
```

When you update the pod template (change the image, add an env var, etc.), the DaemonSet controller rolls out the change one node at a time. It terminates the old pod on a node, waits for it to be fully gone, then creates the new pod.
maxUnavailable: 1 means at most 1 node at a time lacks a running DaemonSet pod during the
rollout. You can increase this to speed up large rollouts:
```yaml
rollingUpdate:
  maxUnavailable: 25%
```

On a 100-node cluster, this allows up to 25 nodes to be updated simultaneously.
maxSurge
Starting in Kubernetes 1.22, DaemonSet RollingUpdate supports maxSurge:
```yaml
rollingUpdate:
  maxSurge: 1
  maxUnavailable: 0
```

With maxSurge: 1, the controller creates the new pod before deleting the old one. This
means a node temporarily runs two DaemonSet pods. The old pod is removed only after the new
one is Ready.
This is useful for zero-downtime updates of DaemonSet workloads. Without maxSurge, there is
always a gap between the old pod terminating and the new pod starting.
Note: maxSurge and maxUnavailable cannot both be zero. At least one must be positive.
OnDelete
Section titled “OnDelete”With OnDelete, the DaemonSet controller never automatically replaces pods. You must manually
delete each pod to trigger its replacement:
```yaml
spec:
  updateStrategy:
    type: OnDelete
```

This gives you full control over when each node gets updated. It is useful for critical infrastructure like CNI plugins where a bad update could take down networking.
Host Access: hostPath, hostNetwork, hostPID
DaemonSet pods often need access to the host system. The demo uses hostPath to read container
logs from the node filesystem:
```yaml
volumes:
  - name: host-log
    hostPath:
      path: /var/log/containers
      type: Directory
```

hostPath
Mounts a file or directory from the host node’s filesystem into the pod. Common use cases:
| Path | Purpose |
|---|---|
| /var/log | Node and container logs |
| /var/log/containers | Container log files |
| /sys | Kernel parameters and hardware info |
| /proc | Process information |
| /etc/machine-id | Unique node identifier |
| /var/run/docker.sock | Container runtime socket (legacy) |
The type field validates the path:
| Type | Behavior |
|---|---|
| "" | No check (default) |
| DirectoryOrCreate | Creates the directory if it does not exist |
| Directory | Must be an existing directory |
| FileOrCreate | Creates the file if it does not exist |
| File | Must be an existing file |
| Socket | Must be an existing Unix socket |
| CharDevice | Must be an existing character device |
| BlockDevice | Must be an existing block device |
hostNetwork
```yaml
spec:
  template:
    spec:
      hostNetwork: true
```

The pod uses the host’s network namespace. It shares the node’s IP address and can bind to host ports directly. This is used by CNI plugins and some monitoring agents that need to see all network traffic on the node.
hostPID
```yaml
spec:
  template:
    spec:
      hostPID: true
```

The pod shares the host’s PID namespace. It can see all processes on the node. This is useful for monitoring agents that need to inspect process trees or send signals to host processes.
Security Implications
All three (hostPath, hostNetwork, hostPID) break pod isolation. A pod with hostPath
can read sensitive files. A pod with hostNetwork can bind to any port. A pod with hostPID
can see all processes.
In production, these should be paired with:
- SecurityContext: Run as non-root, drop capabilities.
- PodSecurityAdmission: Use restricted or baseline profiles.
- Read-only mounts: Set readOnly: true on hostPath volume mounts.
- RBAC: Restrict which ServiceAccounts can create pods with host access.
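A hedged sketch of what that pairing can look like in practice (container and volume names are illustrative, not from the demo manifests):

```yaml
# Sketch: hardening a pod that mounts a host path. Runs as an unprivileged
# user, forbids privilege escalation, and drops all Linux capabilities.
spec:
  template:
    spec:
      containers:
        - name: agent
          image: busybox:1.36
          securityContext:
            runAsNonRoot: true
            runAsUser: 65534          # "nobody"
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: host-log
              mountPath: /var/log/containers
              readOnly: true
      volumes:
        - name: host-log
          hostPath:
            path: /var/log/containers
            type: Directory
```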
The demo’s log-collector mounts /var/log/containers as read-only:
```yaml
volumeMounts:
  - name: varlog
    mountPath: /var/log/containers
    readOnly: true
```

Resource Requests and Limits
DaemonSet pods compete for resources with application pods. On a busy node, a DaemonSet pod without resource requests might get evicted or starved.
Both demo DaemonSets set conservative resource requests:
```yaml
resources:
  requests:
    cpu: 25m
    memory: 16Mi
  limits:
    cpu: 50m
    memory: 32Mi
```

This reserves a small slice of the node for monitoring. The low limits prevent a runaway monitoring script from consuming excessive resources.
In production, set requests based on observed usage. A Fluentd log collector might need 200m CPU and 256Mi memory. An underfunded Fluentd will drop logs under load.
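A production-scale sketch along those lines (the numbers here are illustrative assumptions, not vendor recommendations; derive yours from observed usage):

```yaml
# Sketch: resource sizing for a heavier log collector such as Fluentd.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi   # memory limit guards against unbounded log buffering
```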
Production Patterns
Section titled “Production Patterns”Log Collection (Fluentd, Fluent Bit, Vector)
The most common DaemonSet use case. A log collector runs on every node, reads container logs from the node filesystem, and ships them to a central system (Elasticsearch, Loki, CloudWatch).
Key design points:
- Mount /var/log and /var/lib/docker/containers (or /var/log/pods for CRI-based runtimes).
- Use readOnly: true for safety.
- Set memory limits carefully. Log collectors can buffer large volumes of data.
- Tolerate all taints so logs are collected from every node, including control-plane nodes.
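The host mounts from those points can be sketched as a pod-spec fragment (volume names are illustrative, and busybox stands in for a real collector image):

```yaml
# Sketch: host mounts for a log collector on a CRI-based runtime.
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: pods
    hostPath:
      path: /var/log/pods
containers:
  - name: collector
    image: busybox:1.36   # stand-in for fluent-bit, vector, etc.
    volumeMounts:
      - name: varlog
        mountPath: /var/log
        readOnly: true
      - name: pods
        mountPath: /var/log/pods
        readOnly: true
```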
Monitoring Agents (Prometheus Node Exporter, Datadog Agent)
A monitoring agent collects node-level metrics: CPU, memory, disk, network. It exposes a metrics endpoint that Prometheus scrapes.
Key design points:
- Mount /proc and /sys for system metrics.
- Use hostNetwork: true if the agent needs to see node-level network metrics.
- Use hostPID: true if the agent needs to see all processes.
- Set appropriate resource requests. Monitoring should not compete with application pods.
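The points above can be sketched as a pod-spec fragment (the image tag, mount paths, and resource numbers are assumptions for illustration):

```yaml
# Sketch: a node metrics agent with host access and read-only host mounts.
spec:
  template:
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: exporter
          image: prom/node-exporter:v1.7.0   # assumed tag; pin your own
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
```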
CNI Plugins (Calico, Cilium, Flannel)
Container Network Interface plugins run as DaemonSets. They configure networking for every pod on the node.
Key design points:
- Use hostNetwork: true because the CNI plugin manages the network itself.
- Use the OnDelete update strategy because a broken CNI update can take down all networking on the node.
- Set priorityClassName: system-node-critical to ensure the CNI pod is never evicted.
- Tolerate all taints, including NoExecute.
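Combined, those settings look roughly like this fragment (a generic sketch, not the manifest of any particular plugin):

```yaml
# Sketch: CNI-style DaemonSet spec fragment.
spec:
  updateStrategy:
    type: OnDelete
  template:
    spec:
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
        - operator: Exists   # bare Exists tolerates every taint, incl. NoExecute
```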
Storage Drivers (CSI Node Plugins)
CSI node plugins run as DaemonSets. They handle mounting and unmounting volumes on each node.
Key design points:
- Mount the host’s /var/lib/kubelet directory.
- Use a privileged: true security context for mount operations.
- Set priorityClassName: system-node-critical.
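A sketch of that fragment (container and volume names are illustrative). Bidirectional mount propagation lets mounts made inside the container become visible to the kubelet on the host:

```yaml
# Sketch: CSI node plugin pod spec fragment.
spec:
  template:
    spec:
      priorityClassName: system-node-critical
      containers:
        - name: csi-node
          securityContext:
            privileged: true   # required for mount/unmount syscalls
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: Bidirectional
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
```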
DaemonSet vs Deployment with Anti-Affinity
You could approximate DaemonSet behavior with a Deployment that has pod anti-affinity and a replica count matching the node count. But this is fragile:
- You must manually adjust the replica count when nodes are added or removed.
- Pod anti-affinity is a scheduling hint, not a guarantee (in preferred mode).
- The scheduler does not understand “one per node” as a first-class concept.
DaemonSets handle node topology natively. They track node membership and reconcile automatically. Use them whenever you need exactly one pod per node.
DaemonSet Controller Internals
The DaemonSet controller runs inside kube-controller-manager. On each reconciliation:
- List all nodes. Determine which nodes qualify (based on nodeSelector, affinity, taints).
- List all DaemonSet pods. Find pods owned by this DaemonSet.
- Compare. For each qualifying node, check if a pod exists.
- Create missing pods. If a qualifying node has no pod, create one.
- Delete extra pods. If a non-qualifying node has a pod (label changed, taint added), delete it.
- Handle updates. If the pod template has changed and the update strategy is RollingUpdate, replace pods according to maxUnavailable and maxSurge.
The controller also watches for node events (new node, node deletion, label changes) to trigger reconciliation immediately.
Priority and Preemption
DaemonSet pods should use priorityClassName to ensure they are not evicted by application
pods:
```yaml
spec:
  template:
    spec:
      priorityClassName: system-node-critical
```

Built-in priority classes:
| Priority Class | Value | Purpose |
|---|---|---|
| system-cluster-critical | 2000000000 | Cluster-level infrastructure |
| system-node-critical | 2000001000 | Node-level infrastructure |
DaemonSet pods with system-node-critical priority can preempt lower-priority pods. This
ensures critical node infrastructure (logging, monitoring, networking) always has resources.
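For DaemonSets that merit priority but are not node-critical, you can define a custom PriorityClass. A hedged sketch (the name and value are assumptions for illustration):

```yaml
# Sketch: a custom priority class that outranks ordinary application pods
# without competing with the built-in system-* classes.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high
value: 100000
globalDefault: false
description: "For DaemonSets that should not be evicted by application pods."
```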
Rollback
DaemonSets maintain revision history, similar to Deployments. You can roll back to a previous version:
```shell
# Check revision history
kubectl rollout history daemonset/node-monitor -n daemonset-demo

# Roll back to the previous revision
kubectl rollout undo daemonset/node-monitor -n daemonset-demo

# Roll back to a specific revision
kubectl rollout undo daemonset/node-monitor --to-revision=2 -n daemonset-demo
```

The controller applies the old pod template and rolls out the change using the configured update strategy.
Connection to the Demo
The demo manifests illustrate two common patterns:
- node-monitor: A monitoring agent that tolerates control-plane taints and mounts host paths for log counting. It shows how to access node-level information from a pod.
- log-collector: A log shipper that uses RollingUpdate with maxUnavailable: 1 and mounts /var/log/containers read-only. It shows the minimal setup for collecting container logs.
Both DaemonSets run one pod per node. On a single-node minikube cluster, you see one pod each.
Adding a second node with minikube node add demonstrates automatic scheduling.
Common Pitfalls
Missing Tolerations
If your DaemonSet pod is missing from a node, check the node’s taints:
```shell
kubectl describe node <node-name> | grep Taints
```

Add matching tolerations to the DaemonSet pod spec.
Resource Starvation
DaemonSet pods without resource requests can be evicted under memory pressure. Always set resource requests, even if they are small.
hostPath Permissions
Some host paths require root access. If the container runs as non-root, it may get permission
denied errors when reading /proc or /sys. Use securityContext.runAsUser: 0 or adjust
file permissions.
Update Strategy Mismatch
Using RollingUpdate for a CNI plugin can be dangerous. If the new version has a bug, nodes
lose networking as the old pod is replaced. Use OnDelete for critical infrastructure and
test updates manually.