
Event-Driven Architecture (Kafka): Deep Dive

This demo runs Apache Kafka and Zookeeper as StatefulSets in Kubernetes. A producer sends timestamped messages to a Kafka topic every 5 seconds. A consumer reads those messages in real time. The producer and consumer do not know about each other. Kafka sits between them, handling storage, ordering, and delivery.

This document explains why event-driven architecture exists, how Kafka works internally, how it differs from traditional message queues, and what it takes to run Kafka reliably on Kubernetes.

In the previous demo, the backend serves API requests synchronously. A client sends a request. The backend responds. Done. This works until it does not.

Synchronous communication creates temporal coupling. Both parties must be online at the same time. If the backend is down, the request fails. If the backend is slow, the client waits.

Event-driven architecture removes that coupling. The producer publishes an event. The broker stores it. The consumer reads it when ready. If the consumer is offline, events accumulate in the broker. When the consumer comes back, it catches up.

This pattern is essential for:

  • Order processing pipelines where multiple systems react to a single event
  • Audit logging where events must be durable and ordered
  • Real-time analytics where high-volume data streams from many sources
  • Integration between systems owned by different teams

Kafka is a distributed commit log. It is not a traditional message queue. Understanding that distinction is key.

A Kafka cluster consists of one or more brokers. Each broker is a server that stores data and serves clients. This demo runs a single broker:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka-demo
spec:
  serviceName: kafka
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: docker.io/bitnami/kafka:3.6
          ports:
            - containerPort: 9092
              name: kafka
          env:
            - name: KAFKA_CFG_ZOOKEEPER_CONNECT
              value: zookeeper:2181
            - name: KAFKA_CFG_LISTENERS
              value: PLAINTEXT://:9092
            - name: KAFKA_CFG_ADVERTISED_LISTENERS
              value: PLAINTEXT://kafka-0.kafka.kafka-demo.svc.cluster.local:9092
            - name: KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE
              value: "true"
            - name: ALLOW_PLAINTEXT_LISTENER
              value: "yes"

In production, you run 3 or more brokers for fault tolerance. Data is replicated across brokers so that no single failure loses messages.

A topic is a named stream of events. This demo uses a topic called events. The producer writes to it. The consumer reads from it.

Topics are created automatically because KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE is set to true. In production, you usually create topics explicitly with a defined number of partitions and a replication factor:

kafka-topics.sh --create --topic orders \
  --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

A topic is divided into partitions. Each partition is an ordered, immutable sequence of messages. Partitions are the unit of parallelism in Kafka.

With one partition, messages are strictly ordered. With three partitions, Kafka distributes messages across them. Ordering is guaranteed within a partition but not across partitions.

Topic: events
Partition 0: [msg1, msg4, msg7, ...]
Partition 1: [msg2, msg5, msg8, ...]
Partition 2: [msg3, msg6, msg9, ...]

Each partition lives on one broker (with replicas on other brokers). Adding partitions allows more consumers to read in parallel, but you cannot reduce the number of partitions after creation.

Consumers belong to a group. Kafka assigns each partition to exactly one consumer in the group. If you have 3 partitions and 3 consumers in the same group, each consumer reads from one partition.

If a consumer crashes, Kafka reassigns its partition to another consumer in the group. This is called rebalancing.
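The one-partition-to-one-consumer invariant can be sketched as follows. This is a simplified round-robin assignment for illustration only; Kafka's real assignors (range, round-robin, sticky, cooperative-sticky) live in the client and are configurable.

```python
def assign_partitions(partitions, consumers):
    """Round-robin mapping of partitions to consumers (illustrative).

    Each partition goes to exactly one consumer in the group; a
    'rebalance' is just running the assignment again with the
    surviving consumers.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 3 consumers: one partition each.
before = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])
# c3 crashes; its partition is reassigned to a remaining consumer.
after = assign_partitions([0, 1, 2], ["c1", "c2"])
```

With more consumers than partitions, the extra consumers sit idle, which is why partition count caps parallelism.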

The consumer in this demo uses kafka-console-consumer.sh with --from-beginning, which reads all messages from the start of the topic. In production, consumers track their position using offsets and resume from where they left off.

Each message in a partition has a numeric offset. Offset 0 is the first message. Offset 1 is the second. Offsets are never reused.

Consumers commit their current offset to Kafka. If a consumer restarts, it reads from the last committed offset. This is how Kafka provides at-least-once delivery: the consumer processes a message, then commits its offset. If it crashes before committing, it reprocesses the message after restart.
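The process-then-commit loop and the resulting duplicate can be simulated without a real Kafka client. This sketch fakes a crash between processing and committing; the function names are hypothetical.

```python
def consume(messages, committed_offset, crash_before_commit_at=None):
    """Process-then-commit loop over a partition (simulation).

    If the consumer 'crashes' after processing a message but before
    committing its offset, a restart resumes from the last committed
    offset and that message is processed again: at-least-once delivery.
    """
    processed = []
    offset = committed_offset
    for i in range(committed_offset, len(messages)):
        processed.append(messages[i])      # 1. process the message
        if i == crash_before_commit_at:
            return processed, offset       # crash: offset never committed
        offset = i + 1                     # 2. commit the offset
    return processed, offset

msgs = ["m0", "m1", "m2"]
# First run crashes after processing m1 but before committing:
done, offset = consume(msgs, 0, crash_before_commit_at=1)
# Restart from the last committed offset: m1 is processed a second time.
done2, offset2 = consume(msgs, offset)
```

The overlap between the two runs ("m1" appears in both) is exactly the duplicate that idempotent consumers are designed to absorb.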

The producer sends a message every 5 seconds:

containers:
  - name: producer
    image: docker.io/bitnami/kafka:3.6
    command:
      - /bin/bash
      - -c
      - |
        echo "Waiting for Kafka to be ready..."
        sleep 30
        echo "Starting producer, sending to topic 'events'..."
        count=0
        while true; do
          count=$((count + 1))
          message="event-${count}: order-placed at $(date '+%Y-%m-%d %H:%M:%S')"
          echo "$message" | kafka-console-producer.sh \
            --broker-list kafka-0.kafka.kafka-demo.svc.cluster.local:9092 \
            --topic events
          echo "[producer] Sent: $message"
          sleep 5
        done

Two things to notice. First, the producer uses the StatefulSet DNS name kafka-0.kafka.kafka-demo.svc.cluster.local to reach the broker. This is stable across pod restarts. Second, the producer sleeps 30 seconds before starting to give Kafka time to initialize.

In production, producers care deeply about delivery guarantees.

acks=0. The producer sends the message and does not wait for acknowledgment. Fastest but messages can be lost. Use for metrics and logs where occasional loss is acceptable.

acks=1. The producer waits for the partition leader to acknowledge. If the leader crashes before replication, the message is lost. Good balance of speed and reliability.

acks=all. The producer waits for all in-sync replicas to acknowledge. Slowest but no data loss as long as at least one replica survives. Required for financial transactions, audit logs, and anything you cannot afford to lose.

Idempotent producer. Kafka supports idempotent writes. The producer assigns a sequence number to each message. If a network retry causes a duplicate, Kafka deduplicates it. Enable with enable.idempotence=true.

Exactly-once semantics. Kafka’s transactional API lets a producer write to multiple partitions atomically. Combined with idempotent writes, this provides exactly-once within Kafka. Crossing system boundaries still requires application-level idempotency.
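The idempotent-producer mechanism boils down to broker-side deduplication by sequence number. This toy model shows only the idea; real Kafka also tracks producer epochs and per-partition state, and the class and method names here are invented for illustration.

```python
class Broker:
    """Toy broker that deduplicates writes by (producer_id, sequence).

    An idempotent producer attaches a monotonically increasing sequence
    number to each message; if a network retry redelivers a message the
    broker has already seen, the broker drops it instead of appending.
    """
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence accepted

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate from a retry: drop silently
        self.last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = Broker()
broker.append("p1", 0, "order-placed")
broker.append("p1", 0, "order-placed")  # retry of the same send: deduplicated
broker.append("p1", 1, "order-shipped")
```

After these three sends the log holds exactly two messages, which is what `enable.idempotence=true` buys you without any application changes.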

The consumer reads from the events topic:

containers:
  - name: consumer
    image: docker.io/bitnami/kafka:3.6
    command:
      - /bin/bash
      - -c
      - |
        echo "Waiting for Kafka to be ready..."
        sleep 40
        echo "Starting consumer, reading from topic 'events'..."
        kafka-console-consumer.sh \
          --bootstrap-server kafka-0.kafka.kafka-demo.svc.cluster.local:9092 \
          --topic events \
          --from-beginning

The --from-beginning flag tells the consumer to start at offset 0. Without it, the consumer only reads new messages. The 40-second sleep ensures Kafka and the producer are running before the consumer starts.

At-most-once. Commit the offset before processing. If processing fails, the message is skipped. No duplicates, but messages can be lost.

At-least-once. Process the message, then commit the offset. If the consumer crashes after processing but before committing, it reprocesses the message. No lost messages, but duplicates are possible. This is Kafka’s default behavior.

Exactly-once. Use Kafka’s transactional consumer API. More complex and slower but eliminates both loss and duplication.

Most systems use at-least-once and design idempotent consumers. Processing the same order twice should produce the same result.
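An idempotent consumer can be sketched in a few lines: processing is keyed by a stable event ID, so a redelivered message becomes a no-op. The field names (`event_id`, `order`, `status`) are illustrative, not from the demo.

```python
def apply_event(state, seen_ids, event):
    """Apply an event at most once, keyed by its unique event ID."""
    if event["event_id"] in seen_ids:
        return  # at-least-once redelivery: already applied, skip
    seen_ids.add(event["event_id"])
    state[event["order"]] = event["status"]

state, seen = {}, set()
e = {"event_id": "evt-1", "order": "order-123", "status": "placed"}
apply_event(state, seen, e)
apply_event(state, seen, e)  # duplicate delivery changes nothing
```

In production the `seen_ids` set lives in a database or is replaced by a naturally idempotent write (an upsert keyed by event ID), so it survives consumer restarts.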

Ordering guarantees depend on partitioning. A single partition gives total ordering but limits throughput to one consumer per group. Multiple partitions allow parallel consumption but only guarantee order within each partition.

To maintain order for related events, use a message key. Kafka hashes the key to select a partition. All messages with the same key go to the same partition:

Key: order-123 -> hash -> Partition 1
Key: order-456 -> hash -> Partition 0
Key: order-123 -> hash -> Partition 1 (same partition)

This demo does not use keys. Messages distribute round-robin.
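The key-to-partition mapping can be demonstrated with any deterministic hash. Kafka's default partitioner uses murmur2; `crc32` here is a stand-in chosen only to show the property that matters: the same key always lands on the same partition.

```python
import zlib

def partition_for(key, num_partitions):
    """Hash a message key to a partition (crc32 as a murmur2 stand-in)."""
    return zlib.crc32(key.encode()) % num_partitions

p1 = partition_for("order-123", 3)
p2 = partition_for("order-123", 3)
assert p1 == p2  # same key, same partition: per-order ordering holds
```

Note the corollary: changing the partition count changes the modulus, so keys remap and per-key ordering is only guaranteed going forward, not across the resize.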

Zookeeper manages Kafka’s cluster metadata:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: kafka-demo
spec:
  serviceName: zookeeper
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: docker.io/bitnami/zookeeper:3.9
          ports:
            - containerPort: 2181
              name: client
          env:
            - name: ALLOW_ANONYMOUS_LOGIN
              value: "yes"

Zookeeper tracks which brokers are alive, which broker leads each partition, topic configuration, and access control lists.

Kafka connects to Zookeeper via the KAFKA_CFG_ZOOKEEPER_CONNECT environment variable:

env:
  - name: KAFKA_CFG_ZOOKEEPER_CONNECT
    value: zookeeper:2181

Starting with Kafka 3.3, Kafka can run in KRaft mode, where brokers manage their own metadata using an internal Raft consensus protocol. KRaft eliminates the operational burden of running Zookeeper separately: fewer moving parts, fewer failure modes, faster metadata operations.

This demo uses Zookeeper because it is simpler for learning. In production with Kafka 3.6+, KRaft is the recommended approach.

Running Kafka on Kubernetes requires careful attention to storage and networking. Both Kafka and Zookeeper need stable identities and persistent storage. That is why they run as StatefulSets.

A Deployment creates pods with random names. If a pod restarts, it gets a new name. Kafka needs stable names because the advertised listener address must be predictable:

- name: KAFKA_CFG_ADVERTISED_LISTENERS
  value: PLAINTEXT://kafka-0.kafka.kafka-demo.svc.cluster.local:9092

kafka-0 is the stable name that the StatefulSet guarantees. If the pod restarts, it comes back as kafka-0. Clients reconnect to the same address.

With a Deployment, the pod might restart as kafka-7f8b6c4d5-xq2pn. The advertised listener would be wrong, and clients could not reconnect.

Both StatefulSets use headless Services (clusterIP: None):

apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: kafka-demo
spec:
  selector:
    app: kafka
  ports:
    - port: 9092
      targetPort: 9092
      name: kafka
  clusterIP: None

A headless Service does not get a ClusterIP. DNS queries return the IP addresses of individual pods. This lets clients connect directly to specific brokers, which is essential because each broker holds different partitions.

The DNS name kafka-0.kafka.kafka-demo.svc.cluster.local resolves directly to the kafka-0 pod’s IP. With a regular Service, the DNS name would resolve to the Service’s ClusterIP, and kube-proxy would route to a random pod. That breaks Kafka’s client-to-broker protocol.

Both StatefulSets use volumeClaimTemplates for persistent storage:

volumeClaimTemplates:
  - metadata:
      name: kafka-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 2Gi

Each pod gets its own PersistentVolumeClaim. If kafka-0 restarts, it reattaches to the same PVC and recovers its data. In production, size the PVC based on retention and throughput. Kafka defaults to 7-day retention. At 1 MB/sec, that is about 600 GB. Use SSDs for latency-sensitive workloads.
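The sizing arithmetic above is worth making explicit: storage needed is throughput times retention window, before replication overhead (a replication factor of 3 triples it).

```python
# Back-of-the-envelope PVC sizing: throughput x retention window.
throughput_mb_s = 1          # sustained write rate in MB/s
retention_days = 7           # Kafka's default retention
needed_mb = throughput_mb_s * retention_days * 24 * 60 * 60
print(f"{needed_mb / 1000:.0f} GB")  # ~605 GB, i.e. about 600 GB
```

Leave headroom on top of this figure: retention is enforced per segment, so disk usage briefly overshoots the configured window before old segments are deleted.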

Kafka uses the OS page cache heavily. More memory means more messages served from cache. This demo allocates 512Mi-1Gi. In production, give brokers 6-8 GB. Set the JVM heap to one-third of available memory and leave the rest for the page cache.

Kafka is often compared to RabbitMQ and Redis Streams. They solve related but different problems.

RabbitMQ is a traditional message broker using exchanges and queues. It deletes messages after consumer acknowledgment, pushes messages to consumers (Kafka consumers pull), and does not support replay. Choose RabbitMQ for complex routing patterns, strict process-once semantics, or moderate throughput (thousands of messages per second).

Redis Streams provide a lightweight append-only log in memory. Lower throughput than Kafka (tens of thousands vs millions per second), simpler to operate, and good enough when you already run Redis and your volume is low. Redis Streams lack Kafka’s mature consumer group rebalancing and configurable retention.

Kafka excels at high throughput (100k+ messages per second), durable storage with configurable retention, multi-consumer fan-out, message replay, and ordering guarantees within partitions.

Managing Kafka by hand works for learning. In production, the Strimzi operator automates lifecycle management with CRDs for clusters, topics, users, and connectors:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

Strimzi handles rolling upgrades, certificate management, and rack-aware replication. It is the recommended approach for Kafka on OpenShift.

Kafka’s durable, ordered log enables two advanced architectural patterns.

Traditional systems store current state. Event sourcing stores every state change as an event: OrderCreated, PaymentReceived, OrderShipped. The current state is derived by replaying all events. Kafka is a natural fit because its log is append-only, ordered, and durable. The events topic in this demo is a simple version of this pattern.

Benefits: complete audit trail, ability to rebuild state by replay, temporal queries, and decoupled consumers reacting independently.

Trade-offs: increased storage, complex schema evolution, and eventual consistency between the event log and derived state.
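Deriving state by replay is just a fold over the log. A minimal sketch, using the event names from the example above (the dict shapes are invented for illustration):

```python
def replay(events):
    """Rebuild current state by folding over the full event log."""
    state = {}
    for event in events:
        order = state.setdefault(event["order"], {"status": "new"})
        if event["type"] == "OrderCreated":
            order["status"] = "created"
        elif event["type"] == "PaymentReceived":
            order["status"] = "paid"
        elif event["type"] == "OrderShipped":
            order["status"] = "shipped"
    return state

log = [
    {"order": "order-123", "type": "OrderCreated"},
    {"order": "order-123", "type": "PaymentReceived"},
    {"order": "order-123", "type": "OrderShipped"},
]
state = replay(log)
# Replaying a prefix of the log answers temporal queries
# ("what was the order's status before it shipped?"):
state_then = replay(log[:2])
```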

CQRS (Command Query Responsibility Segregation)


CQRS separates the write model from the read model. The write side publishes events to Kafka. The read side consumes those events and builds optimized read models: denormalized tables, search indexes, materialized views.

Example: the order service writes OrderCreated events. A reporting consumer builds a daily summary. A search consumer updates Elasticsearch. A notification consumer sends emails. Each builds its own view.

CQRS works when read and write patterns differ significantly. It adds complexity, so it is not appropriate for simple CRUD applications.
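The read side of CQRS is several independent folds over the same stream. This sketch builds two of the views mentioned above; the event fields and view shapes are illustrative.

```python
# One event stream, two independent read models.
events = [
    {"type": "OrderCreated", "order": "o1", "total": 40, "day": "2024-05-01"},
    {"type": "OrderCreated", "order": "o2", "total": 60, "day": "2024-05-01"},
]

def daily_summary(events):
    """Reporting consumer: revenue per day."""
    summary = {}
    for e in events:
        if e["type"] == "OrderCreated":
            summary[e["day"]] = summary.get(e["day"], 0) + e["total"]
    return summary

def search_index(events):
    """Search consumer: orders keyed by ID for fast lookup."""
    return {e["order"]: e for e in events if e["type"] == "OrderCreated"}

reporting_view = daily_summary(events)
search_view = search_index(events)
```

Each consumer tracks its own offset, so a new view can be added later and bootstrapped by replaying the topic from the beginning.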

The most important thing this demo illustrates is decoupling. Stop the consumer. The producer keeps sending. Messages accumulate in Kafka. Restart the consumer. It catches up.

kubectl scale deployment consumer --replicas=0 -n kafka-demo
# Wait 30 seconds
kubectl scale deployment consumer --replicas=1 -n kafka-demo
kubectl logs -f deploy/consumer -n kafka-demo

The producer did not fail. It did not slow down. It did not even notice. Kafka absorbed the difference. This is the core value proposition of event-driven architecture.

In a synchronous system, the consumer going offline causes the producer to fail (or at least timeout). In an event-driven system, the broker provides temporal decoupling. Services can operate on different schedules, at different speeds, with independent uptime.

  • Microservices Platform covers the service decomposition patterns that produce the events Kafka carries.
  • API Gateway shows how Kong handles synchronous north-south traffic, the complement to Kafka’s asynchronous east-west messaging.