
Event-Driven Architecture (Kafka): Deep Dive

This demo runs Apache Kafka and Zookeeper as StatefulSets in Kubernetes. A producer sends timestamped messages to a Kafka topic every 5 seconds. A consumer reads those messages in real time. The producer and consumer do not know about each other. Kafka sits between them, handling storage, ordering, and delivery.

This document explains why event-driven architecture exists, how Kafka works internally, how it differs from traditional message queues, and what it takes to run Kafka reliably on Kubernetes.

In the previous demo, the backend serves API requests synchronously. A client sends a request. The backend responds. Done. This works until it does not.

Synchronous communication creates temporal coupling. Both parties must be online at the same time. If the backend is down, the request fails. If the backend is slow, the client waits.

Event-driven architecture removes that coupling. The producer publishes an event. The broker stores it. The consumer reads it when ready. If the consumer is offline, events accumulate in the broker. When the consumer comes back, it catches up.

This pattern is essential for:

  • Order processing pipelines where multiple systems react to a single event
  • Audit logging where events must be durable and ordered
  • Real-time analytics where high-volume data streams from many sources
  • Integration between systems owned by different teams

Kafka is a distributed commit log. It is not a traditional message queue. Understanding that distinction is key.

A Kafka cluster consists of one or more brokers. Each broker is a server that stores data and serves clients. This demo runs a single broker:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka-demo
spec:
  serviceName: kafka
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: docker.io/bitnami/kafka:3.6
          ports:
            - containerPort: 9092
              name: kafka
          env:
            - name: KAFKA_CFG_ZOOKEEPER_CONNECT
              value: zookeeper:2181
            - name: KAFKA_CFG_LISTENERS
              value: PLAINTEXT://:9092
            - name: KAFKA_CFG_ADVERTISED_LISTENERS
              value: PLAINTEXT://kafka-0.kafka.kafka-demo.svc.cluster.local:9092
            - name: KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE
              value: "true"
            - name: ALLOW_PLAINTEXT_LISTENER
              value: "yes"

In production, you run 3 or more brokers for fault tolerance. Data is replicated across brokers so that no single failure loses messages.

A topic is a named stream of events. This demo uses a topic called events. The producer writes to it. The consumer reads from it.

Topics are created automatically because KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE is set to true. In production, you usually create topics explicitly with a defined number of partitions and a replication factor:

kafka-topics.sh --create --topic orders \
  --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

A topic is divided into partitions. Each partition is an ordered, immutable sequence of messages. Partitions are the unit of parallelism in Kafka.

With one partition, messages are strictly ordered. With three partitions, Kafka distributes messages across them. Ordering is guaranteed within a partition but not across partitions.

Topic: events
Partition 0: [msg1, msg4, msg7, ...]
Partition 1: [msg2, msg5, msg8, ...]
Partition 2: [msg3, msg6, msg9, ...]

Each partition lives on one broker (with replicas on other brokers). Adding partitions allows more consumers to read in parallel, but you cannot reduce the number of partitions after creation.

Consumers belong to a group. Kafka assigns each partition to exactly one consumer in the group. If you have 3 partitions and 3 consumers in the same group, each consumer reads from one partition.

If a consumer crashes, Kafka reassigns its partition to another consumer in the group. This is called rebalancing.
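The one-partition-to-one-consumer invariant can be sketched as follows. This is a simplified round-robin assignment for illustration only; Kafka's real assignors (range, round-robin, sticky, cooperative-sticky) live in the client and are configurable.

```python
def assign_partitions(partitions, consumers):
    """Round-robin mapping of partitions to consumers (illustrative).

    Each partition goes to exactly one consumer in the group; a
    'rebalance' is just running the assignment again with the
    surviving consumers.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 3 consumers: one partition each.
before = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])
# c3 crashes; its partition is reassigned to a remaining consumer.
after = assign_partitions([0, 1, 2], ["c1", "c2"])
```

With more consumers than partitions, the extra consumers sit idle, which is why partition count caps parallelism.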

The consumer in this demo uses kafka-console-consumer.sh with --from-beginning, which reads all messages from the start of the topic. In production, consumers track their position using offsets and resume from where they left off.

Each message in a partition has a numeric offset. Offset 0 is the first message. Offset 1 is the second. Offsets are never reused.

Consumers commit their current offset to Kafka. If a consumer restarts, it reads from the last committed offset. This is how Kafka provides at-least-once delivery: the consumer processes a message, then commits its offset. If it crashes before committing, it reprocesses the message after restart.
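The process-then-commit loop and the resulting duplicate can be simulated without a real Kafka client. This sketch fakes a crash between processing and committing; the function names are hypothetical.

```python
def consume(messages, committed_offset, crash_before_commit_at=None):
    """Process-then-commit loop over a partition (simulation).

    If the consumer 'crashes' after processing a message but before
    committing its offset, a restart resumes from the last committed
    offset and that message is processed again: at-least-once delivery.
    """
    processed = []
    offset = committed_offset
    for i in range(committed_offset, len(messages)):
        processed.append(messages[i])      # 1. process the message
        if i == crash_before_commit_at:
            return processed, offset       # crash: offset never committed
        offset = i + 1                     # 2. commit the offset
    return processed, offset

msgs = ["m0", "m1", "m2"]
# First run crashes after processing m1 but before committing:
done, offset = consume(msgs, 0, crash_before_commit_at=1)
# Restart from the last committed offset: m1 is processed a second time.
done2, offset2 = consume(msgs, offset)
```

The overlap between the two runs ("m1" appears in both) is exactly the duplicate that idempotent consumers are designed to absorb.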

The producer sends a message every 5 seconds:

containers:
  - name: producer
    image: docker.io/bitnami/kafka:3.6
    command:
      - /bin/bash
      - -c
      - |
        echo "Waiting for Kafka to be ready..."
        sleep 30
        echo "Starting producer, sending to topic 'events'..."
        count=0
        while true; do
          count=$((count + 1))
          message="event-${count}: order-placed at $(date '+%Y-%m-%d %H:%M:%S')"
          echo "$message" | kafka-console-producer.sh \
            --broker-list kafka-0.kafka.kafka-demo.svc.cluster.local:9092 \
            --topic events
          echo "[producer] Sent: $message"
          sleep 5
        done

Two things to notice. First, the producer uses the StatefulSet DNS name kafka-0.kafka.kafka-demo.svc.cluster.local to reach the broker. This is stable across pod restarts. Second, the producer sleeps 30 seconds before starting to give Kafka time to initialize.

In production, producers care deeply about delivery guarantees.

acks=0. The producer sends the message and does not wait for acknowledgment. Fastest but messages can be lost. Use for metrics and logs where occasional loss is acceptable.

acks=1. The producer waits for the partition leader to acknowledge. If the leader crashes before replication, the message is lost. Good balance of speed and reliability.

acks=all. The producer waits for all in-sync replicas to acknowledge. Slowest but no data loss as long as at least one replica survives. Required for financial transactions, audit logs, and anything you cannot afford to lose.

Idempotent producer. Kafka supports idempotent writes. The producer assigns a sequence number to each message. If a network retry causes a duplicate, Kafka deduplicates it. Enable with enable.idempotence=true.

Exactly-once semantics. Kafka’s transactional API lets a producer write to multiple partitions atomically. Combined with idempotent writes, this provides exactly-once within Kafka. Crossing system boundaries still requires application-level idempotency.
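The idempotent-producer mechanism boils down to broker-side deduplication by sequence number. This toy model shows only the idea; real Kafka also tracks producer epochs and per-partition state, and the class and method names here are invented for illustration.

```python
class Broker:
    """Toy broker that deduplicates writes by (producer_id, sequence).

    An idempotent producer attaches a monotonically increasing sequence
    number to each message; if a network retry redelivers a message the
    broker has already seen, the broker drops it instead of appending.
    """
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence accepted

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate from a retry: drop silently
        self.last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = Broker()
broker.append("p1", 0, "order-placed")
broker.append("p1", 0, "order-placed")  # retry of the same send: deduplicated
broker.append("p1", 1, "order-shipped")
```

After these three sends the log holds exactly two messages, which is what `enable.idempotence=true` buys you without any application changes.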

The consumer reads from the events topic:

containers:
  - name: consumer
    image: docker.io/bitnami/kafka:3.6
    command:
      - /bin/bash
      - -c
      - |
        echo "Waiting for Kafka to be ready..."
        sleep 40
        echo "Starting consumer, reading from topic 'events'..."
        kafka-console-consumer.sh \
          --bootstrap-server kafka-0.kafka.kafka-demo.svc.cluster.local:9092 \
          --topic events \
          --from-beginning

The --from-beginning flag tells the consumer to start at offset 0. Without it, the consumer only reads new messages. The 40-second sleep ensures Kafka and the producer are running before the consumer starts.

At-most-once. Commit the offset before processing. If processing fails, the message is skipped. No duplicates, but messages can be lost.

At-least-once. Process the message, then commit the offset. If the consumer crashes after processing but before committing, it reprocesses the message. No lost messages, but duplicates are possible. This is Kafka’s default behavior.

Exactly-once. Use Kafka’s transactional consumer API. More complex and slower but eliminates both loss and duplication.

Most systems use at-least-once and design idempotent consumers. Processing the same order twice should produce the same result.
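An idempotent consumer can be sketched in a few lines: processing is keyed by a stable event ID, so a redelivered message becomes a no-op. The field names (`event_id`, `order`, `status`) are illustrative, not from the demo.

```python
def apply_event(state, seen_ids, event):
    """Apply an event at most once, keyed by its unique event ID."""
    if event["event_id"] in seen_ids:
        return  # at-least-once redelivery: already applied, skip
    seen_ids.add(event["event_id"])
    state[event["order"]] = event["status"]

state, seen = {}, set()
e = {"event_id": "evt-1", "order": "order-123", "status": "placed"}
apply_event(state, seen, e)
apply_event(state, seen, e)  # duplicate delivery changes nothing
```

In production the `seen_ids` set lives in a database or is replaced by a naturally idempotent write (an upsert keyed by event ID), so it survives consumer restarts.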

Ordering guarantees depend on partitioning. A single partition gives total ordering but limits throughput to one consumer per group. Multiple partitions allow parallel consumption but only guarantee order within each partition.

To maintain order for related events, use a message key. Kafka hashes the key to select a partition. All messages with the same key go to the same partition:

Key: order-123 -> hash -> Partition 1
Key: order-456 -> hash -> Partition 0
Key: order-123 -> hash -> Partition 1 (same partition)

This demo does not use keys. Messages distribute round-robin.
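The key-to-partition mapping can be demonstrated with any deterministic hash. Kafka's default partitioner uses murmur2; `crc32` here is a stand-in chosen only to show the property that matters: the same key always lands on the same partition.

```python
import zlib

def partition_for(key, num_partitions):
    """Hash a message key to a partition (crc32 as a murmur2 stand-in)."""
    return zlib.crc32(key.encode()) % num_partitions

p1 = partition_for("order-123", 3)
p2 = partition_for("order-123", 3)
assert p1 == p2  # same key, same partition: per-order ordering holds
```

Note the corollary: changing the partition count changes the modulus, so keys remap and per-key ordering is only guaranteed going forward, not across the resize.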

Zookeeper manages Kafka’s cluster metadata:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: kafka-demo
spec:
  serviceName: zookeeper
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: docker.io/bitnami/zookeeper:3.9
          ports:
            - containerPort: 2181
              name: client
          env:
            - name: ALLOW_ANONYMOUS_LOGIN
              value: "yes"

Zookeeper tracks which brokers are alive, which broker leads each partition, topic configuration, and access control lists.

Kafka connects to Zookeeper via the KAFKA_CFG_ZOOKEEPER_CONNECT environment variable:

env:
  - name: KAFKA_CFG_ZOOKEEPER_CONNECT
    value: zookeeper:2181

Starting with Kafka 3.3, Kafka can run in KRaft mode, where brokers manage their own metadata using an internal Raft consensus protocol. KRaft eliminates the operational burden of running Zookeeper separately: fewer moving parts, fewer failure modes, faster metadata operations.

This demo uses Zookeeper because it is simpler for learning. In production with Kafka 3.6+, KRaft is the recommended approach.

Running Kafka on Kubernetes requires careful attention to storage and networking. Both Kafka and Zookeeper need stable identities and persistent storage. That is why they run as StatefulSets.

A Deployment creates pods with random names. If a pod restarts, it gets a new name. Kafka needs stable names because the advertised listener address must be predictable:

- name: KAFKA_CFG_ADVERTISED_LISTENERS
  value: PLAINTEXT://kafka-0.kafka.kafka-demo.svc.cluster.local:9092

kafka-0 is the stable name that the StatefulSet guarantees. If the pod restarts, it comes back as kafka-0. Clients reconnect to the same address.

With a Deployment, the pod might restart as kafka-7f8b6c4d5-xq2pn. The advertised listener would be wrong, and clients could not reconnect.

Both StatefulSets use headless Services (clusterIP: None):

apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: kafka-demo
spec:
  selector:
    app: kafka
  ports:
    - port: 9092
      targetPort: 9092
      name: kafka
  clusterIP: None

A headless Service does not get a ClusterIP. DNS queries return the IP addresses of individual pods. This lets clients connect directly to specific brokers, which is essential because each broker holds different partitions.

The DNS name kafka-0.kafka.kafka-demo.svc.cluster.local resolves directly to the kafka-0 pod’s IP. With a regular Service, the DNS name would resolve to the Service’s ClusterIP, and kube-proxy would route to a random pod. That breaks Kafka’s client-to-broker protocol.

Both StatefulSets use volumeClaimTemplates for persistent storage:

volumeClaimTemplates:
  - metadata:
      name: kafka-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 2Gi

Each pod gets its own PersistentVolumeClaim. If kafka-0 restarts, it reattaches to the same PVC and recovers its data. In production, size the PVC based on retention and throughput. Kafka defaults to 7-day retention. At 1 MB/sec, that is about 600 GB. Use SSDs for latency-sensitive workloads.
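The sizing arithmetic above is worth making explicit: storage needed is throughput times retention window, before replication overhead (a replication factor of 3 triples it).

```python
# Back-of-the-envelope PVC sizing: throughput x retention window.
throughput_mb_s = 1          # sustained write rate in MB/s
retention_days = 7           # Kafka's default retention
needed_mb = throughput_mb_s * retention_days * 24 * 60 * 60
print(f"{needed_mb / 1000:.0f} GB")  # ~605 GB, i.e. about 600 GB
```

Leave headroom on top of this figure: retention is enforced per segment, so disk usage briefly overshoots the configured window before old segments are deleted.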

Kafka uses the OS page cache heavily. More memory means more messages served from cache. This demo allocates 512Mi-1Gi. In production, give brokers 6-8 GB. Set the JVM heap to one-third of available memory and leave the rest for the page cache.

Kafka is often compared to RabbitMQ and Redis Streams. They solve related but different problems.

RabbitMQ is a traditional message broker using exchanges and queues. It deletes messages after consumer acknowledgment, pushes messages to consumers (Kafka consumers pull), and does not support replay. Choose RabbitMQ for complex routing patterns, strict process-once semantics, or moderate throughput (thousands of messages per second).

Redis Streams provide a lightweight append-only log in memory. Lower throughput than Kafka (tens of thousands vs millions per second), simpler to operate, and good enough when you already run Redis and your volume is low. Redis Streams lack Kafka’s mature consumer group rebalancing and configurable retention.

Kafka excels at high throughput (100k+ messages per second), durable storage with configurable retention, multi-consumer fan-out, message replay, and ordering guarantees within partitions.

Managing Kafka by hand works for learning. In production, the Strimzi operator automates lifecycle management with CRDs for clusters, topics, users, and connectors:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

Strimzi handles rolling upgrades, certificate management, and rack-aware replication. It is the recommended approach for Kafka on OpenShift.

Kafka’s durable, ordered log enables two advanced architectural patterns.

Traditional systems store current state. Event sourcing stores every state change as an event: OrderCreated, PaymentReceived, OrderShipped. The current state is derived by replaying all events. Kafka is a natural fit because its log is append-only, ordered, and durable. The events topic in this demo is a simple version of this pattern.

Benefits: complete audit trail, ability to rebuild state by replay, temporal queries, and decoupled consumers reacting independently.

Trade-offs: increased storage, complex schema evolution, and eventual consistency between the event log and derived state.
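Deriving state by replay is just a fold over the log. A minimal sketch, using the event names from the example above (the dict shapes are invented for illustration):

```python
def replay(events):
    """Rebuild current state by folding over the full event log."""
    state = {}
    for event in events:
        order = state.setdefault(event["order"], {"status": "new"})
        if event["type"] == "OrderCreated":
            order["status"] = "created"
        elif event["type"] == "PaymentReceived":
            order["status"] = "paid"
        elif event["type"] == "OrderShipped":
            order["status"] = "shipped"
    return state

log = [
    {"order": "order-123", "type": "OrderCreated"},
    {"order": "order-123", "type": "PaymentReceived"},
    {"order": "order-123", "type": "OrderShipped"},
]
state = replay(log)
# Replaying a prefix of the log answers temporal queries
# ("what was the order's status before it shipped?"):
state_then = replay(log[:2])
```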

CQRS (Command Query Responsibility Segregation)


CQRS separates the write model from the read model. The write side publishes events to Kafka. The read side consumes those events and builds optimized read models: denormalized tables, search indexes, materialized views.

Example: the order service writes OrderCreated events. A reporting consumer builds a daily summary. A search consumer updates Elasticsearch. A notification consumer sends emails. Each builds its own view.

CQRS works when read and write patterns differ significantly. It adds complexity, so it is not appropriate for simple CRUD applications.
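The read side of CQRS is several independent folds over the same stream. This sketch builds two of the views mentioned above; the event fields and view shapes are illustrative.

```python
# One event stream, two independent read models.
events = [
    {"type": "OrderCreated", "order": "o1", "total": 40, "day": "2024-05-01"},
    {"type": "OrderCreated", "order": "o2", "total": 60, "day": "2024-05-01"},
]

def daily_summary(events):
    """Reporting consumer: revenue per day."""
    summary = {}
    for e in events:
        if e["type"] == "OrderCreated":
            summary[e["day"]] = summary.get(e["day"], 0) + e["total"]
    return summary

def search_index(events):
    """Search consumer: orders keyed by ID for fast lookup."""
    return {e["order"]: e for e in events if e["type"] == "OrderCreated"}

reporting_view = daily_summary(events)
search_view = search_index(events)
```

Each consumer tracks its own offset, so a new view can be added later and bootstrapped by replaying the topic from the beginning.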

The most important thing this demo illustrates is decoupling. Stop the consumer. The producer keeps sending. Messages accumulate in Kafka. Restart the consumer. It catches up.

kubectl scale deployment consumer --replicas=0 -n kafka-demo
# Wait 30 seconds
kubectl scale deployment consumer --replicas=1 -n kafka-demo
kubectl logs -f deploy/consumer -n kafka-demo

The producer did not fail. It did not slow down. It did not even notice. Kafka absorbed the difference. This is the core value proposition of event-driven architecture.

In a synchronous system, the consumer going offline causes the producer to fail (or at least timeout). In an event-driven system, the broker provides temporal decoupling. Services can operate on different schedules, at different speeds, with independent uptime.

  • Microservices Platform covers the service decomposition patterns that produce the events Kafka carries.
  • API Gateway shows how Kong handles synchronous north-south traffic, the complement to Kafka’s asynchronous east-west messaging.