OpenTelemetry & Distributed Tracing: Deep Dive
Distributed tracing shows you the complete journey of a request across multiple services. This document explains how it works, why it matters, and how OpenTelemetry makes it possible.
The Three Pillars of Observability
Modern systems are observed through three complementary lenses:
Metrics: What is happening right now? CPU at 80%, response time p99 is 500ms, error rate is 2%. Metrics give you aggregated numbers over time. They tell you WHAT is wrong but not WHY. You built this in demo 22 - Prometheus & Grafana.
Logs: Discrete events. “User 1234 logged in”, “Payment failed: insufficient funds”, “Database query took 200ms”. Logs tell you WHERE something happened, but correlating logs across services is manual work. You built this in demo 35 - EFK Logging.
Traces: The full request journey. A single trace shows every service the request touched, every database query, every external API call, with exact timing. Traces tell you WHY something is slow or broken.
Example: metrics show 500ms p99 latency. Logs show payment-service called inventory-service. But a trace shows:
```
frontend (5ms) -> cart-service (10ms) -> inventory-service (200ms) -> postgres query "SELECT * FROM inventory WHERE..." (195ms)
```

Now you know the exact query that is slow. Metrics and logs together could not give you this.
What is a Trace?
A trace is a tree of spans. Each span represents one unit of work: an HTTP request, a database query, a cache lookup.
```yaml
trace_id: 1234567890abcdef
spans:
  - span_id: aaa
    parent_id: null
    name: GET /checkout
    service: frontend
    start: 10:00:00.000
    duration: 215ms
  - span_id: bbb
    parent_id: aaa
    name: POST /cart
    service: cart-service
    start: 10:00:00.005
    duration: 210ms
  - span_id: ccc
    parent_id: bbb
    name: SELECT * FROM inventory
    service: inventory-service
    start: 10:00:00.010
    duration: 200ms
```

Every span in the tree shares the same trace_id. The root span has parent_id: null. Child spans reference their parent via parent_id. The collector reassembles these into a tree visualization.
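The reassembly step can be sketched in a few lines: index spans by span_id, then attach each span to its parent's children list. This is a conceptual illustration using the spans above, not the Jaeger data model.

```python
# Conceptual sketch: rebuild the span tree from a flat list, keyed by parent_id.
# Field names mirror the example above; a real backend stores much more per span.

spans = [
    {"span_id": "aaa", "parent_id": None, "name": "GET /checkout", "duration_ms": 215},
    {"span_id": "bbb", "parent_id": "aaa", "name": "POST /cart", "duration_ms": 210},
    {"span_id": "ccc", "parent_id": "bbb", "name": "SELECT * FROM inventory", "duration_ms": 200},
]

def build_tree(spans):
    """Index spans by id, then attach each span to its parent's children list."""
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    root = None
    for node in by_id.values():
        if node["parent_id"] is None:
            root = node  # the root span has no parent
        else:
            by_id[node["parent_id"]]["children"].append(node)
    return root

root = build_tree(spans)
print(root["name"])                                # GET /checkout
print(root["children"][0]["children"][0]["name"])  # SELECT * FROM inventory
```

Walking this tree depth-first, comparing each child's duration to its parent's, is exactly how a trace view surfaces "the 200ms of this 215ms request is one SQL query".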
Trace Context Propagation
How does the trace context travel across services? Via HTTP headers.
The W3C Trace Context standard defines the traceparent header:
```
traceparent: 00-1234567890abcdef-aaa000000000001-01
             ^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^
            version   trace-id    parent-span-id flags
```

(The ids here are shortened for readability; the W3C spec uses a 32-hex-character trace-id and a 16-hex-character span-id.) When frontend calls cart-service, it sends this header:
```
GET /cart HTTP/1.1
Host: cart-service
traceparent: 00-1234567890abcdef-aaa000000000001-01
```

Cart-service extracts the trace-id, creates a new span with a new span-id (bbb), sets parent-id to aaa, and sends the updated header to inventory-service:
```
POST /inventory HTTP/1.1
Host: inventory-service
traceparent: 00-1234567890abcdef-bbb000000000002-01
```

This continues across every service boundary. The trace-id stays the same. Each service creates a new span-id and updates the parent-id.
Without automatic propagation, you would need to manually log correlation IDs and grep across service logs. Tracing automates this.
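The propagation step itself is mechanical, which is why SDKs can automate it. A minimal sketch (using the shortened ids from the example above, not the OTEL SDK's propagator API):

```python
# Conceptual sketch of W3C traceparent propagation: parse the incoming
# header, then emit a new header carrying this service's own span-id.

def parse_traceparent(header):
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_span_id, "flags": flags}

def inject_traceparent(ctx, new_span_id):
    """Build the outgoing header: same trace-id, this service's span-id."""
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-1234567890abcdef-aaa000000000001-01"
ctx = parse_traceparent(incoming)          # cart-service reads frontend's header
outgoing = inject_traceparent(ctx, "bbb000000000002")
print(outgoing)  # 00-1234567890abcdef-bbb000000000002-01
```

Every hop repeats the same parse-then-inject dance, which is what keeps the trace-id constant while the span-id advances.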
OpenTelemetry: The Standard
Before OpenTelemetry, every tracing vendor had its own SDK. If you used Jaeger, you imported Jaeger libraries. If you switched to Zipkin, you rewrote your instrumentation.
OpenTelemetry (OTEL) is a CNCF standard that separates instrumentation from backends. You instrument your app once with OTEL SDKs. You can send the data to Jaeger, Zipkin, Datadog, Honeycomb, or any OTLP-compatible backend.
OTEL components:
- SDKs: Libraries for Go, Python, Java, Node.js, etc. You import the SDK, create spans, and the SDK handles propagation.
- Collector: Optional component that receives traces from apps, batches them, filters them, and forwards to backends. Useful for sampling, security, and multi-backend routing.
- OTLP (OpenTelemetry Protocol): Wire format for sending traces, metrics, and logs. Supports gRPC (port 4317) and HTTP (port 4318).
In this demo, HotROD uses the OTEL SDK and sends traces directly to Jaeger via OTLP HTTP (port 4318). No collector in the middle because this is a learning demo.
Jaeger Architecture
Jaeger has four components:
Agent (deprecated in favor of OTLP): formerly ran on every node and forwarded traces to the collector.
Collector: Receives traces via OTLP, validates them, and writes to storage.
Storage: Elasticsearch, Cassandra, or in-memory. This demo uses in-memory for simplicity. Production uses Elasticsearch or Cassandra for persistence and search.
Query: Provides the web UI and API for searching traces.
The jaegertracing/all-in-one image bundles all four into one container. Production runs them separately for scaling and reliability.
```yaml
env:
  - name: COLLECTOR_OTLP_ENABLED
    value: "true"
```

This enables the OTLP receiver. Without it, Jaeger only accepts the legacy Jaeger Thrift protocol.
How HotROD Works
HotROD (Rides On Demand) is a demo app that simulates a ride-sharing platform. It has four microservices: frontend, customer, driver, route. All four run in a single binary for simplicity.
Each service is instrumented with the OpenTelemetry Go SDK. When you click a customer button:
- Frontend creates a root span: GET /dispatch?customer=rachel
- Frontend calls customer service: GET /customer?id=rachel. The OTEL SDK injects the traceparent header automatically.
- Customer service extracts the trace context, creates a child span, and calls driver service: GET /drivers?location=...
- Driver service creates parallel child spans to query multiple routes.
All spans are sent to Jaeger via OTLP. The SDK batches them for efficiency.
```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://jaeger:4318"
```

This tells the SDK where to send traces. The SDK handles retries, batching, and error handling.
When to Use Tracing
Tracing is essential for:
- Microservices architectures. You cannot debug cross-service latency without traces.
- Distributed systems with async messaging (Kafka, RabbitMQ). Traces show the full message flow.
- Performance optimization. Traces show which service or query is slow.
- Root cause analysis. When something breaks, traces show exactly where.
Tracing is overkill for:
- Monolithic apps. Use profiling instead.
- Simple request-response APIs with no downstream calls. Metrics and logs are enough.
- High-throughput systems where sampling overhead matters. Use tail-based sampling (see below).
Sampling Strategies
Sampling decides which requests to trace. Tracing 100% of traffic is expensive at scale.
Always sample (head-based, 100%): Trace every request. Good for development and debugging; it is what this demo uses.
Probabilistic (head-based, e.g., 1%): Trace 1% of requests randomly. Decision is made at the root span. Good for production with high traffic.
Tail-based: Decide AFTER the request completes. Trace all errors and slow requests, sample a percentage of normal requests. Requires a collector to buffer and decide. Most powerful but most complex.
Example tail-based rule: trace all requests with errors=true, trace all requests slower than 1s, trace 1% of everything else.
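That tail-based rule is simple enough to sketch directly. This is an illustrative decision function, not a collector policy config; the trace field names (`error`, `duration_ms`) are assumptions for the example:

```python
import random

# Conceptual sketch of the two sampling styles described above.

def head_sample(rate=0.01):
    """Head-based: decided at the root span, before anything has happened."""
    return random.random() < rate

def tail_sample(trace, rate=0.01):
    """Tail-based: decided after the request completes, with full context."""
    if trace["error"]:               # keep every trace with an error
        return True
    if trace["duration_ms"] > 1000:  # keep every request slower than 1s
        return True
    return random.random() < rate    # sample a percentage of everything else

print(tail_sample({"error": True, "duration_ms": 50}))     # True: errors always kept
print(tail_sample({"error": False, "duration_ms": 2500}))  # True: slower than 1s
```

The asymmetry is the point: `head_sample` must guess with no information, while `tail_sample` sees the outcome, which is why tail-based sampling needs a collector to buffer every span until the trace finishes.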
Span Attributes and Events
Spans carry metadata:
Tags (attributes): Key-value pairs. http.method=GET, http.status_code=200, db.system=postgres, error=true. Tags are indexed for search.
Logs (events): Timestamped messages within a span. Example: “cache miss”, “retrying request”, “received response”. Events show what happened during the span.
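The distinction is easiest to see on a data structure: attributes are a key-value map on the span, events are an append-only list of timestamped messages. A minimal illustrative span object (not the OTEL SDK API, whose spans carry far more state):

```python
import time

# Conceptual sketch of a span carrying tags (attributes) and events.

class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}  # indexed key-value tags, e.g. http.method=GET
        self.events = []      # timestamped messages within the span

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, message):
        self.events.append((time.time(), message))

span = Span("GET /checkout")
span.set_attribute("http.method", "GET")
span.set_attribute("http.status_code", 200)
span.add_event("cache miss")
span.add_event("retrying request")
print(span.attributes["http.status_code"])  # 200
print([msg for _, msg in span.events])      # ['cache miss', 'retrying request']
```

Attributes answer "what kind of operation was this?" and are searchable; events answer "what happened while it ran?" and are only visible once you open the span.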
In the Jaeger UI, click a span to see its tags and events. This is how you debug why a request failed or was slow.
Production Considerations
Security: The all-in-one Jaeger has no authentication. Production deployments enable OIDC or mTLS on the UI and collector.
Storage: In-memory storage is lost on restart. Elasticsearch or Cassandra is required for production.
Collector: Run the OTEL Collector as a separate deployment to handle batching, sampling, and multi-backend routing. Apps send to the collector. The collector sends to Jaeger, Prometheus, S3, etc.
Sampling: Start with 1% probabilistic sampling. Add tail-based sampling for errors and slow requests.
Cardinality: Avoid high-cardinality tags like user IDs or request IDs. They explode the index size. Use them as logs, not tags.
Retention: Traces are large. 7-day retention is common. Older traces are archived or deleted.
Integrating Traces with Metrics and Logs
Traces work best when linked to metrics and logs.
Exemplars: Prometheus can attach trace IDs to metric samples. When you see a spike in latency, click the exemplar to jump directly to the trace.
Logs with trace IDs: Log the trace ID in every log line. Example: [trace_id=1234567890abcdef] Payment failed. Now you can search logs by trace ID or search traces by log content.
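One common way to get the trace ID into every log line is a logging filter that stamps each record before formatting. A sketch using Python's stdlib logging; the hard-coded trace ID is illustrative, since in a real app it would come from the active span's context:

```python
import logging

# Conceptual sketch: inject a trace id into every log line so logs and
# traces can be cross-referenced by a shared identifier.

class TraceIdFilter(logging.Filter):
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id  # expose trace_id to the formatter
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[trace_id=%(trace_id)s] %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("1234567890abcdef"))

logger.warning("Payment failed")  # [trace_id=1234567890abcdef] Payment failed
```

With this in place, a trace ID found in Jaeger can be pasted into the log search (or vice versa), which is the manual correlation step that exemplars automate.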
Dashboards: Grafana can show Jaeger traces inline with Prometheus metrics. You see the latency graph and click a spike to see example traces.
This is “observability”: metrics, logs, and traces all linked together.
Real-World Value
Imagine a payment API that is slow. Metrics show p99 latency is 2 seconds. Logs show “payment-service took 1.8s”. But why?
A trace shows:
```
payment-service (1.8s) -> fraud-check (1.5s) -> external-api call to fraud.example.com (1.4s)
```

Now you know the external fraud API is the bottleneck. You can cache fraud responses, increase the timeout, or switch providers. Metrics and logs alone would take hours of manual correlation to figure this out.
Another example: a canary deployment is failing. Traces show the canary version calls a new database index that does not exist yet. The old version calls a different query. Without traces, you would need to read code and correlate logs manually.
OpenTelemetry Collector Deep Dive
The collector is optional but powerful. It sits between apps and backends:
```
Apps --> OTLP --> Collector --> Jaeger, Prometheus, S3, etc.
```

Why use a collector?
- Batching: Apps send individual spans. The collector batches them for efficiency.
- Sampling: Apps send 100%. The collector samples to 1% before forwarding.
- Multi-backend: Send traces to Jaeger AND Datadog without changing app config.
- Security: Apps send to the collector on localhost. The collector forwards with authentication.
- Filtering: Drop traces from health checks or other noise.
Example collector config:
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  probabilistic_sampler:
    sampling_percentage: 1.0

exporters:
  otlp:
    endpoint: jaeger:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlp]
```

This receives OTLP on port 4318, batches spans, samples 1%, and forwards to Jaeger on port 4317.
Semantic Conventions
OpenTelemetry defines semantic conventions for span names and attributes. This ensures consistency across languages and libraries.
Examples:
- HTTP spans: http.method, http.status_code, http.url
- Database spans: db.system=postgresql, db.statement=SELECT * FROM users
- Messaging spans: messaging.system=kafka, messaging.destination=orders-topic
Follow the conventions. They enable cross-service correlation and vendor portability.
Key Takeaways
Distributed tracing is the third pillar of observability, alongside metrics and logs. It shows the full request journey across services, making it essential for debugging microservices.
OpenTelemetry standardizes instrumentation. You instrument once with OTEL SDKs and can switch backends without code changes.
Jaeger provides storage, search, and visualization for traces. The all-in-one image is great for learning. Production uses separate collector, storage, and query components.
Traces answer questions that metrics and logs cannot: “Which service is slow? Which database query? Which external API?” This makes tracing invaluable for performance optimization and root cause analysis.
Combine traces with metrics (exemplars) and logs (trace IDs in log lines) for full observability. The three pillars together give you complete visibility into your system.