OpenTelemetry & Distributed Tracing: Deep Dive
Distributed tracing shows you the complete journey of a request across multiple services. This document explains how it works, why it matters, and how OpenTelemetry makes it possible.
The Three Pillars of Observability
Modern systems are observed through three complementary lenses:
Metrics: What is happening right now? CPU at 80%, response time p99 is 500ms, error rate is 2%. Metrics give you aggregated numbers over time. They tell you WHAT is wrong but not WHY. You built this in demo 22 - Prometheus & Grafana.
Logs: Discrete events. “User 1234 logged in”, “Payment failed: insufficient funds”, “Database query took 200ms”. Logs tell you WHERE something happened, but correlating logs across services is manual work. You built this in demo 35 - EFK Logging.
Traces: The full request journey. A single trace shows every service the request touched, every database query, every external API call, with exact timing. Traces tell you WHY something is slow or broken.
Example: metrics show 500ms p99 latency. Logs show payment-service called inventory-service. But a trace shows:
```
frontend (5ms) -> cart-service (10ms) -> inventory-service (200ms) -> postgres query "SELECT * FROM inventory WHERE..." (195ms)
```

Now you know the exact query that is slow. Metrics and logs together could not give you this.
What is a Trace?
A trace is a tree of spans. Each span represents one unit of work: an HTTP request, a database query, a cache lookup.
```yaml
trace_id: 1234567890abcdef
spans:
  - span_id: aaa
    parent_id: null
    name: GET /checkout
    service: frontend
    start: 10:00:00.000
    duration: 215ms
  - span_id: bbb
    parent_id: aaa
    name: POST /cart
    service: cart-service
    start: 10:00:00.005
    duration: 210ms
  - span_id: ccc
    parent_id: bbb
    name: SELECT * FROM inventory
    service: inventory-service
    start: 10:00:00.010
    duration: 200ms
```

Every span in the tree shares the same trace_id. The root span has parent_id: null. Child spans reference their parent via parent_id. The collector reassembles these into a tree visualization.
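The reassembly step can be sketched in a few lines: index spans by span_id, then attach each span to its parent's children list. This is a conceptual illustration using the spans above, not the Jaeger data model.

```python
# Conceptual sketch: rebuild the span tree from a flat list, keyed by parent_id.
# Field names mirror the example above; a real backend stores much more per span.

spans = [
    {"span_id": "aaa", "parent_id": None, "name": "GET /checkout", "duration_ms": 215},
    {"span_id": "bbb", "parent_id": "aaa", "name": "POST /cart", "duration_ms": 210},
    {"span_id": "ccc", "parent_id": "bbb", "name": "SELECT * FROM inventory", "duration_ms": 200},
]

def build_tree(spans):
    """Index spans by id, then attach each span to its parent's children list."""
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    root = None
    for node in by_id.values():
        if node["parent_id"] is None:
            root = node  # the root span has no parent
        else:
            by_id[node["parent_id"]]["children"].append(node)
    return root

root = build_tree(spans)
print(root["name"])                                # GET /checkout
print(root["children"][0]["children"][0]["name"])  # SELECT * FROM inventory
```

Walking this tree depth-first, comparing each child's duration to its parent's, is exactly how a trace view surfaces "the 200ms of this 215ms request is one SQL query".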
Trace Context Propagation
How does the trace context travel across services? Via HTTP headers.
The W3C Trace Context standard defines the traceparent header:
```
traceparent: 00-1234567890abcdef-aaa000000000001-01
             ^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^
            version   trace-id    parent-span-id flags
```

(The ids here are shortened for readability; the W3C spec uses a 32-hex-character trace-id and a 16-hex-character span-id.) When frontend calls cart-service, it sends this header:
```
GET /cart HTTP/1.1
Host: cart-service
traceparent: 00-1234567890abcdef-aaa000000000001-01
```

Cart-service extracts the trace-id, creates a new span with a new span-id (bbb), sets parent-id to aaa, and sends the updated header to inventory-service:
```
POST /inventory HTTP/1.1
Host: inventory-service
traceparent: 00-1234567890abcdef-bbb000000000002-01
```

This continues across every service boundary. The trace-id stays the same. Each service creates a new span-id and updates the parent-id.
Without automatic propagation, you would need to manually log correlation IDs and grep across service logs. Tracing automates this.
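The propagation step itself is mechanical, which is why SDKs can automate it. A minimal sketch (using the shortened ids from the example above, not the OTEL SDK's propagator API):

```python
# Conceptual sketch of W3C traceparent propagation: parse the incoming
# header, then emit a new header carrying this service's own span-id.

def parse_traceparent(header):
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_span_id, "flags": flags}

def inject_traceparent(ctx, new_span_id):
    """Build the outgoing header: same trace-id, this service's span-id."""
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-1234567890abcdef-aaa000000000001-01"
ctx = parse_traceparent(incoming)          # cart-service reads frontend's header
outgoing = inject_traceparent(ctx, "bbb000000000002")
print(outgoing)  # 00-1234567890abcdef-bbb000000000002-01
```

Every hop repeats the same parse-then-inject dance, which is what keeps the trace-id constant while the span-id advances.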
OpenTelemetry: The Standard
Before OpenTelemetry, every tracing vendor had its own SDK. If you used Jaeger, you imported Jaeger libraries. If you switched to Zipkin, you rewrote your instrumentation.
OpenTelemetry (OTEL) is a CNCF standard that separates instrumentation from backends. You instrument your app once with OTEL SDKs. You can send the data to Jaeger, Zipkin, Datadog, Honeycomb, or any OTLP-compatible backend.
OTEL components:
- SDKs: Libraries for Go, Python, Java, Node.js, etc. You import the SDK, create spans, and the SDK handles propagation.
- Collector: Optional component that receives traces from apps, batches them, filters them, and forwards to backends. Useful for sampling, security, and multi-backend routing.
- OTLP (OpenTelemetry Protocol): Wire format for sending traces, metrics, and logs. Supports gRPC (port 4317) and HTTP (port 4318).
In this demo, HotROD uses the OTEL SDK and sends traces directly to Jaeger via OTLP HTTP (port 4318). No collector in the middle because this is a learning demo.
Jaeger Architecture
Jaeger has four components:
Agent (deprecated in favor of OTLP): formerly ran on every node and forwarded traces to the collector.
Collector: Receives traces via OTLP, validates them, and writes to storage.
Storage: Elasticsearch, Cassandra, or in-memory. This demo uses in-memory for simplicity. Production uses Elasticsearch or Cassandra for persistence and search.
Query: Provides the web UI and API for searching traces.
The jaegertracing/all-in-one image bundles all four into one container. Production runs them separately for scaling and reliability.
```yaml
env:
  - name: COLLECTOR_OTLP_ENABLED
    value: "true"
```

This enables the OTLP receiver. Without it, Jaeger only accepts the legacy Jaeger Thrift protocol.
How HotROD Works
HotROD (Rides On Demand) is a demo app that simulates a ride-sharing platform. It has four microservices: frontend, customer, driver, route. All four run in a single binary for simplicity.
Each service is instrumented with the OpenTelemetry Go SDK. When you click a customer button:
- Frontend creates a root span: GET /dispatch?customer=rachel
- Frontend calls customer service: GET /customer?id=rachel. The OTEL SDK injects the traceparent header automatically.
- Customer service extracts the trace context, creates a child span, and calls driver service: GET /drivers?location=...
- Driver service creates parallel child spans to query multiple routes.
All spans are sent to Jaeger via OTLP. The SDK batches them for efficiency.
```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://jaeger:4318"
```

This tells the SDK where to send traces. The SDK handles retries, batching, and error handling.
When to Use Tracing
Tracing is essential for:
- Microservices architectures. You cannot debug cross-service latency without traces.
- Distributed systems with async messaging (Kafka, RabbitMQ). Traces show the full message flow.
- Performance optimization. Traces show which service or query is slow.
- Root cause analysis. When something breaks, traces show exactly where.
Tracing is overkill for:
- Monolithic apps. Use profiling instead.
- Simple request-response APIs with no downstream calls. Metrics and logs are enough.
- High-throughput systems where sampling overhead matters. Use tail-based sampling (see below).
Sampling Strategies
Sampling decides which requests to trace. Tracing 100% of traffic is expensive at scale.
Always sample (head-based, 100%): Trace every request. Good for development and debugging; it is what this demo uses.
Probabilistic (head-based, e.g., 1%): Trace 1% of requests randomly. Decision is made at the root span. Good for production with high traffic.
Tail-based: Decide AFTER the request completes. Trace all errors and slow requests, sample a percentage of normal requests. Requires a collector to buffer and decide. Most powerful but most complex.
Example tail-based rule: trace all requests with errors=true, trace all requests slower than 1s, trace 1% of everything else.
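That tail-based rule is simple enough to sketch directly. This is an illustrative decision function, not a collector policy config; the trace field names (`error`, `duration_ms`) are assumptions for the example:

```python
import random

# Conceptual sketch of the two sampling styles described above.

def head_sample(rate=0.01):
    """Head-based: decided at the root span, before anything has happened."""
    return random.random() < rate

def tail_sample(trace, rate=0.01):
    """Tail-based: decided after the request completes, with full context."""
    if trace["error"]:               # keep every trace with an error
        return True
    if trace["duration_ms"] > 1000:  # keep every request slower than 1s
        return True
    return random.random() < rate    # sample a percentage of everything else

print(tail_sample({"error": True, "duration_ms": 50}))     # True: errors always kept
print(tail_sample({"error": False, "duration_ms": 2500}))  # True: slower than 1s
```

The asymmetry is the point: `head_sample` must guess with no information, while `tail_sample` sees the outcome, which is why tail-based sampling needs a collector to buffer every span until the trace finishes.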
Span Attributes and Events
Spans carry metadata:
Tags (attributes): Key-value pairs. http.method=GET, http.status_code=200, db.system=postgres, error=true. Tags are indexed for search.
Logs (events): Timestamped messages within a span. Example: “cache miss”, “retrying request”, “received response”. Events show what happened during the span.
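The distinction is easiest to see on a data structure: attributes are a key-value map on the span, events are an append-only list of timestamped messages. A minimal illustrative span object (not the OTEL SDK API, whose spans carry far more state):

```python
import time

# Conceptual sketch of a span carrying tags (attributes) and events.

class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}  # indexed key-value tags, e.g. http.method=GET
        self.events = []      # timestamped messages within the span

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, message):
        self.events.append((time.time(), message))

span = Span("GET /checkout")
span.set_attribute("http.method", "GET")
span.set_attribute("http.status_code", 200)
span.add_event("cache miss")
span.add_event("retrying request")
print(span.attributes["http.status_code"])  # 200
print([msg for _, msg in span.events])      # ['cache miss', 'retrying request']
```

Attributes answer "what kind of operation was this?" and are searchable; events answer "what happened while it ran?" and are only visible once you open the span.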
In the Jaeger UI, click a span to see its tags and events. This is how you debug why a request failed or was slow.
Production Considerations
Security: The all-in-one Jaeger has no authentication. Production deployments enable OIDC or mTLS on the UI and collector.
Storage: In-memory storage is lost on restart. Elasticsearch or Cassandra is required for production.
Collector: Run the OTEL Collector as a separate deployment to handle batching, sampling, and multi-backend routing. Apps send to the collector. The collector sends to Jaeger, Prometheus, S3, etc.
Sampling: Start with 1% probabilistic sampling. Add tail-based sampling for errors and slow requests.
Cardinality: Avoid high-cardinality tags like user IDs or request IDs. They explode the index size. Use them as logs, not tags.
Retention: Traces are large. 7-day retention is common. Older traces are archived or deleted.
Integrating Traces with Metrics and Logs
Traces work best when linked to metrics and logs.
Exemplars: Prometheus can attach trace IDs to metric samples. When you see a spike in latency, click the exemplar to jump directly to the trace.
Logs with trace IDs: Log the trace ID in every log line. Example: [trace_id=1234567890abcdef] Payment failed. Now you can search logs by trace ID or search traces by log content.
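One common way to get the trace ID into every log line is a logging filter that stamps each record before formatting. A sketch using Python's stdlib logging; the hard-coded trace ID is illustrative, since in a real app it would come from the active span's context:

```python
import logging

# Conceptual sketch: inject a trace id into every log line so logs and
# traces can be cross-referenced by a shared identifier.

class TraceIdFilter(logging.Filter):
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id  # expose trace_id to the formatter
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[trace_id=%(trace_id)s] %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("1234567890abcdef"))

logger.warning("Payment failed")  # [trace_id=1234567890abcdef] Payment failed
```

With this in place, a trace ID found in Jaeger can be pasted into the log search (or vice versa), which is the manual correlation step that exemplars automate.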
Dashboards: Grafana can show Jaeger traces inline with Prometheus metrics. You see the latency graph and click a spike to see example traces.
This is “observability”: metrics, logs, and traces all linked together.
Real-World Value
Imagine a payment API that is slow. Metrics show p99 latency is 2 seconds. Logs show “payment-service took 1.8s”. But why?
A trace shows:
```
payment-service (1.8s) -> fraud-check (1.5s) -> external-api call to fraud.example.com (1.4s)
```

Now you know the external fraud API is the bottleneck. You can cache fraud responses, increase the timeout, or switch providers. Metrics and logs alone would take hours of manual correlation to figure this out.
Another example: a canary deployment is failing. Traces show the canary version calls a new database index that does not exist yet. The old version calls a different query. Without traces, you would need to read code and correlate logs manually.
OpenTelemetry Collector Deep Dive
The collector is optional but powerful. It sits between apps and backends:
```
Apps --> OTLP --> Collector --> Jaeger, Prometheus, S3, etc.
```

Why use a collector?
- Batching: Apps send individual spans. The collector batches them for efficiency.
- Sampling: Apps send 100%. The collector samples to 1% before forwarding.
- Multi-backend: Send traces to Jaeger AND Datadog without changing app config.
- Security: Apps send to the collector on localhost. The collector forwards with authentication.
- Filtering: Drop traces from health checks or other noise.
Example collector config:
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  probabilistic_sampler:
    sampling_percentage: 1.0

exporters:
  otlp:
    endpoint: jaeger:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlp]
```

This receives OTLP on port 4318, batches spans, samples 1%, and forwards to Jaeger on port 4317.
Semantic Conventions
OpenTelemetry defines semantic conventions for span names and attributes. This ensures consistency across languages and libraries.
Examples:
- HTTP spans: http.method, http.status_code, http.url
- Database spans: db.system=postgresql, db.statement=SELECT * FROM users
- Messaging spans: messaging.system=kafka, messaging.destination=orders-topic
Follow the conventions. They enable cross-service correlation and vendor portability.
Key Takeaways
Distributed tracing is the third pillar of observability, alongside metrics and logs. It shows the full request journey across services, making it essential for debugging microservices.
OpenTelemetry standardizes instrumentation. You instrument once with OTEL SDKs and can switch backends without code changes.
Jaeger provides storage, search, and visualization for traces. The all-in-one image is great for learning. Production uses separate collector, storage, and query components.
Traces answer questions that metrics and logs cannot: “Which service is slow? Which database query? Which external API?” This makes tracing invaluable for performance optimization and root cause analysis.
Combine traces with metrics (exemplars) and logs (trace IDs in log lines) for full observability. The three pillars together give you complete visibility into your system.