Blog 17 — Uber: Distributed Tracing with Jaeger


Qubits of DPK

March 21, 2026

Core Case Studies
Core Concept: Distributed tracing, Correlation IDs, Jaeger/OpenTelemetry, Debugging failures across microservices
Why SDE-2 Critical: SDE-2 owns production — this topic separates seniors from juniors in interviews
Status: Draft notes ready

Quick Revision

  • Problem: One failing request touches many services and hides the root cause.
  • Core pattern: Trace IDs, spans, sampling, and a tracing backend like Jaeger.
  • Interview one-liner: Metrics tell you something is wrong; traces tell you where the request actually broke.

The Core Problem

User reports: "My Uber ride booking failed"

Microservices involved in one ride booking:
  API Gateway → Auth Service → Trip Service
  → Driver Service → Pricing Service → Maps Service
  → Payment Service → Notification Service

Which service failed? How long did each take?
Without tracing: check logs on 8 different servers — nightmare
With tracing: one dashboard shows entire request flow → 5 seconds to find bug

Core Concepts

Trace, Span, Correlation ID

Trace = entire journey of one request across all services
  Trace ID: "abc-123" (same for entire request)

Span = one unit of work within the trace
  Span 1: API Gateway (2ms)
  Span 2: Auth Service (5ms) [child of Span 1]
  Span 3: Trip Service (50ms) [child of Span 1]
  Span 4: Maps Service (45ms) [child of Span 3]
  Span 5: Payment Service (200ms) [child of Span 1] ← SLOW!

Correlation ID = Trace ID passed in every HTTP header:
  X-Trace-ID: abc-123
  Every service logs this ID → all logs linkable
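
As a concrete sketch: a minimal Express middleware that reuses an incoming X-Trace-ID or mints one at the entry point, logs it, and forwards it on downstream calls. The header name, route, and pricing-service URL are illustrative assumptions, not Uber's actual setup (Node 18+ assumed for the global fetch).

const express = require('express');
const crypto = require('crypto');

const app = express();

// Reuse the incoming correlation ID, or mint a new one at the entry point.
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  res.setHeader('X-Trace-ID', req.traceId); // echo it back so callers can report it
  next();
});

app.get('/trips', async (req, res) => {
  // Every log line carries the trace ID, so logs from all services become linkable.
  console.log(JSON.stringify({ traceId: req.traceId, msg: 'booking trip' }));

  // Propagate the same ID on every downstream call (hypothetical pricing-service URL).
  const pricing = await fetch('http://pricing-service/quote', {
    headers: { 'X-Trace-ID': req.traceId },
  });
  res.json({ quote: await pricing.json(), traceId: req.traceId });
});

app.listen(3000);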

How Jaeger Works

Service A handles request:
  1. Generate Trace ID (if first service)
     OR extract from incoming header (if downstream)
  2. Create Span with start timestamp
  3. Call Service B, pass Trace ID in header
  4. Service B creates child Span
  5. On completion: report Span to Jaeger Agent (local sidecar)

Jaeger Agent → batches spans → Jaeger Collector → Storage (Cassandra/ES)

Jaeger UI:
  Search by Trace ID → see waterfall diagram of all spans
  Immediately see: which service is slow, which failed
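
Roughly the same five steps in code, using the OpenTelemetry JavaScript API (which exports to Jaeger). The tracer name and Maps Service URL are placeholders, and a real service would also register the SDK and exporter as in the OpenTelemetry section below.

const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('trip-service'); // placeholder instrumentation name

async function handleBooking(incomingHeaders) {
  // 1. Extract trace context from the incoming headers (or start a new trace if absent).
  const parentCtx = propagation.extract(context.active(), incomingHeaders);

  // 2. Create a span; the start timestamp is recorded automatically.
  return tracer.startActiveSpan('handleBooking', {}, parentCtx, async (span) => {
    try {
      // 3. Call the downstream service, injecting the current trace context into its headers.
      const outgoingHeaders = {};
      propagation.inject(context.active(), outgoingHeaders);

      // 4. The downstream service (placeholder URL) creates a child span on its side.
      await fetch('http://maps-service/route', { headers: outgoingHeaders });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // 5. Ending the span hands it to whatever exporter/agent the SDK was configured with.
      span.end();
    }
  });
}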

Sampling Strategy

Problem: 1M requests/second, trace ALL of them?
  → Massive storage + overhead

Solution: Sampling
  Head-based sampling: sample 1% of requests randomly
    Pro: Low overhead
    Con: Miss rare bugs that happen in 0.1% of requests

  Tail-based sampling (Uber's approach):
    Collect ALL spans in buffer
    If request completes normally → discard
    If request is slow or errors → KEEP the trace
    Pro: Always capture interesting traces
    Con: Higher buffer memory
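
A toy sketch of the tail-based decision, not Uber's actual sampler: buffer finished spans per trace, then keep the trace only if something errored or ran slow. The span shape and the 500 ms threshold are assumptions for the example.

// Hypothetical in-memory buffer: traceId -> finished spans for that trace.
const buffers = new Map();

function onSpanFinished(span) {
  // Assumed span shape for this sketch: { traceId, name, durationMs, error }
  const spans = buffers.get(span.traceId) || [];
  spans.push(span);
  buffers.set(span.traceId, spans);
}

function onTraceCompleted(traceId, { slowThresholdMs = 500 } = {}) {
  const spans = buffers.get(traceId) || [];
  buffers.delete(traceId); // free the buffer either way

  const hasError = spans.some((s) => s.error);
  const isSlow = spans.some((s) => s.durationMs > slowThresholdMs);

  if (hasError || isSlow) {
    exportTrace(traceId, spans); // keep the interesting trace
  }
  // Normal, fast traces are simply dropped; buffer memory is the price of
  // never missing a slow or failing request.
}

function exportTrace(traceId, spans) {
  console.log(`keeping trace ${traceId} (${spans.length} spans)`);
}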

OpenTelemetry — The Standard

Problem: Every APM tool (Jaeger, Zipkin, Datadog, New Relic)
  has a different SDK → vendor lock-in

OpenTelemetry: Vendor-neutral standard for:
  ├── Traces (distributed tracing)
  ├── Metrics (counters, gauges, histograms)
  └── Logs (structured logging)

Instrument once with OTel SDK → export to any backend
Switch from Jaeger to Datadog without code changes
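
A minimal instrument-once setup with the OpenTelemetry Node SDK; the packages are the real OTel packages, but the service name and OTLP endpoint are placeholders. Pointing the exporter at Jaeger today or at a vendor later is a config change, not a code change.

// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
//             @opentelemetry/exporter-trace-otlp-http
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'trip-service', // placeholder
  // Point this at Jaeger's OTLP endpoint today, or at a vendor's endpoint later;
  // the application code never changes.
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-traces http, express, etc.
});

sdk.start();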

The Three Pillars of Observability

1. Metrics: What is happening? (aggregate)
   "Payment service has 5% error rate"
   Tools: Prometheus, Grafana

2. Logs: What happened in detail? (events)
   "payment_id=xyz failed with NullPointerException at line 234"
   Tools: ELK Stack (Elasticsearch + Logstash + Kibana)

3. Traces: Where did it happen? (causality)
   "Request abc-123 failed in Payment Service, called from Trip Service"
   Tools: Jaeger, Zipkin, Datadog APM

All three together = full observability
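
The pillars link up when every log line carries the active trace ID. A small sketch with the OpenTelemetry API; the logger shape is an assumption, and any structured logger works the same way.

const { trace } = require('@opentelemetry/api');

// Attach the current trace ID to every structured log record, so a log line found
// in Kibana can be pasted straight into Jaeger's trace search.
function logWithTrace(level, msg, fields = {}) {
  const spanContext = trace.getActiveSpan()?.spanContext();
  console.log(JSON.stringify({
    level,
    msg,
    trace_id: spanContext?.traceId, // links this log to the trace
    span_id: spanContext?.spanId,   // and to the exact span within it
    ...fields,
  }));
}

logWithTrace('error', 'payment failed', { payment_id: 'xyz' });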

5 Interview Questions This Blog Unlocks

Q1. A request is failing in your microservices system — how do you debug it?

Answer: Use distributed tracing (Jaeger/Zipkin). Find the trace ID from logs or error report. Pull up trace in Jaeger UI — waterfall view shows all service calls, durations, errors. Identify the failing span (red in UI). Drill into that service's logs using trace ID as correlation ID. Fix root cause.
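
If you want to script that lookup instead of clicking through the UI, the Jaeger query service serves the same data over HTTP: the /api/traces/{traceId} endpoint on port 16686 is what the UI itself calls (an internal, not officially stable, API). A rough sketch, with host and thresholds as placeholders:

// Fetch one trace by ID from the Jaeger query service and print its slow or failed spans.
// Jaeger reports span durations in microseconds.
async function inspectTrace(traceId) {
  const res = await fetch(`http://jaeger-query:16686/api/traces/${traceId}`);
  const { data } = await res.json(); // response shape: { data: [trace] }
  const spans = data[0]?.spans ?? [];

  for (const span of spans) {
    const failed = (span.tags || []).some((t) => t.key === 'error' && t.value === true);
    const slow = span.duration > 100_000; // > 100ms
    if (failed || slow) {
      console.log(`${span.operationName}: ${span.duration / 1000}ms, failed=${failed}`);
    }
  }
}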

Q2. What is a correlation ID and why is every SDE-2 expected to know this?

Answer: Unique ID generated at the entry point of a request, passed as a header to every downstream service. Every service logs this ID. Enables searching all service logs for one request using a single ID. Without it, debugging cross-service failures requires manual correlation of timestamps — nearly impossible at scale.

Q3. What is the difference between metrics, logs, and traces?

Answer: Metrics = aggregated numbers (error rate, latency p99, QPS). Good for alerting and dashboards. Logs = individual events with detail (stack traces, request params). Good for debugging specific failures. Traces = causality chain of one request across services. Good for latency analysis and finding which service failed.

Q4. What is tail-based sampling and why is it better than head-based for Uber?

Answer: Head-based: decide to sample at start of request (e.g., 1% random). Risk: miss the rare slow/failing requests that are most important. Tail-based: buffer all spans, decide to keep after request completes. Keep all errors and slow requests, discard normal ones. At Uber's scale, tail-based ensures every bug gets captured.

Q5. What is OpenTelemetry and why does it matter?

Answer: Vendor-neutral observability standard. Instrument once with OTel SDK, export to any backend (Jaeger, Datadog, New Relic, Honeycomb). Prevents vendor lock-in. Standardizes trace, metric, and log formats. Adopted by all major cloud providers. For new systems today, always start with OTel instrumentation.

Key Engineering Lessons