Blog 17 — Uber: Distributed Tracing with Jaeger


Qubits of DPK

March 21, 2026

Core Case Studies
Core Concept: Distributed tracing, Correlation IDs, Jaeger/OpenTelemetry, Debugging failures across microservices
Why SDE-2 Critical: SDE-2 owns production — this topic separates seniors from juniors in interviews
Status: Draft notes ready

Quick Revision

  • Problem: One failing request touches many services and hides the root cause.
  • Core pattern: Trace IDs, spans, sampling, and a tracing backend like Jaeger.
  • Interview one-liner: Metrics tell you something is wrong; traces tell you where the request actually broke.

The Core Problem

User reports: "My Uber ride booking failed"

Microservices involved in one ride booking:
  API Gateway → Auth Service → Trip Service
  → Driver Service → Pricing Service → Maps Service
  → Payment Service → Notification Service

Which service failed? How long did each take?
Without tracing: check logs on 8 different servers — nightmare
With tracing: one dashboard shows entire request flow → 5 seconds to find bug

Core Concepts

Trace, Span, Correlation ID

Trace = entire journey of one request across all services
  Trace ID: "abc-123" (same for entire request)

Span = one unit of work within the trace
  Span 1: API Gateway (2ms)
  Span 2: Auth Service (5ms) [child of Span 1]
  Span 3: Trip Service (50ms) [child of Span 1]
  Span 4: Maps Service (45ms) [child of Span 3]
  Span 5: Payment Service (200ms) [child of Span 1] ← SLOW!

Correlation ID = Trace ID passed in every HTTP header:
  X-Trace-ID: abc-123
  Every service logs this ID → all logs linkable
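
As a concrete sketch: a minimal Express middleware that reuses an incoming X-Trace-ID or mints one at the entry point, logs it, and forwards it on downstream calls. The header name, route, and pricing-service URL are illustrative assumptions, not Uber's actual setup (Node 18+ assumed for the global fetch).

const express = require('express');
const crypto = require('crypto');

const app = express();

// Reuse the incoming correlation ID, or mint a new one at the entry point.
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  res.setHeader('X-Trace-ID', req.traceId); // echo it back so callers can report it
  next();
});

app.get('/trips', async (req, res) => {
  // Every log line carries the trace ID, so logs from all services become linkable.
  console.log(JSON.stringify({ traceId: req.traceId, msg: 'booking trip' }));

  // Propagate the same ID on every downstream call (hypothetical pricing-service URL).
  const pricing = await fetch('http://pricing-service/quote', {
    headers: { 'X-Trace-ID': req.traceId },
  });
  res.json({ quote: await pricing.json(), traceId: req.traceId });
});

app.listen(3000);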

How Jaeger Works

Service A handles request:
  1. Generate Trace ID (if first service)
     OR extract from incoming header (if downstream)
  2. Create Span with start timestamp
  3. Call Service B, pass Trace ID in header
  4. Service B creates child Span
  5. On completion: report Span to Jaeger Agent (local sidecar)

Jaeger Agent → batches spans → Jaeger Collector → Storage (Cassandra/ES)

Jaeger UI:
  Search by Trace ID → see waterfall diagram of all spans
  Immediately see: which service is slow, which failed
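
Roughly the same five steps in code, using the OpenTelemetry JavaScript API (which exports to Jaeger). The tracer name and Maps Service URL are placeholders, and a real service would also register the SDK and exporter as in the OpenTelemetry section below.

const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('trip-service'); // placeholder instrumentation name

async function handleBooking(incomingHeaders) {
  // 1. Extract trace context from the incoming headers (or start a new trace if absent).
  const parentCtx = propagation.extract(context.active(), incomingHeaders);

  // 2. Create a span; the start timestamp is recorded automatically.
  return tracer.startActiveSpan('handleBooking', {}, parentCtx, async (span) => {
    try {
      // 3. Call the downstream service, injecting the current trace context into its headers.
      const outgoingHeaders = {};
      propagation.inject(context.active(), outgoingHeaders);

      // 4. The downstream service (placeholder URL) creates a child span on its side.
      await fetch('http://maps-service/route', { headers: outgoingHeaders });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // 5. Ending the span hands it to whatever exporter/agent the SDK was configured with.
      span.end();
    }
  });
}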

Sampling Strategy

Problem: 1M requests/second, trace ALL of them?
  → Massive storage + overhead

Solution: Sampling
  Head-based sampling: sample 1% of requests randomly
    Pro: Low overhead
    Con: Miss rare bugs that happen in 0.1% of requests

  Tail-based sampling (Uber's approach):
    Collect ALL spans in buffer
    If request completes normally → discard
    If request is slow or errors → KEEP the trace
    Pro: Always capture interesting traces
    Con: Higher buffer memory
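
A toy sketch of the tail-based decision, not Uber's actual sampler: buffer finished spans per trace, then keep the trace only if something errored or ran slow. The span shape and the 500 ms threshold are assumptions for the example.

// Hypothetical in-memory buffer: traceId -> finished spans for that trace.
const buffers = new Map();

function onSpanFinished(span) {
  // Assumed span shape for this sketch: { traceId, name, durationMs, error }
  const spans = buffers.get(span.traceId) || [];
  spans.push(span);
  buffers.set(span.traceId, spans);
}

function onTraceCompleted(traceId, { slowThresholdMs = 500 } = {}) {
  const spans = buffers.get(traceId) || [];
  buffers.delete(traceId); // free the buffer either way

  const hasError = spans.some((s) => s.error);
  const isSlow = spans.some((s) => s.durationMs > slowThresholdMs);

  if (hasError || isSlow) {
    exportTrace(traceId, spans); // keep the interesting trace
  }
  // Normal, fast traces are simply dropped; buffer memory is the price of
  // never missing a slow or failing request.
}

function exportTrace(traceId, spans) {
  console.log(`keeping trace ${traceId} (${spans.length} spans)`);
}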

OpenTelemetry — The Standard

Problem: Every APM tool (Jaeger, Zipkin, Datadog, New Relic)
  has a different SDK → vendor lock-in

OpenTelemetry: Vendor-neutral standard for:
  ├── Traces (distributed tracing)
  ├── Metrics (counters, gauges, histograms)
  └── Logs (structured logging)

Instrument once with OTel SDK → export to any backend
Switch from Jaeger to Datadog without code changes
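
A minimal instrument-once setup with the OpenTelemetry Node SDK; the packages are the real OTel packages, but the service name and OTLP endpoint are placeholders. Pointing the exporter at Jaeger today or at a vendor later is a config change, not a code change.

// npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
//             @opentelemetry/exporter-trace-otlp-http
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'trip-service', // placeholder
  // Point this at Jaeger's OTLP endpoint today, or at a vendor's endpoint later;
  // the application code never changes.
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-traces http, express, etc.
});

sdk.start();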

The Three Pillars of Observability

1. Metrics: What is happening? (aggregate)
   "Payment service has 5% error rate"
   Tools: Prometheus, Grafana

2. Logs: What happened in detail? (events)
   "payment_id=xyz failed with NullPointerException at line 234"
   Tools: ELK Stack (Elasticsearch + Logstash + Kibana)

3. Traces: Where did it happen? (causality)
   "Request abc-123 failed in Payment Service, called from Trip Service"
   Tools: Jaeger, Zipkin, Datadog APM

All three together = full observability
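
The pillars link up when every log line carries the active trace ID. A small sketch with the OpenTelemetry API; the logger shape is an assumption, and any structured logger works the same way.

const { trace } = require('@opentelemetry/api');

// Attach the current trace ID to every structured log record, so a log line found
// in Kibana can be pasted straight into Jaeger's trace search.
function logWithTrace(level, msg, fields = {}) {
  const spanContext = trace.getActiveSpan()?.spanContext();
  console.log(JSON.stringify({
    level,
    msg,
    trace_id: spanContext?.traceId, // links this log to the trace
    span_id: spanContext?.spanId,   // and to the exact span within it
    ...fields,
  }));
}

logWithTrace('error', 'payment failed', { payment_id: 'xyz' });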

5 Interview Questions This Blog Unlocks

Q1. A request is failing in your microservices system — how do you debug it?

Answer: Use distributed tracing (Jaeger/Zipkin). Find the trace ID from logs or error report. Pull up trace in Jaeger UI — waterfall view shows all service calls, durations, errors. Identify the failing span (red in UI). Drill into that service's logs using trace ID as correlation ID. Fix root cause.
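
If you want to script that lookup instead of clicking through the UI, the Jaeger query service serves the same data over HTTP: the /api/traces/{traceId} endpoint on port 16686 is what the UI itself calls (an internal, not officially stable, API). A rough sketch, with host and thresholds as placeholders:

// Fetch one trace by ID from the Jaeger query service and print its slow or failed spans.
// Jaeger reports span durations in microseconds.
async function inspectTrace(traceId) {
  const res = await fetch(`http://jaeger-query:16686/api/traces/${traceId}`);
  const { data } = await res.json(); // response shape: { data: [trace] }
  const spans = data[0]?.spans ?? [];

  for (const span of spans) {
    const failed = (span.tags || []).some((t) => t.key === 'error' && t.value === true);
    const slow = span.duration > 100_000; // > 100ms
    if (failed || slow) {
      console.log(`${span.operationName}: ${span.duration / 1000}ms, failed=${failed}`);
    }
  }
}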

Q2. What is a correlation ID and why is every SDE-2 expected to know this?

Answer: Unique ID generated at the entry point of a request, passed as a header to every downstream service. Every service logs this ID. Enables searching all service logs for one request using a single ID. Without it, debugging cross-service failures requires manual correlation of timestamps — nearly impossible at scale.

Q3. What is the difference between metrics, logs, and traces?

Answer: Metrics = aggregated numbers (error rate, latency p99, QPS). Good for alerting and dashboards. Logs = individual events with detail (stack traces, request params). Good for debugging specific failures. Traces = causality chain of one request across services. Good for latency analysis and finding which service failed.

Q4. What is tail-based sampling and why is it better than head-based for Uber?

Answer: Head-based: decide to sample at start of request (e.g., 1% random). Risk: miss the rare slow/failing requests that are most important. Tail-based: buffer all spans, decide to keep after request completes. Keep all errors and slow requests, discard normal ones. At Uber's scale, tail-based ensures every bug gets captured.

Q5. What is OpenTelemetry and why does it matter?

Answer: Vendor-neutral observability standard. Instrument once with OTel SDK, export to any backend (Jaeger, Datadog, New Relic, Honeycomb). Prevents vendor lock-in. Standardizes trace, metric, and log formats. Adopted by all major cloud providers. For new systems today, always start with OTel instrumentation.

Key Engineering Lessons