Blog 17 — Uber: Distributed Tracing with Jaeger
Qubits of DPK
March 21, 2026
Core Case Studies
Core Concept: Distributed tracing, Correlation IDs, Jaeger/OpenTelemetry, Debugging failures across microservices
Why SDE-2 Critical: SDE-2 owns production — this topic separates seniors from juniors in interviews
Status: Draft notes ready
Quick Revision
- Problem: One failing request touches many services and hides the root cause.
- Core pattern: Trace IDs, spans, sampling, and a tracing backend like Jaeger.
- Interview one-liner: Metrics tell you something is wrong; traces tell you where the request actually broke.
The Core Problem
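Picture one ride request fanning out across several services, each logging on its own. A minimal sketch of the pain (service names are hypothetical, illustration only):

```javascript
// Illustration only: three services handling one ride request,
// each logging independently. Service names are hypothetical.

function gatewayService(req) {
  console.log(`[gateway] received request for rider ${req.riderId}`);
  return pricingService(req);
}

function pricingService(req) {
  console.log(`[pricing] computing fare`); // no request ID in the line
  return dispatchService(req);
}

function dispatchService(req) {
  console.log(`[dispatch] ERROR: no drivers available`); // which request failed?
  throw new Error("no drivers available");
}

// With thousands of concurrent requests, the ERROR line in dispatch's
// logs cannot be matched back to the gateway line that started it;
// the only join key is a timestamp, which is ambiguous at scale.
try { gatewayService({ riderId: 42 }); } catch (e) { /* context lost */ }
```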
Core Concepts: Trace, Span, Correlation ID
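A trace is the full journey of one request; a span is one unit of work inside it; the correlation ID is the shared key that links everything. A sketch of the data model using common field names (not Jaeger's exact schema):

```javascript
// Illustration of the trace/span data model. Field names follow common
// tracing conventions; this is a sketch, not Jaeger's storage format.

const crypto = require("crypto");
const newId = () => crypto.randomBytes(8).toString("hex");

const traceId = newId(); // one ID for the whole request

// Root span: the entry point of the request.
const rootSpan = {
  traceId,
  spanId: newId(),
  parentSpanId: null,
  operation: "POST /rides",
  startMs: Date.now(),
};

// Child span: a downstream call, linked back to its parent.
const childSpan = {
  traceId,                       // same trace as the root
  spanId: newId(),
  parentSpanId: rootSpan.spanId, // causality: who called me
  operation: "pricing.computeFare",
  startMs: Date.now(),
};

// The traceId doubles as the correlation ID: every service logs it,
// so one search across all logs reconstructs the request's full path.
console.log(JSON.stringify({ traceId, msg: "fare computed" }));
```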
How Jaeger Works
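Jaeger's pipeline: the app's tracing SDK reports finished spans to a local jaeger-agent (over UDP), which forwards them to the jaeger-collector, which writes to storage (Cassandra or Elasticsearch); the jaeger-query service and UI read from storage. The design point worth remembering is that span reporting is buffered and out-of-band, so tracing never blocks the request path. A toy sketch of that idea (class and batch size are illustrative, not Jaeger's client code):

```javascript
// Conceptual sketch of span reporting in Jaeger's pipeline:
//
//   app SDK -> jaeger-agent (UDP sidecar) -> jaeger-collector
//           -> storage (Cassandra / Elasticsearch) -> jaeger-query UI

class SpanBuffer {
  constructor(flushFn, maxBatch = 100) {
    this.batch = [];
    this.flushFn = flushFn;
    this.maxBatch = maxBatch;
  }
  report(span) {
    this.batch.push(span); // non-blocking for the request path
    if (this.batch.length >= this.maxBatch) this.flush();
  }
  flush() {
    if (this.batch.length === 0) return;
    this.flushFn(this.batch); // e.g. a UDP packet to the local agent
    this.batch = [];
  }
}

// Hypothetical wiring: in production this would emit UDP to the local
// jaeger-agent, which forwards batches to the collector.
const buffer = new SpanBuffer((spans) =>
  console.log(`shipping ${spans.length} spans to jaeger-agent`));
buffer.report({ operation: "GET /drivers", durationMs: 12 });
buffer.flush();
```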
Sampling Strategy
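At Uber-level traffic you cannot afford to store every span, so you sample. The simplest approach is head-based: decide at the entry point, before anything has happened. A sketch (the rate is illustrative; x-b3-sampled is a common Zipkin/B3 propagation header):

```javascript
// Head-based sampling sketch: the keep/drop decision is made when the
// request ENTERS the system, before we know how it will end.

function headSample(rate = 0.01) {
  return Math.random() < rate; // e.g. keep 1% of traces
}

function handleRequest(req) {
  const sampled = headSample();
  // Propagate the decision downstream so every service agrees.
  req.headers["x-b3-sampled"] = sampled ? "1" : "0";
  // Risk: a rare failing request has a 99% chance of being dropped
  // here, before we ever learn that it was interesting.
}

handleRequest({ headers: {} });
```

Tail-based sampling defers this decision to the end of the request; see the sketch under Q4 below.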
OpenTelemetry — The Standard
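The core value proposition is vendor neutrality: application code talks only to the OTel API, and the exporter is the single swappable, vendor-specific piece. A sketch assuming the OpenTelemetry JS SDK (package names and setup details vary across SDK versions):

```javascript
// Sketch of OTel's "instrument once, export anywhere" idea, assuming
// the OpenTelemetry JS SDK. Setup details vary by SDK version.

const { trace } = require("@opentelemetry/api");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");

// The exporter is the ONLY vendor-specific piece. Point OTLP at Jaeger
// today, at Datadog or Honeycomb tomorrow; instrumentation is untouched.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter()));
provider.register();

// Application code only ever sees the vendor-neutral API.
const tracer = trace.getTracer("ride-service");
tracer.startActiveSpan("computeFare", (span) => {
  span.setAttribute("rider.id", 42);
  span.end();
});
```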
The Three Pillars of Observability
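One failing request seen through all three pillars; plain objects stand in for real backends, and the shared trace ID doing the stitching is the point:

```javascript
// Sketch: the same failure through metrics, logs, and traces.

const traceId = "abc123";

// Metric: an aggregated counter; cheap, ideal for alerting,
// but it cannot tell you WHICH request failed.
const metrics = { "dispatch.errors": 1 };

// Log: a detailed event; ideal for root cause, but only useful
// if you can find it, hence the traceId field.
const logLine = { level: "error", traceId, msg: "no drivers available" };

// Trace: the causality chain; shows WHERE in the call graph it broke.
const span = { traceId, operation: "dispatch.assignDriver", error: true };

// The traceId is the thread that stitches the pillars together:
// the alert fires on the metric, the trace shows the failing span,
// and logs filtered by traceId give the stack trace.
console.log(metrics, logLine, span);
```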
5 Interview Questions This Blog Unlocks
Q1. A request is failing in your microservices system — how do you debug it?
Answer: Use distributed tracing (Jaeger/Zipkin). Find the trace ID in the logs or the error report. Pull up the trace in the Jaeger UI; the waterfall view shows every service call with durations and errors. Identify the failing span (shown in red in the UI). Drill into that service's logs using the trace ID as the correlation ID. Fix the root cause.
Q2. What is a correlation ID and why is every SDE-2 expected to know this?
Answer: Unique ID generated at the entry point of a request, passed as a header to every downstream service. Every service logs this ID. Enables searching all service logs for one request using a single ID. Without it, debugging cross-service failures requires manual correlation of timestamps — nearly impossible at scale.
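A sketch of the propagation pattern this answer describes, assuming Express; the x-correlation-id header name is a common convention, not a standard:

```javascript
// Correlation-ID propagation sketch (Express-style middleware).

const { randomUUID } = require("crypto");

function correlationMiddleware(req, res, next) {
  // Reuse the ID if an upstream service already set it; otherwise
  // this service is the entry point and generates one.
  req.correlationId = req.headers["x-correlation-id"] || randomUUID();
  res.setHeader("x-correlation-id", req.correlationId);
  next();
}

// Every log line and every downstream call carries the same ID.
async function callPricing(req) {
  console.log(JSON.stringify({
    correlationId: req.correlationId,
    msg: "calling pricing",
  }));
  return fetch("http://pricing/fare", {
    headers: { "x-correlation-id": req.correlationId },
  });
}
```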
Q3. What is the difference between metrics, logs, and traces?
Answer: Metrics = aggregated numbers (error rate, latency p99, QPS). Good for alerting and dashboards. Logs = individual events with detail (stack traces, request params). Good for debugging specific failures. Traces = causality chain of one request across services. Good for latency analysis and finding which service failed.
Q4. What is tail-based sampling and why is it better than head-based for Uber?
Answer: Head-based: decide whether to sample at the start of the request (e.g., 1% at random). Risk: you miss the rare slow or failing requests that matter most. Tail-based: buffer all spans and decide whether to keep the trace after the request completes; keep all errors and slow requests, discard the normal ones. At Uber's scale, tail-based sampling ensures every error trace gets captured.
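A sketch of the tail-based decision (thresholds and keep-rate are illustrative, not Uber's actual values):

```javascript
// Tail-based sampling sketch: buffer the whole trace, decide at the END.

const SLOW_MS = 1000;

function keepTrace(spans) {
  const hasError = spans.some((s) => s.error);
  const isSlow = spans.some((s) => s.durationMs > SLOW_MS);
  // Keep every interesting trace, plus a small random baseline of
  // normal ones so dashboards still reflect typical behaviour.
  return hasError || isSlow || Math.random() < 0.001;
}

// Called once the root span finishes and all child spans are buffered.
function onTraceComplete(spans, exporter) {
  if (keepTrace(spans)) exporter.export(spans);
  // else: discard; storage only pays for traces worth reading
}
```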
Q5. What is OpenTelemetry and why does it matter?
Answer: Vendor-neutral observability standard. Instrument once with OTel SDK, export to any backend (Jaeger, Datadog, New Relic, Honeycomb). Prevents vendor lock-in. Standardizes trace, metric, and log formats. Adopted by all major cloud providers. For new systems today, always start with OTel instrumentation.