Blog 2 — Uber: Scaling Kafka to 1.1 Trillion Messages/Day
Qubits of DPK
March 21, 2026
Core Case Studies
Core Concept: Event-driven architecture, Message queues at scale, Partitioning strategy
Why SDE-2 Critical: Messaging/event systems appear in every backend architecture — Zomato, Swiggy, Amazon all use Kafka
Status: Draft notes ready
Quick Revision
- Problem: How to move trillions of events without tightly coupling services.
- Core pattern: Partitioned Kafka topics plus consumer groups.
- Interview one-liner: Kafka turns synchronous service chains into durable async pipelines.
Architecture Overview
Producers (e.g., trip service) → Kafka brokers (partitioned topics) → consumer groups (e.g., notification service), each reading at its own pace.
Core Concepts
What is Kafka?
- Distributed message queue / event streaming platform
- Producers write events → Kafka stores them → Consumers read at their own pace
- Events are persisted on disk for a retention period; unlike traditional queues, they are not deleted once consumed
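The bullets above can be sketched as a toy in-memory log. This is not real Kafka client code; `Log`, `produce`, and `consume` are illustrative names, but they show the key property: reading an event does not delete it, so consumers at different positions are independent.

```python
from dataclasses import dataclass, field

@dataclass
class Log:
    """Minimal stand-in for a single Kafka partition:
    an append-only sequence of events addressed by offset."""
    events: list = field(default_factory=list)

    def produce(self, event) -> int:
        self.events.append(event)      # events are retained, not popped
        return len(self.events) - 1    # offset of the new event

    def consume(self, offset: int):
        # Reading does NOT remove the event; each consumer just tracks
        # its own offset, so other readers are unaffected.
        return self.events[offset]

log = Log()
log.produce("trip_started")
log.produce("trip_ended")

assert log.consume(0) == "trip_started"
assert log.consume(0) == "trip_started"  # still there after being read
```

Contrast with a classic work queue, where a popped message is gone for everyone.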
Why Uber Needed Kafka
- 1.1 trillion messages per day = ~12.7 million messages per second
- Services are decoupled — payment service doesn't directly call notification service
- Handles traffic spikes gracefully — consumers process at their own pace
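The back-of-envelope conversion from the daily total to the per-second rate quoted above:

```python
# 1.1 trillion events/day spread over 86,400 seconds in a day:
per_second = 1.1e12 / 86_400
print(round(per_second / 1e6, 1))  # → 12.7 (million messages/second)
```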
Partitioning Strategy
- Same city's events always go to same partition → ordering guaranteed per city
- Multiple partitions → parallelism across consumer instances
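A minimal sketch of key-based partitioning. The real Kafka default partitioner hashes the key bytes with murmur2; any stable hash gives the same guarantee, and the partition count here is an assumed value for illustration.

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count for illustration

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically: equal keys
    always land on the same partition, preserving their order."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by the same city hits the same partition,
# so per-city ordering is preserved:
assert partition_for("bengaluru") == partition_for("bengaluru")
```

Different keys spread across partitions, which is what lets multiple consumer instances work in parallel.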
Consumer Groups
- Multiple services can consume the SAME event independently
- Each group maintains its own offset (position)
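A toy model of the two bullets above: the topic is shared, but each group tracks its own offset. Service names are illustrative, not from Uber's actual topology.

```python
# One shared topic, one independent read position per consumer group.
topic = ["trip_started", "payment_captured", "trip_ended"]
offsets = {"notification-service": 0, "analytics-service": 0}

def poll(group: str):
    """Return the next unread event for this group and advance
    only that group's offset."""
    pos = offsets[group]
    if pos >= len(topic):
        return None          # caught up: nothing new to read
    offsets[group] = pos + 1
    return topic[pos]

assert poll("notification-service") == "trip_started"
assert poll("notification-service") == "payment_captured"
# analytics is untouched by notification's progress:
assert poll("analytics-service") == "trip_started"
```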
Scale Achieved
- 1.1 trillion messages per day, i.e. roughly 12.7 million messages per second sustained.
5 Interview Questions This Blog Unlocks
Q1. Design a real-time notification system for Uber
Answer: Producer (trip service) → Kafka topic (trip-events) → Consumer (notification service) → Push notification. Kafka decouples them — if notification service is slow, trip service is unaffected.
Q2. Why use Kafka instead of direct API calls between services?
Answer: Direct calls = tight coupling. If notification service is down, trip service fails too. Kafka = async buffer. Trip service writes event and moves on. Notification service processes when ready. Resilience + decoupling.
Q3. How does Kafka guarantee message ordering?
Answer: Ordering is guaranteed within a partition only. If you need all events for a user in order → use userId as partition key → all their events go to same partition → ordered.
Q4. What is consumer lag and why does it matter?
Answer: Consumer lag = difference between latest offset in partition and consumer's current offset. High lag = consumer is falling behind. At Uber scale, monitoring lag is critical — it signals a slow consumer before it becomes an outage.
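The definition in the answer is a single subtraction; the offsets below are illustrative numbers, not real Uber metrics.

```python
def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    """Lag = messages the consumer still has to process:
    latest offset in the partition minus the consumer's committed offset."""
    return latest_offset - committed_offset

# Partition has 1,000,000 messages; consumer has committed 999,200:
assert consumer_lag(1_000_000, 999_200) == 800
```

In practice you would alert when this number grows over time, since a steadily rising lag means the consumer can no longer keep up with producers.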
Q5. How does Kafka handle a broker failure?
Answer: Each partition has a leader and replicas on other brokers. If leader fails → one replica is automatically elected as new leader. Producers and consumers connect to new leader. Replication factor of 3 means 2 broker failures are tolerated.
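A sketch of the failover idea, assuming a preference-ordered replica list; the real election is coordinated by Kafka's controller among in-sync replicas, so this is a conceptual model, not the actual protocol.

```python
def elect_leader(replicas, alive):
    """Promote the first surviving replica to partition leader."""
    for broker in replicas:      # replicas listed in preference order
        if broker in alive:
            return broker
    raise RuntimeError("all replicas lost: partition unavailable")

replicas = ["broker-1", "broker-2", "broker-3"]  # replication factor 3

# Leader (broker-1) fails → broker-2 takes over:
assert elect_leader(replicas, alive={"broker-2", "broker-3"}) == "broker-2"
# Two failures still leave a leader:
assert elect_leader(replicas, alive={"broker-3"}) == "broker-3"
```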