Blog 2 — Uber: Scaling Kafka to 1.1 Trillion Messages/Day

Qubits of DPK

March 21, 2026

Core Case Studies
Core Concept: Event-driven architecture, Message queues at scale, Partitioning strategy
Why SDE-2 Critical: Messaging/event systems appear in every backend architecture — Zomato, Swiggy, Amazon all use Kafka
Status: Draft notes ready

Quick Revision

  • Problem: How to move trillions of events without tightly coupling services.
  • Core pattern: Partitioned Kafka topics plus consumer groups.
  • Interview one-liner: Kafka turns synchronous service chains into durable async pipelines.

Architecture Overview

Producer (App Server)
        ↓
Kafka Broker Cluster
  ├── Topic: order-events     (partitioned)
  ├── Topic: payment-events   (partitioned)
  └── Topic: driver-events    (partitioned)
        ↓
Consumer Groups
  ├── Notification Service
  ├── Analytics Service
  └── Billing Service

Core Concepts

What is Kafka?

  • Distributed message queue / event streaming platform
  • Producers write events → Kafka stores them → Consumers read at their own pace
  • Events are persisted on disk and remain available after consumption, unlike traditional queues that delete messages once they are read

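The bullets above boil down to one data structure: an append-only log with non-destructive reads. A minimal sketch in Python (illustrative model only, not the real Kafka client API; the `Topic` class and its methods are invented for this example):

```python
class Topic:
    """Illustrative model of a Kafka topic as an append-only log."""

    def __init__(self):
        self.log = []  # events persist in order; reads never remove them

    def produce(self, event):
        self.log.append(event)
        return len(self.log) - 1  # offset assigned to the new event

    def consume(self, offset):
        # Non-destructive read: any consumer can re-read any offset.
        return self.log[offset]


topic = Topic()
topic.produce({"type": "trip_started", "trip_id": 1})
topic.produce({"type": "trip_ended", "trip_id": 1})

first = topic.consume(0)
again = topic.consume(0)  # same event is still there after being read
assert first == again
```

This is the key contrast with a traditional queue, where a read would have popped the message off the queue for good.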
Why Uber Needed Kafka

  • 1.1 trillion messages per day = ~12.7 million messages per second
  • Services are decoupled — payment service doesn't directly call notification service
  • Handles traffic spikes gracefully — consumers process at their own pace

Partitioning Strategy

Topic: trip-events
  ├── Partition 0 → all events for city_id: Mumbai
  ├── Partition 1 → all events for city_id: Delhi
  └── Partition 2 → all events for city_id: Bangalore
  • Same city's events always go to same partition → ordering guaranteed per city
  • Multiple partitions → parallelism across consumer instances
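The routing rule can be sketched as hashing the partition key (here `city_id`) modulo the partition count. Kafka's default partitioner actually uses murmur2; the stable `crc32` hash below is a stand-in purely for illustration:

```python
import zlib

NUM_PARTITIONS = 3


def partition_for(key: str) -> int:
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


# Every event keyed by the same city lands on the same partition,
# which is exactly what preserves per-city ordering.
assert partition_for("Mumbai") == partition_for("Mumbai")
```

Because the mapping is deterministic, no coordination is needed: any producer instance computes the same partition for the same key.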

Consumer Groups

Notification Consumer Group:
  Instance 1 → reads Partition 0
  Instance 2 → reads Partition 1
  Instance 3 → reads Partition 2

Analytics Consumer Group:
  Instance 1 → reads ALL partitions independently
  • Multiple services can consume the SAME event independently
  • Each group maintains its own offset (position)
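The "own offset per group" idea can be sketched as a dictionary of committed positions over one shared log (a toy model, not the Kafka consumer API; group names and the `poll` helper are invented for this example):

```python
partition = ["e0", "e1", "e2"]  # one partition's log, shared by all groups
offsets = {"notification": 0, "analytics": 0}  # committed offset per group


def poll(group):
    """Return the next event for this group, or None if caught up."""
    pos = offsets[group]
    if pos >= len(partition):
        return None
    event = partition[pos]
    offsets[group] = pos + 1  # commit the offset after processing
    return event


poll("notification")  # notification group reads "e0"
poll("notification")  # then "e1"

# analytics is unaffected by notification's progress: it starts at its own offset
assert poll("analytics") == "e0"
```

A slow analytics pipeline therefore never blocks notifications; its offset simply lags further behind.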

Scale Achieved

5 Interview Questions This Blog Unlocks

Q1. Design a real-time notification system for Uber

Answer: Producer (trip service) → Kafka topic (trip-events) → Consumer (notification service) → Push notification. Kafka decouples them — if notification service is slow, trip service is unaffected.

Q2. Why use Kafka instead of direct API calls between services?

Answer: Direct calls = tight coupling. If notification service is down, trip service fails too. Kafka = async buffer. Trip service writes event and moves on. Notification service processes when ready. Resilience + decoupling.

Q3. How does Kafka guarantee message ordering?

Answer: Ordering is guaranteed within a partition only. If you need all events for a user in order → use userId as partition key → all their events go to same partition → ordered.

Q4. What is consumer lag and why does it matter?

Answer: Consumer lag = difference between latest offset in partition and consumer's current offset. High lag = consumer is falling behind. At Uber scale, monitoring lag is critical — it signals a slow consumer before it becomes an outage.
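Lag is plain arithmetic on offsets, which is why it is cheap to monitor continuously. A sketch (the numbers are made up for illustration):

```python
def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    """How many events the consumer still has to process in this partition."""
    return latest_offset - committed_offset


lag = consumer_lag(latest_offset=1_000_000, committed_offset=999_200)
# A lag of 800 is fine if it is stable; a lag that grows every minute
# means the consumer is permanently slower than the producer.
assert lag == 800
```

In practice the signal to alert on is the trend, not the absolute number: a steadily growing lag is the early warning the answer above describes.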

Q5. How does Kafka handle a broker failure?

Answer: Each partition has a leader and replicas on other brokers. If the leader fails, one replica is automatically elected as the new leader, and producers and consumers reconnect to it. A replication factor of 3 tolerates 2 broker failures.
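The failover logic can be sketched as promoting a surviving replica when the leader's broker disappears. This is a deliberately simplified model (real Kafka elects the new leader from the in-sync replica set via the controller; broker names and the `fail_broker` helper are invented here):

```python
# One partition with replication factor 3: one leader, two followers.
replicas = {"broker-1": "leader", "broker-2": "follower", "broker-3": "follower"}


def fail_broker(broker):
    """Remove a broker; if it led the partition, promote a surviving replica."""
    was_leader = replicas.pop(broker) == "leader"
    if was_leader and replicas:
        new_leader = next(iter(replicas))
        replicas[new_leader] = "leader"


fail_broker("broker-1")
# A surviving replica took over, so producers and consumers can continue.
assert "leader" in replicas.values()
```

The important property is that the data survives the failure: the promoted replica already holds a copy of the partition's log.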

Key Engineering Lessons