Blog 5 — Discord: How We Store Billions of Messages

Qubits of DPK

March 21, 2026

Core Case Studies
Core Concept: MongoDB → Cassandra → ScyllaDB migration, Time-series data storage
Why SDE-2 Critical: Storage decisions at scale — why you pick which DB for which workload
Status: Draft notes ready

Quick Revision

  • Problem: Store massive chat history with fast append and recent-message reads.
  • Core pattern: Time-bucketed wide-column storage with Cassandra/ScyllaDB.
  • Interview one-liner: Pick a database around the access pattern, not the brand name.

The Journey: 3 Database Migrations

```
2015: MongoDB
  └── Hits scaling limits → too slow for message history

2017: Apache Cassandra
  └── Works but operational complexity grows

2023: ScyllaDB (Cassandra-compatible, written in C++)
  └── Current solution: 4x better performance
```

Core Concepts

Why MongoDB Failed

```
MongoDB stores messages as documents:
  {
    channel_id: "123",
    messages: [ msg1, msg2, msg3... ]
  }

Problems:
  - Hot channels get huge, ever-growing documents
  - Random access to old messages = full document scan
  - No native time-series support
  - Uneven data distribution across shards
```

Why Cassandra Was Chosen

  • Designed for time-series data (perfect for chat messages)
  • Wide column model: partition = channel, rows = messages ordered by time
  • Linear horizontal scaling — add nodes, capacity grows linearly
  • Tunable consistency — Discord chose eventual consistency (fine for chat)

Cassandra Data Model for Messages

```
Partition key:  (channel_id, bucket)
Clustering key: message_id (snowflake, timestamp-ordered)

Example:
  Partition: (channel_123, 2024-01)
    Row: msg_id=1710000001 → "Hey!"
    Row: msg_id=1710000002 → "How are you?"
    Row: msg_id=1710000003 → "Great, thanks!"

Fetch last 50 messages:
  SELECT * FROM messages
  WHERE channel_id = 123 AND bucket = '2024-01'
  ORDER BY message_id DESC LIMIT 50;

Single-partition read → extremely fast
```

The Bucket Problem

  • A channel with 10 years of messages in one partition = too large
  • Discord splits by time bucket (e.g., monthly)
  • Old bucket = cold storage, recent bucket = hot
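Because the clustering key is a snowflake, the bucket can be derived from the message ID itself: a snowflake stores its creation time (milliseconds since Discord's 2015 epoch) in the top bits. A minimal sketch, assuming the monthly buckets used in the example above (the production bucket width is an implementation choice):

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # 2015-01-01T00:00:00Z, the snowflake epoch

def snowflake_timestamp(message_id: int) -> datetime:
    """Extract the creation time embedded in a snowflake ID (bits 22 and up)."""
    ms = (message_id >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def time_bucket(message_id: int) -> str:
    """Monthly bucket string like '2024-01', matching the partition-key example."""
    ts = snowflake_timestamp(message_id)
    return f"{ts.year:04d}-{ts.month:02d}"
```

Since the bucket is derivable from the ID, a reader can walk backwards bucket by bucket until it has collected 50 messages, with no separate index needed.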

Why ScyllaDB Over Cassandra

```
Cassandra: Written in Java → JVM GC pauses → latency spikes
ScyllaDB:  Written in C++  → no GC → predictable low latency

Result at Discord:
  - 4x better performance
  - Same Cassandra query language (CQL)
  - No application code changes needed
  - Fewer nodes needed → cost reduction
```

Scale Achieved

5 Interview Questions This Blog Unlocks

Q1. Design a chat system like WhatsApp / Slack

Answer: Use Cassandra/ScyllaDB with partition key = (channel_id, time_bucket), clustering key = message_id. Recent messages read from hot partition. Historical messages from cold buckets. WebSockets for real-time delivery. Kafka for fan-out to group members.
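The real-time delivery half of this answer boils down to a publish/subscribe fan-out per channel. A minimal in-memory sketch (stand-in for the Kafka + WebSocket path described above; all names are illustrative):

```python
from collections import defaultdict

class ChannelFanout:
    """Toy fan-out: deliver each published message to every live channel subscriber."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel_id -> list of delivery callbacks

    def subscribe(self, channel_id, callback):
        """Register a delivery callback (stand-in for an open WebSocket)."""
        self.subscribers[channel_id].append(callback)

    def publish(self, channel_id, message):
        """Persisting to storage is omitted; here we only fan out to subscribers."""
        for deliver in self.subscribers[channel_id]:
            deliver(message)
```

In production the subscriber list lives behind a broker (e.g. Kafka topics per shard), so senders never need to know which gateway node holds each member's connection.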

Q2. Why is Cassandra good for time-series data?

Answer: Cassandra's wide column model naturally maps to time-series. Partition = entity (channel/user/device), rows = time-ordered events. Append-only writes are extremely fast. Range queries by time are efficient. No JOINs needed.

Q3. What is the difference between relational and wide-column databases?

Answer: Relational: fixed schema, rows have same columns, optimized for JOINs and transactions. Wide-column: flexible schema, rows in same partition can have different columns, optimized for massive write throughput and range queries by partition + clustering key.

Q4. How would you handle message deletion in a Cassandra-based chat system?

Answer: Cassandra uses tombstones for deletion (soft delete marker). Hard deletes are expensive. Better approach: mark message as deleted in application layer, filter on read. Compact tombstones periodically. Never rely on frequent hard deletes in Cassandra.
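The "mark deleted, filter on read" pattern can be sketched as follows. The `deleted` field name is hypothetical; the point is that the delete is an ordinary upsert, not a Cassandra `DELETE`, so no tombstone is written:

```python
def soft_delete(message: dict) -> dict:
    """Mark a message deleted via an upsert instead of a tombstone-creating DELETE."""
    return {**message, "deleted": True}

def visible(messages: list) -> list:
    """Read-path filter: hide soft-deleted rows before returning them to clients."""
    return [m for m in messages if not m.get("deleted")]
```

The storage cost is one extra boolean per deleted row, traded against avoiding tombstone accumulation on the hot read path.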

Q5. Why did Discord migrate from Cassandra to ScyllaDB with zero downtime?

Answer: ScyllaDB is wire-compatible with Cassandra (same CQL protocol). Migration strategy: run both in parallel → double-write to both → gradually shift reads to ScyllaDB → verify data consistency → decommission Cassandra. Application code unchanged.
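The parallel-run phase of that strategy can be sketched as a thin storage wrapper. `InMemoryStore` stands in for a Cassandra or ScyllaDB client; the class and method names are illustrative, not Discord's actual code:

```python
class InMemoryStore:
    """Stand-in for a Cassandra/ScyllaDB client."""

    def __init__(self):
        self.rows = []

    def write(self, msg: dict) -> None:
        self.rows.append(msg)

    def read(self, channel_id) -> list:
        return [m for m in self.rows if m["channel_id"] == channel_id]


class DualWriteStore:
    """Double-write to old and new clusters; flip reads once the new side is verified."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.read_from_new = False  # flipped gradually during the migration

    def write(self, msg: dict) -> None:
        self.old.write(msg)  # old cluster stays authoritative until cutover
        self.new.write(msg)  # new cluster receives the identical write

    def read(self, channel_id) -> list:
        return (self.new if self.read_from_new else self.old).read(channel_id)
```

Because both sides receive every write, reads can be shifted incrementally and rolled back instantly if consistency checks fail, which is what makes the zero-downtime cutover possible.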

Key Engineering Lessons