Blog 1 — OpenAI: Scaling PostgreSQL to 800M ChatGPT Users


Qubits of DPK

March 21, 2026

Core Case Studies
Published: Early 2026
Core Lesson: A single PostgreSQL primary can serve 800M users with the right engineering discipline — no sharding needed.

Quick Revision

  • Problem: Scale a write-heavy relational system without jumping straight to sharding.
  • Core pattern: Read replicas, PgBouncer, caching, and workload isolation.
  • Interview one-liner: Exhaust app-layer and query-level optimizations before reaching for sharding.

Architecture Overview

800M Users
      │
      ▼
Load Balancer
      │
      ▼
Application Servers (Kubernetes)
      ├──► Redis Cache  (most requests stop here)
      │
      ▼
PRIMARY PostgreSQL  ◄── ALL writes ONLY
(Azure Flexible Server)
      │ WAL logs
      ▼
~50 Read Replicas across regions  ◄── ALL reads

3 Pillars of Their Strategy

Pillar 1 — Offload Reads Aggressively

  • All SELECT queries → replicas (NOT primary)
  • Some reads stay on primary only when they are part of write transactions
  • Tradeoff accepted: replica lag means slightly stale reads, but it cuts primary load by roughly 10x (see the routing sketch below)
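A minimal Python sketch of this routing, assuming the psycopg2 driver; the DSNs and helper names are placeholders, not OpenAI's code:

# Read/write split sketch: writes go to the primary, plain SELECTs go to
# a randomly chosen replica. DSNs are hypothetical placeholders.
import random
import psycopg2

PRIMARY_DSN = "host=primary.internal dbname=app"           # ALL writes
REPLICA_DSNS = [f"host=replica-{i}.internal dbname=app"    # ALL reads
                for i in range(50)]

def run_write(sql, params=()):
    # Writes (and reads inside write transactions) stay on the primary.
    conn = psycopg2.connect(PRIMARY_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

def run_read(sql, params=()):
    # Plain SELECTs hit a replica; results may lag the primary slightly,
    # which is the tradeoff accepted above.
    conn = psycopg2.connect(random.choice(REPLICA_DSNS))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()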

Pillar 2 — Reduce Write Pressure

  • Migrated write-heavy workloads to Azure CosmosDB (sharded)
  • Fixed application bugs causing redundant writes
  • Introduced lazy writes — buffer in memory, batch INSERT later
  • Enforced rate limits during backfills
Lazy Write Example:
BAD:  User clicks "like" → IMMEDIATELY INSERT into DB
GOOD: User clicks "like" → buffer in memory → after 100 likes → ONE batch INSERT
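A minimal Python sketch of the same pattern, assuming psycopg2; the table name, DSN, and flush threshold are illustrative:

# Lazy-write sketch: buffer "like" events in memory, flush as ONE batch
# INSERT. Tradeoff: buffered events are lost if the process crashes.
import psycopg2
from psycopg2.extras import execute_values

_buffer = []
FLUSH_AT = 100  # flush after 100 buffered likes

def record_like(user_id, post_id):
    _buffer.append((user_id, post_id))
    if len(_buffer) >= FLUSH_AT:
        flush_likes()

def flush_likes():
    global _buffer
    if not _buffer:
        return
    with psycopg2.connect("host=primary.internal dbname=app") as conn:
        with conn.cursor() as cur:
            # One batch INSERT instead of 100 single-row INSERTs.
            execute_values(
                cur,
                "INSERT INTO likes (user_id, post_id) VALUES %s",
                _buffer)
    _buffer = []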

Pillar 3 — High Availability for Primary

  • Primary runs in HA mode with a hot standby
  • Hot standby = continuously synced replica, ready to promote instantly
  • Result: 99.999% availability (five-nines)
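Since PostgreSQL 12, promotion can even be triggered over SQL with pg_promote(); a minimal sketch (the surrounding failover orchestration, health checks and traffic cutover, is out of scope, and the host name is illustrative):

# Failover sketch: promote the hot standby to become the new primary.
# pg_promote() is a real PostgreSQL function (v12+), run on the standby.
import psycopg2

conn = psycopg2.connect("host=standby.internal dbname=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # By default waits until promotion completes, then returns true.
    cur.execute("SELECT pg_promote();")
    print(cur.fetchone())  # (True,) on success
conn.close()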

Connection Pooling — PgBouncer (The Hidden Hero)

The Problem:
10,000 App Pods × 1 connection each = 10,000 connections
PostgreSQL max_connections ≈ 5,000
→ DB crashes 💥
The Solution — PgBouncer as proxy:
10,000 App Pods
      │
      ▼
PgBouncer  (maintains only ~200 real DB connections)
      │
      ▼
PostgreSQL  (healthy, sees only ~200 connections)
Result: Connection time dropped from 50ms → 5ms (10x improvement)
3 PgBouncer Modes (interviewers love asking this):
  1. Session pooling: one server connection held for the client's entire session (safest, least efficient)
  2. Transaction pooling: server connection returned to the pool after each transaction
  3. Statement pooling: connection returned after every single statement (most restrictive)
OpenAI uses transaction pooling, the most efficient mode for stateless API servers.
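From the application side the change is invisible; a minimal sketch assuming the psycopg2 driver, where the pgbouncer.ini values in the comments are illustrative, not OpenAI's actual settings:

# Pooling sketch: app pods connect to PgBouncer (port 6432, its default),
# not to PostgreSQL (5432) directly. Same driver, same SQL.
# Assumed pgbouncer.ini essentials (values illustrative):
#   [pgbouncer]
#   pool_mode = transaction    ; connection reused after each transaction
#   max_client_conn = 10000    ; client connections PgBouncer accepts
#   default_pool_size = 200    ; real server connections per db/user pair
import psycopg2

conn = psycopg2.connect("host=pgbouncer.internal port=6432 dbname=app")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
conn.commit()
conn.close()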

Query Optimization

  • Long-running transactions block VACUUM from cleaning up dead rows → enforced timeouts
  • ORM-generated queries caused 12-table JOINs → replaced with simpler raw SQL
  • Timeouts enforced:
idle_in_transaction_session_timeout  → kills transactions left idle
statement_timeout                    → kills runaway queries
client_idle_timeout (PgBouncer)      → kills ghost client connections
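These can also be set per session; a minimal sketch using real PostgreSQL parameters (the driver choice and the 30s/60s values are assumptions, not OpenAI's config):

# Timeout sketch: enforce limits at the session level.
# statement_timeout and idle_in_transaction_session_timeout are real
# PostgreSQL settings; the values here are illustrative.
import psycopg2

conn = psycopg2.connect("host=primary.internal dbname=app")
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '30s'")                    # runaway queries
    cur.execute("SET idle_in_transaction_session_timeout = '60s'")  # idle transactions
conn.commit()
conn.close()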

Thundering Herd Problem + Solution

The Problem:
Cache entry expires
10,000 requests arrive simultaneously
ALL miss cache
ALL hit DB at once
DB CRASHES 💥
Solution 1 — Cache Locking (OpenAI's approach):
Request 1 → acquires LOCK → queries DB → fills cache → releases lock
Requests 2 to 10,000 → see lock → wait for the cache to fill
DB receives only 1 query ✅
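A minimal Python sketch of cache locking with Redis, assuming the redis-py client; key names, TTLs, and the polling loop are illustrative, not OpenAI's code:

# Cache-lock sketch: only the lock holder queries the DB; everyone else
# polls the cache until the holder fills it.
import time
import redis

r = redis.Redis()

def get_with_lock(key, load_from_db, ttl=3600):
    val = r.get(key)
    if val is not None:
        return val
    # NX = set only if absent, so exactly one caller wins the lock.
    if r.set(f"lock:{key}", "1", nx=True, ex=10):
        try:
            val = load_from_db()        # only 1 query reaches the DB
            r.set(key, val, ex=ttl)
        finally:
            r.delete(f"lock:{key}")
        return val
    # Losers wait briefly for the winner to fill the cache.
    for _ in range(50):
        time.sleep(0.1)
        val = r.get(key)
        if val is not None:
            return val
    return load_from_db()  # fallback if the lock holder died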
Solution 2 — Request Collapsing:
1,000 requests for the same data → collapsed into 1 DB query → result returned to all 1,000
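In a single async process this is a few lines; an asyncio sketch (not from the blog), where concurrent callers for one key share one in-flight query:

# Request-collapsing ("single-flight") sketch: the first caller starts
# the real DB query; every concurrent caller awaits the same task.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def collapsed_get(key, load_from_db):
    task = _inflight.get(key)
    if task is None:
        task = asyncio.create_task(load_from_db(key))
        _inflight[key] = task
        # Drop the entry once the shared query finishes.
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    return await task  # all 1,000 callers share this one result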
Solution 3 — Cache Expiry Jitter:
All entries expire at T+3600                   → mass expiry storm
Each entry expires at T+3600 + random(0, 300)  → staggered, no storm
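At cache-write time this is a one-liner; a sketch assuming redis-py, matching the 0-300s window above:

# Expiry-jitter sketch: spread TTLs so entries don't all expire together.
import random
import redis

r = redis.Redis()

def set_with_jitter(key, value, base_ttl=3600, jitter=300):
    r.set(key, value, ex=base_ttl + random.randint(0, jitter))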

Cascading Replication (Future-Proofing)

Problem with 50 direct replicas:
Primary ──WAL──► Replica 1
Primary ──WAL──► Replica 2
...
Primary ──WAL──► Replica 50
Primary CPU gets crushed just on replication
Cascading Replication Solution:
Primary
 ├── Intermediate Replica A
 │       ├── Replica A1
 │       └── Replica A2
 └── Intermediate Replica B
         ├── Replica B1
         └── Replica B2

Primary only maintains 2 WAL streams ✅
Tradeoff: Slightly higher replica lag (10ms → 15-20ms). Acceptable for read-heavy, eventually consistent workloads.
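To verify the extra hop stays within budget, lag can be checked with a standard query on any replica; pg_last_xact_replay_timestamp() is a real PostgreSQL function, the host name is illustrative:

# Lag-check sketch: run on a replica to see how far it trails the primary.
import psycopg2

conn = psycopg2.connect("host=replica-b1.internal dbname=app")
with conn.cursor() as cur:
    cur.execute(
        "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag")
    print(cur.fetchone()[0])  # e.g. 0:00:00.015 (~15ms)
conn.close()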

Workload Isolation

Write-heavy and non-relational workloads were moved off the PostgreSQL primary (notably to Azure CosmosDB, as in Pillar 2), so the primary serves only core relational traffic.

Final Results

  • A single writable PostgreSQL primary serving 800M users, with no sharding
  • 99.999% (five-nines) availability via a hot standby
  • Connection time cut from 50ms to 5ms with PgBouncer
  • Primary load reduced ~10x by routing reads to Redis and ~50 replicas

5 Interview Questions This Blog Unlocks

Q1. Design the database layer for a system like ChatGPT

Answer framework:
  1. Redis cache layer → 80-90% of requests never hit the DB
  2. PgBouncer connection pooling → connection time 50ms → 5ms
  3. Single primary for all writes + ~50 read replicas for reads
  4. Workload isolation → write-heavy workloads to CosmosDB
  5. Hot standby for HA → five-nines availability
"Optimize at the application layer before touching the database architecture. Caching + connection pooling + read replicas can take you to 800M users without sharding."

Q2. What is a Thundering Herd problem and how do you solve it?

Definition: When a cached item expires and thousands of simultaneous requests all miss the cache and hit the DB at once → DB collapses.
3 Solutions:
  1. Cache Locking: the first request takes a lock, queries the DB, and fills the cache; all others wait for the lock.
  2. Request Collapsing: 1,000 requests deduplicated into 1 DB query, result returned to all.
  3. Cache Expiry Jitter: stagger expiry times randomly to avoid simultaneous mass expiry.

Q3. Why would you choose NOT to shard a database?

Problems sharding creates:
  • Cross-shard queries become complex (JOIN across shards)
  • Distributed transactions need 2-phase commit
  • Rebalancing when one shard gets hot
  • Every schema change multiplies by N shards
  • Hundreds of application endpoints need modification
OpenAI's checklist before sharding:
1. Read replicas
2. Query optimization
3. Connection pooling (PgBouncer)
4. Caching (Redis)
5. Lazy writes + batching
6. Workload isolation
7. ONLY then consider sharding
"Sharding is a last resort, not a first resort. Every app-layer optimization buys months of runway."

Q4. What is connection pooling and why does it matter at scale?

Problem: 10,000 app pods × 1 connection = 10,000 connections. PostgreSQL max ≈ 5,000. Crash.
Solution: PgBouncer sits between app and DB as a proxy. App pods connect to PgBouncer. PgBouncer maintains a controlled pool of ~200 real DB connections. DB stays healthy.
Result: 50ms → 5ms connection time (10x improvement)
Best mode for APIs: Transaction pooling — connection returned to pool after each transaction.

Q5. What is cascading replication and when would you use it?

Problem: At 50+ replicas, primary must stream WAL to each one — crushing primary CPU.
Solution: Intermediate replicas relay WAL downstream. Primary only talks to 2-3 intermediate nodes. They fan out to the rest.
Use when: 20+ replicas or multi-region expansion.
Tradeoff: Slightly higher replica lag (~5-10ms extra). Acceptable for eventually consistent read workloads.

Key Engineering Lessons (Summary)