Blog 1 — OpenAI: Scaling PostgreSQL to 800M ChatGPT Users


Qubits of DPK

March 21, 2026

Core Case Studies
Published: Early 2026
Core Lesson: A single PostgreSQL primary can serve 800M users with the right engineering discipline — no sharding needed.

Quick Revision

  • Problem: Scale a write-heavy relational system without jumping straight to sharding.
  • Core pattern: Read replicas, PgBouncer, caching, and workload isolation.
  • Interview one-liner: Exhaust app-layer and query-level optimizations before reaching for sharding.

Architecture Overview

800M Users
      │
      ▼
Load Balancer
      │
      ▼
Application Servers (Kubernetes)
      ├──► Redis Cache  (most requests stop here)
      │
      ▼
PRIMARY PostgreSQL  ◄── ALL writes ONLY
(Azure Flexible Server)
      │ WAL logs
      ▼
~50 Read Replicas across regions  ◄── ALL reads

3 Pillars of Their Strategy

Pillar 1 — Offload Reads Aggressively

  • All SELECT queries → replicas (NOT primary)
  • Some reads stay on primary only when they are part of write transactions
  • Tradeoff accepted: replica lag means slightly stale reads, but it cuts primary load by roughly 10x (see the routing sketch below)
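A minimal Python sketch of this routing, assuming the psycopg2 driver; the DSNs and helper names are placeholders, not OpenAI's code:

# Read/write split sketch: writes go to the primary, plain SELECTs go to
# a randomly chosen replica. DSNs are hypothetical placeholders.
import random
import psycopg2

PRIMARY_DSN = "host=primary.internal dbname=app"           # ALL writes
REPLICA_DSNS = [f"host=replica-{i}.internal dbname=app"    # ALL reads
                for i in range(50)]

def run_write(sql, params=()):
    # Writes (and reads inside write transactions) stay on the primary.
    conn = psycopg2.connect(PRIMARY_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

def run_read(sql, params=()):
    # Plain SELECTs hit a replica; results may lag the primary slightly,
    # which is the tradeoff accepted above.
    conn = psycopg2.connect(random.choice(REPLICA_DSNS))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()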

Pillar 2 — Reduce Write Pressure

  • Migrated write-heavy workloads to Azure CosmosDB (sharded)
  • Fixed application bugs causing redundant writes
  • Introduced lazy writes — buffer in memory, batch INSERT later
  • Enforced rate limits during backfills
Lazy Write Example:
BAD:  User clicks "like" → IMMEDIATELY INSERT into DB
GOOD: User clicks "like" → buffer in memory → after 100 likes → ONE batch INSERT
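A minimal Python sketch of the same pattern, assuming psycopg2; the table name, DSN, and flush threshold are illustrative:

# Lazy-write sketch: buffer "like" events in memory, flush as ONE batch
# INSERT. Tradeoff: buffered events are lost if the process crashes.
import psycopg2
from psycopg2.extras import execute_values

_buffer = []
FLUSH_AT = 100  # flush after 100 buffered likes

def record_like(user_id, post_id):
    _buffer.append((user_id, post_id))
    if len(_buffer) >= FLUSH_AT:
        flush_likes()

def flush_likes():
    global _buffer
    if not _buffer:
        return
    with psycopg2.connect("host=primary.internal dbname=app") as conn:
        with conn.cursor() as cur:
            # One batch INSERT instead of 100 single-row INSERTs.
            execute_values(
                cur,
                "INSERT INTO likes (user_id, post_id) VALUES %s",
                _buffer)
    _buffer = []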

Pillar 3 — High Availability for Primary

  • Primary runs in HA mode with a hot standby
  • Hot standby = continuously synced replica, ready to promote instantly
  • Result: 99.999% availability (five-nines)
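Since PostgreSQL 12, promotion can even be triggered over SQL with pg_promote(); a minimal sketch (the surrounding failover orchestration, health checks and traffic cutover, is out of scope, and the host name is illustrative):

# Failover sketch: promote the hot standby to become the new primary.
# pg_promote() is a real PostgreSQL function (v12+), run on the standby.
import psycopg2

conn = psycopg2.connect("host=standby.internal dbname=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # By default waits until promotion completes, then returns true.
    cur.execute("SELECT pg_promote();")
    print(cur.fetchone())  # (True,) on success
conn.close()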

Connection Pooling — PgBouncer (The Hidden Hero)

The Problem:
10,000 App Pods × 1 connection each = 10,000 connections
PostgreSQL max_connections ≈ 5,000
→ DB crashes 💥
The Solution — PgBouncer as proxy:
10,000 App Pods
      │
      ▼
PgBouncer  (maintains only ~200 real DB connections)
      │
      ▼
PostgreSQL  (healthy, sees only ~200 connections)
Result: Connection time dropped from 50ms → 5ms (10x improvement)
3 PgBouncer Modes (interviewers love asking this):
  1. Session pooling: one server connection held for the client's entire session (safest, least efficient)
  2. Transaction pooling: server connection returned to the pool after each transaction
  3. Statement pooling: connection returned after every single statement (most restrictive)
OpenAI uses transaction pooling, the most efficient mode for stateless API servers.
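From the application side the change is invisible; a minimal sketch assuming the psycopg2 driver, where the pgbouncer.ini values in the comments are illustrative, not OpenAI's actual settings:

# Pooling sketch: app pods connect to PgBouncer (port 6432, its default),
# not to PostgreSQL (5432) directly. Same driver, same SQL.
# Assumed pgbouncer.ini essentials (values illustrative):
#   [pgbouncer]
#   pool_mode = transaction    ; connection reused after each transaction
#   max_client_conn = 10000    ; client connections PgBouncer accepts
#   default_pool_size = 200    ; real server connections per db/user pair
import psycopg2

conn = psycopg2.connect("host=pgbouncer.internal port=6432 dbname=app")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
conn.commit()
conn.close()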

Query Optimization

  • Long-running transactions block VACUUM from cleaning up dead rows → enforced timeouts
  • ORM-generated queries caused 12-table JOINs → replaced with simpler raw SQL
  • Timeouts enforced:
idle_in_transaction_session_timeout  → kills transactions left idle
statement_timeout                    → kills runaway queries
client_idle_timeout (PgBouncer)      → kills ghost client connections
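These can also be set per session; a minimal sketch using real PostgreSQL parameters (the driver choice and the 30s/60s values are assumptions, not OpenAI's config):

# Timeout sketch: enforce limits at the session level.
# statement_timeout and idle_in_transaction_session_timeout are real
# PostgreSQL settings; the values here are illustrative.
import psycopg2

conn = psycopg2.connect("host=primary.internal dbname=app")
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '30s'")                    # runaway queries
    cur.execute("SET idle_in_transaction_session_timeout = '60s'")  # idle transactions
conn.commit()
conn.close()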

Thundering Herd Problem + Solution

The Problem:
Cache entry expires
10,000 requests arrive simultaneously
ALL miss cache
ALL hit DB at once
DB CRASHES 💥
Solution 1 — Cache Locking (OpenAI's approach):
Request 1 → acquires LOCK → queries DB → fills cache → releases lock
Requests 2 to 10,000 → see lock → wait for the cache to fill
DB receives only 1 query ✅
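A minimal Python sketch of cache locking with Redis, assuming the redis-py client; key names, TTLs, and the polling loop are illustrative, not OpenAI's code:

# Cache-lock sketch: only the lock holder queries the DB; everyone else
# polls the cache until the holder fills it.
import time
import redis

r = redis.Redis()

def get_with_lock(key, load_from_db, ttl=3600):
    val = r.get(key)
    if val is not None:
        return val
    # NX = set only if absent, so exactly one caller wins the lock.
    if r.set(f"lock:{key}", "1", nx=True, ex=10):
        try:
            val = load_from_db()        # only 1 query reaches the DB
            r.set(key, val, ex=ttl)
        finally:
            r.delete(f"lock:{key}")
        return val
    # Losers wait briefly for the winner to fill the cache.
    for _ in range(50):
        time.sleep(0.1)
        val = r.get(key)
        if val is not None:
            return val
    return load_from_db()  # fallback if the lock holder died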
Solution 2 — Request Collapsing:
1,000 requests for the same data → collapsed into 1 DB query → result returned to all 1,000
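In a single async process this is a few lines; an asyncio sketch (not from the blog), where concurrent callers for one key share one in-flight query:

# Request-collapsing ("single-flight") sketch: the first caller starts
# the real DB query; every concurrent caller awaits the same task.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def collapsed_get(key, load_from_db):
    task = _inflight.get(key)
    if task is None:
        task = asyncio.create_task(load_from_db(key))
        _inflight[key] = task
        # Drop the entry once the shared query finishes.
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    return await task  # all 1,000 callers share this one result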
Solution 3 — Cache Expiry Jitter:
All entries expire at T+3600                   → mass expiry storm
Each entry expires at T+3600 + random(0, 300)  → staggered, no storm
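At cache-write time this is a one-liner; a sketch assuming redis-py, matching the 0-300s window above:

# Expiry-jitter sketch: spread TTLs so entries don't all expire together.
import random
import redis

r = redis.Redis()

def set_with_jitter(key, value, base_ttl=3600, jitter=300):
    r.set(key, value, ex=base_ttl + random.randint(0, jitter))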

Cascading Replication (Future-Proofing)

Problem with 50 direct replicas:
Primary ──WAL──► Replica 1
Primary ──WAL──► Replica 2
...
Primary ──WAL──► Replica 50
Primary CPU gets crushed just on replication
Cascading Replication Solution:
Primary
 ├── Intermediate Replica A
 │       ├── Replica A1
 │       └── Replica A2
 └── Intermediate Replica B
         ├── Replica B1
         └── Replica B2

Primary only maintains 2 WAL streams ✅
Tradeoff: Slightly higher replica lag (10ms → 15-20ms). Acceptable for read-heavy, eventually consistent workloads.
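To verify the extra hop stays within budget, lag can be checked with a standard query on any replica; pg_last_xact_replay_timestamp() is a real PostgreSQL function, the host name is illustrative:

# Lag-check sketch: run on a replica to see how far it trails the primary.
import psycopg2

conn = psycopg2.connect("host=replica-b1.internal dbname=app")
with conn.cursor() as cur:
    cur.execute(
        "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag")
    print(cur.fetchone()[0])  # e.g. 0:00:00.015 (~15ms)
conn.close()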

Workload Isolation

Write-heavy and non-relational workloads were moved off the PostgreSQL primary (notably to Azure CosmosDB, as in Pillar 2), so the primary serves only core relational traffic.

Final Results

  • A single writable PostgreSQL primary serving 800M users, with no sharding
  • 99.999% (five-nines) availability via a hot standby
  • Connection time cut from 50ms to 5ms with PgBouncer
  • Primary load reduced ~10x by routing reads to Redis and ~50 replicas

5 Interview Questions This Blog Unlocks

Q1. Design the database layer for a system like ChatGPT

Answer framework:
  1. Redis cache layer → 80-90% of requests never hit the DB
  2. PgBouncer connection pooling → connection time 50ms → 5ms
  3. Single primary for all writes + ~50 read replicas for reads
  4. Workload isolation → write-heavy workloads to CosmosDB
  5. Hot standby for HA → five-nines availability
"Optimize at the application layer before touching the database architecture. Caching + connection pooling + read replicas can take you to 800M users without sharding."

Q2. What is a Thundering Herd problem and how do you solve it?

Definition: When a cached item expires and thousands of simultaneous requests all miss the cache and hit the DB at once → DB collapses.
3 Solutions:
  1. Cache Locking: the first request takes a lock, queries the DB, and fills the cache; all others wait for the lock.
  2. Request Collapsing: 1,000 requests deduplicated into 1 DB query, result returned to all.
  3. Cache Expiry Jitter: stagger expiry times randomly to avoid simultaneous mass expiry.

Q3. Why would you choose NOT to shard a database?

Problems sharding creates:
  • Cross-shard queries become complex (JOIN across shards)
  • Distributed transactions need 2-phase commit
  • Rebalancing when one shard gets hot
  • Every schema change multiplies by N shards
  • Hundreds of application endpoints need modification
OpenAI's checklist before sharding:
1. Read replicas
2. Query optimization
3. Connection pooling (PgBouncer)
4. Caching (Redis)
5. Lazy writes + batching
6. Workload isolation
7. ONLY then consider sharding
"Sharding is a last resort, not a first resort. Every app-layer optimization buys months of runway."

Q4. What is connection pooling and why does it matter at scale?

Problem: 10,000 app pods × 1 connection = 10,000 connections. PostgreSQL max ≈ 5,000. Crash.
Solution: PgBouncer sits between app and DB as a proxy. App pods connect to PgBouncer. PgBouncer maintains a controlled pool of ~200 real DB connections. DB stays healthy.
Result: 50ms → 5ms connection time (10x improvement)
Best mode for APIs: Transaction pooling — connection returned to pool after each transaction.

Q5. What is cascading replication and when would you use it?

Problem: At 50+ replicas, primary must stream WAL to each one — crushing primary CPU.
Solution: Intermediate replicas relay WAL downstream. Primary only talks to 2-3 intermediate nodes. They fan out to the rest.
Use when: 20+ replicas or multi-region expansion.
Tradeoff: Slightly higher replica lag (~5-10ms extra). Acceptable for eventually consistent read workloads.

Key Engineering Lessons (Summary)