Blog 13 — Dropbox: How We Scaled Storage
Qubits of DPK
March 21, 2026
Core Case Studies
Core Concepts: Object storage internals, file chunking, deduplication, S3-like architecture
Why SDE-2 Critical: File storage design comes up in interviews at Google, Dropbox, Box, and anywhere that handles user-generated content
Status: Draft notes ready
Quick Revision
- Problem: Store huge files cheaply while supporting resume, sync, and deduplication.
- Core pattern: Chunking, content-addressed blocks, and separate metadata storage.
- Interview one-liner: Split bytes from metadata so each can scale in the way it naturally needs to.
Architecture Overview
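A minimal sketch of the upload path described in Q1 below: chunk on the client, ask the server which hashes it is missing, upload only those blocks, then commit metadata. The `api` calls are hypothetical stand-ins (not Dropbox's real endpoints), and `chunkFile` is the helper sketched under File Chunking below.

```javascript
// Upload path sketch; api.* are invented server calls for illustration.
async function upload(path, buffer, api) {
  const chunks = chunkFile(buffer); // 4 MB chunks, each SHA-256 hashed
  const missing = new Set(await api.missingHashes(chunks.map((c) => c.hash)));
  for (const c of chunks) {
    if (missing.has(c.hash)) await api.putBlock(c.hash, c.data); // dedup: only new bytes travel
  }
  // Metadata (file path -> ordered chunk hashes) is committed separately.
  await api.commitMetadata(path, chunks.map(({ seq, hash }) => ({ seq, hash })));
}
```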
Core Concepts
File Chunking
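A minimal chunking sketch, assuming Node.js and an in-memory buffer (a real client would stream from disk); the 4 MB chunk size and SHA-256 hashing match Q1 and Q2 below.

```javascript
const crypto = require("crypto");

const CHUNK_SIZE = 4 * 1024 * 1024; // 4 MB, per Q1

// Split a buffer into fixed-size chunks; the SHA-256 of each chunk's
// bytes becomes its identity for dedup and resumable uploads.
function chunkFile(buffer) {
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += CHUNK_SIZE) {
    const data = buffer.subarray(offset, offset + CHUNK_SIZE);
    const hash = crypto.createHash("sha256").update(data).digest("hex");
    chunks.push({ seq: chunks.length, hash, data });
  }
  return chunks;
}
```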
Content-Based Deduplication
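A toy content-addressed store to make the idea concrete; a `Map` stands in for real block storage, and all names are illustrative.

```javascript
// Content-addressed store: the key is the hash of the bytes, so identical
// content from any number of users lands on the same key exactly once.
const blocks = new Map(); // hash -> bytes (stand-in for block storage)

function putBlock(hash, data) {
  if (blocks.has(hash)) return false; // already stored; nothing to do
  blocks.set(hash, data);
  return true;
}

// Pre-upload check: the client sends hashes, the server answers with
// the ones it does not have yet.
function missingHashes(hashes) {
  return hashes.filter((h) => !blocks.has(h));
}
```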
Metadata vs Block Storage (Separation of Concerns)
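A sketch of what the split looks like in data terms: the metadata row (MySQL at Dropbox, per Q1) records only order and hashes, while the bytes live in block storage. Field names and the sample record are illustrative.

```javascript
// Metadata knows *about* the file; block storage holds the bytes.
const fileMeta = {
  path: "/photos/trip.mp4", // hypothetical example record
  version: 7,
  chunks: [
    { seq: 0, hash: "9f2c…", size: 4194304 },
    { seq: 1, hash: "b41d…", size: 2097152 },
  ],
};

// Reassembly: fetch each chunk from block storage by hash, in order.
async function readFile(meta, fetchBlock) {
  const parts = await Promise.all(meta.chunks.map((c) => fetchBlock(c.hash)));
  return Buffer.concat(parts);
}
```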
Magic Pocket — Dropbox's Custom Object Storage
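Magic Pocket replaced S3 with erasure-coded storage on custom hardware (see Q3 and Q5). Below is a placement sketch under that assumption; `encodeRS` stands in for a Reed-Solomon encoder and the disk interface is invented for illustration, not Magic Pocket's actual layout.

```javascript
// Write one block with 9+3 erasure coding: 12 fragments spread across 12
// failure domains; any 9 surviving fragments can rebuild the block.
async function placeBlock(hash, data, disks, encodeRS) {
  const fragments = encodeRS(data, 9, 3); // hypothetical: 9 data + 3 parity
  if (disks.length < fragments.length) throw new Error("need 12 failure domains");
  await Promise.all(fragments.map((frag, i) => disks[i].write(hash, i, frag)));
}
```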
Sync Protocol
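Delta sync (Q1) re-chunks the edited file and uploads only chunks whose hashes the server has not already seen. A minimal sketch; the real protocol is richer than this.

```javascript
// Delta sync: unchanged chunks keep their hashes, so an in-place edit
// to a 1 GB file re-uploads only the chunks it actually touches.
function chunksToUpload(previousHashes, newChunks) {
  const known = new Set(previousHashes);
  return newChunks.filter((c) => !known.has(c.hash));
}
```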
Scale Achieved
- 500+ PB stored (see Q5), with custom storage running at roughly 80% lower cost than cloud object storage.
5 Interview Questions This Blog Unlocks
Q1. Design Dropbox / Google Drive
Answer: Client chunks files (4 MB) and hashes each chunk. Check the server for existing chunks (deduplication). Upload only new chunks to block storage (S3 or custom). Metadata in MySQL (file structure + chunk mapping). Delta sync: only changed chunks travel on edits. Resumable uploads via chunk-level tracking.
Q2. What is content-based deduplication and why is it powerful?
Answer: Store files by their content hash (SHA-256), not by name. Same bytes → same hash → stored once regardless of how many users have it. If 1M users share a viral video, 1 copy stored. Works at chunk level too — common data (OS files, templates) deduplicated across all users.
Q3. What is erasure coding and how does it compare to replication?
Answer: Replication: store N full copies; 3x replication = 200% storage overhead and survives losing 2 copies. Erasure coding: split data into K data chunks + M parity chunks; reconstruct from any K of the K+M. 9+3 erasure coding = 33% overhead and survives losing any 3 chunks, so fault tolerance is at least as good as 3x replication. More complex, but roughly 6x less storage overhead (200% vs 33%).
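The overhead arithmetic from this answer, as a quick sketch:

```javascript
// Storage overhead = extra bytes stored / original bytes.
const replicationOverhead = (copies) => copies - 1; // 3x -> 2.00 (200%)
const erasureOverhead = (k, m) => m / k;            // 9+3 -> 0.33 (33%)

console.log(replicationOverhead(3));                          // 2 -> 200% overhead
console.log(erasureOverhead(9, 3));                           // 0.333... -> ~33% overhead
console.log(replicationOverhead(3) / erasureOverhead(9, 3));  // 6 -> ~6x less overhead
```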
Q4. How would you design resumable file uploads?
Answer: Chunk the file client-side and assign each chunk a sequence number. Track upload progress server-side. On failure, the client asks the server: "which chunks did you receive?" The server returns the missing chunk IDs, and the client retransmits only those. Even a 10 GB upload can resume from the exact failure point.
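A minimal sketch of the resume handshake, assuming the server keeps a set of received sequence numbers per upload session; the names and session shape are illustrative.

```javascript
// Server side: track which chunk sequence numbers have arrived.
const sessions = new Map(); // uploadId -> Set of received seq numbers

function receiveChunk(uploadId, seq /*, bytes */) {
  if (!sessions.has(uploadId)) sessions.set(uploadId, new Set());
  sessions.get(uploadId).add(seq); // the chunk bytes would be persisted here
}

// Resume: client asks which chunks are missing and resends only those.
function missingChunks(uploadId, totalChunks) {
  const got = sessions.get(uploadId) ?? new Set();
  const missing = [];
  for (let seq = 0; seq < totalChunks; seq++) {
    if (!got.has(seq)) missing.push(seq);
  }
  return missing;
}
```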
Q5. Why did Dropbox move off AWS S3 to their own storage?
Answer: At extreme scale (500+ PB), cloud storage costs dominate. Custom hardware + software can achieve the same reliability at roughly 80% lower cost. Erasure coding instead of S3's replication scheme, plus full control over performance optimizations. The upfront engineering investment pays off in years, not decades.