Blog 13 — Dropbox: How We Scaled Storage


Qubits of DPK

March 21, 2026

Core Case Studies
Core Concepts: object storage internals, file chunking, deduplication, S3-like architecture
Why SDE-2 Critical: file storage design is asked at Google, Dropbox, Box, and anywhere that handles user-generated content
Status: Draft notes ready

Quick Revision

  • Problem: Store huge files cheaply while supporting resume, sync, and deduplication.
  • Core pattern: Chunking, content-addressed blocks, and separate metadata storage.
  • Interview one-liner: Split bytes from metadata so each can scale in the way it naturally needs to.

Architecture Overview

text
User uploads file.pdf (100MB)
        ↓
Dropbox Client (chunking)
  └── Split into 4MB chunks
  └── Hash each chunk (SHA-256)
  └── Check which chunks already exist (deduplication)
  └── Upload only NEW chunks
        ↓
Block Server → S3 (or Magic Pocket, Dropbox's own storage)
        ↓
Metadata Server → MySQL (file structure, chunk mapping)

Core Concepts

File Chunking

text
Why chunk instead of uploading the whole file?

Problems with whole-file upload:
  100MB file, upload fails at 99MB → restart entire upload
  Same file on two devices → 2x storage
  Edit 1 word in a 100MB doc → re-upload entire 100MB

Chunking solution:
  Split into 4MB chunks
  Each chunk has its SHA-256 hash as ID

  Resume interrupted upload → restart from the failed chunk only
  Deduplication → same chunk across users = stored once
  Delta sync → only changed chunks re-uploaded
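
A minimal sketch of client-side chunking under the scheme above. The 4MB size matches the notes; chunk_file and its return shape are illustrative, not Dropbox's actual client API.

python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as in the notes above

def chunk_file(path):
    """Split a file into fixed-size chunks; each chunk's ID is its SHA-256 hash."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append((hashlib.sha256(data).hexdigest(), data))
    return chunks  # ordered (hash, bytes) pairs; the hash list is the file's manifest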

Content-Based Deduplication

text
Alice uploads photo.jpg (SHA-256 = "abc123")
Bob uploads the SAME photo.jpg (SHA-256 = "abc123")

Storage:
  Check: does block "abc123" exist? YES
  Store only a pointer for Bob, not duplicate bytes
  1 copy in storage, 2 users see it

Result:
  Dropbox stores ~1.5 billion unique files
  Without dedup: 10x+ more storage needed
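
A hedged sketch of the dedup check, with a plain dict standing in for the block index; a real deployment would consult a database or block server instead.

python
import hashlib

block_store = {}  # hash -> bytes; stands in for S3 / Magic Pocket plus its index

def put_block(data: bytes) -> str:
    """Store a block only if its content hash is new; return the hash either way."""
    h = hashlib.sha256(data).hexdigest()
    if h not in block_store:   # identical bytes from Alice and Bob hash identically
        block_store[h] = data  # first uploader pays the storage cost
    return h                   # later uploaders just get a pointer (the hash)

Calling put_block twice with the same bytes stores one copy; the second call only returns the existing hash.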

Metadata vs Block Storage (Separation of Concerns)

text
Metadata DB (MySQL):
  Stores: file names, folder structure, chunk hashes, version history
  Small, relational, queryable
  Example: file_id=456 is made of chunks [abc123, def456, ghi789]

Block Storage (S3 / Magic Pocket):
  Stores: actual bytes, addressed by hash
  Immutable, append-only, no metadata
  Optimized for large byte storage, not queries

Benefit: scale each independently
  Block storage grows with data volume
  Metadata DB grows with file count (much smaller)
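
A toy illustration of the separation, where two dicts stand in for the MySQL manifest table and the block store; the IDs and hashes are made up.

python
# Metadata side (MySQL in the real system): small, relational, queryable.
file_manifest = {456: ["abc123", "def456", "ghi789"]}  # file_id -> ordered chunk hashes

# Block side (S3 / Magic Pocket): opaque bytes keyed by content hash.
block_store = {"abc123": b"chunk-1", "def456": b"chunk-2", "ghi789": b"chunk-3"}

def read_file(file_id: int) -> bytes:
    """Reassemble a file: look up its manifest, then fetch each block by hash."""
    return b"".join(block_store[h] for h in file_manifest[file_id])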

Magic Pocket — Dropbox's Custom Object Storage

text
Dropbox left S3 in 2016 to build Magic Pocket:
  Why? At their scale, S3 costs were $40M+/year

Magic Pocket:
  Custom hardware (HDDs in 90-disk JBOD enclosures)
  Custom software (Go-based object store)
  Erasure coding instead of 3x replication:
    3x replication: store 3 copies → 200% overhead
    Erasure coding: 9+3 scheme → 33% overhead
    Comparable durability, ~6x less redundancy overhead

Result: Saved tens of millions per year
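
To make erasure coding concrete, here is a toy 2-data + 1-parity scheme using XOR; Magic Pocket's 9+3 scheme uses Reed-Solomon codes, but the reconstruction principle is the same.

python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# 2 data chunks + 1 XOR parity chunk (50% overhead; 9+3 gets this down to 33%).
d1, d2 = b"datadata", b"moredata"
parity = xor_bytes(d1, d2)

# The disk holding d1 dies: rebuild it from the survivors instead of a full replica.
recovered = xor_bytes(parity, d2)
assert recovered == d1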

Sync Protocol

text
Delta sync (only changed chunks):
  File edit: only 1 of 25 chunks changed
  Upload 1 chunk (4MB) instead of the full file (100MB)
  → 96% bandwidth reduction

Conflict resolution:
  Both devices edit the same file offline
  Create a "conflicted copy" for one version
  User resolves manually (Dropbox doesn't auto-merge)
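
A sketch of delta detection by diffing chunk-hash manifests; the manifest layout is an assumption, not Dropbox's wire format.

python
def changed_chunks(old: list[str], new: list[str]) -> list[tuple[int, str]]:
    """Return (position, hash) for chunks that differ from the last synced version."""
    return [(i, h) for i, h in enumerate(new) if i >= len(old) or old[i] != h]

# 1 of 25 chunks edited -> only that 4MB chunk goes over the wire.
old = [f"hash{i}" for i in range(25)]
new = old.copy()
new[7] = "hash7-edited"
assert changed_chunks(old, new) == [(7, "hash7-edited")]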

Scale Achieved

  • 500+ PB of user data stored (the figure cited in Q5 below)
  • ~1.5 billion unique files after deduplication (per the dedup notes above)

5 Interview Questions This Blog Unlocks

Q1. Design Dropbox / Google Drive

Answer: Client chunks files (4MB), hashes each chunk. Check server for existing chunks (deduplication). Upload only new chunks to block storage (S3/custom). Metadata in MySQL (file structure + chunk mapping). Delta sync — only changed chunks on edits. Resumable uploads via chunk-level tracking.

Q2. What is content-based deduplication and why is it powerful?

Answer: Store files by their content hash (SHA-256), not by name. Same bytes → same hash → stored once regardless of how many users have it. If 1M users share a viral video, 1 copy stored. Works at chunk level too — common data (OS files, templates) deduplicated across all users.

Q3. What is erasure coding and how does it compare to replication?

Answer: Replication stores N full copies: 3x replication = 200% storage overhead. Erasure coding splits data into K data chunks + M parity chunks and can reconstruct from any K of the K+M. A 9+3 scheme = 33% overhead with comparable fault tolerance to 3x replication. More complex, but ~6x less redundancy overhead.

Q4. How would you design resumable file uploads?

Answer: Chunk the file client-side. Assign each chunk a sequence number. Track upload progress server-side. On failure, client queries server: "which chunks did you receive?" Server returns missing chunk IDs. Client retransmits only those. Even 10GB uploads can resume from exact failure point.
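
A minimal sketch of that resume handshake, with an in-memory set standing in for server-side progress tracking; all names here are illustrative.

python
received: set[int] = set()  # server side: sequence numbers stored so far

def missing_chunks(total: int) -> list[int]:
    """Server: answer the client's 'which chunks did you receive?' query."""
    return [i for i in range(total) if i not in received]

def resume_upload(chunks: list[bytes]) -> None:
    """Client: retransmit only the chunks the server reports missing."""
    for i in missing_chunks(len(chunks)):
        received.add(i)  # stands in for sending and storing chunk i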

Q5. Why did Dropbox move off AWS S3 to their own storage?

Answer: At extreme scale (500+ PB), cloud storage costs dominate. Custom hardware + software can achieve the same reliability at ~80% lower cost: erasure coding instead of replication, plus full control over performance optimizations. The upfront engineering investment pays off in years, not decades.

Key Engineering Lessons