Blog 22 - Google: Bigtable: A Distributed Storage System for Structured Data


Qubits of DPK

April 8, 2026

Core Case Studies

Core Concept: Wide-column distributed database, Sorted key-value storage, Automatic sharding into tablets, SSTables, GFS-based storage layer, High scalability with horizontal partitioning
Why SDE-2 Critical: NoSQL database internals come up in system design rounds at every MAANG company
Status: Draft notes ready
Quick Revision
∙ Problem: Store and serve petabytes of structured data with low-latency reads and writes across thousands of machines.
∙ Core pattern: Rows sorted by key → split into tablets → each tablet served by a tablet server → backed by GFS.
∙ Interview one-liner: Bigtable scales structured storage by sharding sorted rows into tablets distributed across machines, with GFS handling durability.
Architecture Overview
Client sends read/write request
        ↓
Chubby Lock Service → locates root tablet
        ↓
Root Tablet → METADATA Tablets → target Tablet Server
        ↓
Tablet Server (owns specific row range)
├── Check MemTable → return if hit
├── Check SSTables on disk → merge results
└── Write path → commit log → MemTable → SSTable flush
        ↓
GFS stores SSTable files durably
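To make the lookup concrete, here is a minimal Python sketch of the three-level location hierarchy. Everything in it (the Chubby stand-in, the tablet maps, the server names) is illustrative, not Bigtable's real API:

import bisect

# Toy stand-ins for the location hierarchy. Each "tablet" is a sorted
# list of (end-row-key, location) pairs: the tablet covering a row is
# the first one whose end key is >= that row (mirroring how METADATA
# row keys encode a tablet's end row).
CHUBBY_ROOT_LOCATION = "tablet-server-7"   # read from a file in Chubby

ROOT_TABLET = [("g", "meta-tablet-1"), ("~", "meta-tablet-2")]
METADATA_TABLETS = {
    "meta-tablet-1": [("d", "ts-1"), ("g", "ts-2")],
    "meta-tablet-2": [("m", "ts-3"), ("~", "ts-4")],
}

def lookup(ranges, row_key):
    # Binary search for the first tablet whose end key covers row_key.
    end_keys = [end for end, _ in ranges]
    return ranges[bisect.bisect_left(end_keys, row_key)][1]

def locate_tablet_server(row_key):
    # Step 1: Chubby says where the root tablet lives (one file read).
    # (The toy skips contacting that server and indexes the maps directly.)
    meta = lookup(ROOT_TABLET, row_key)              # step 2: root tablet
    return lookup(METADATA_TABLETS[meta], row_key)   # step 3: METADATA

print(locate_tablet_server("com.google.www"))        # → ts-1

In practice clients cache tablet locations, so most requests skip this hierarchy entirely and only re-walk it on a cache miss.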
Core Concepts
Data Model
(row key, column family:column qualifier, timestamp) → value
Example — Web crawl:
Row key: "com.google.www"
Column family: contents:
Timestamps: t3 → "<html>...</html>"
t2 → "<html>...</html>"
t1 → "<html>...</html>"
Multiple versions of the same cell stored by timestamp.
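As a mental model (a sketch of the logical map only, nothing about physical layout), the data model nests naturally in Python:

# (row key, column family:qualifier, timestamp) → value,
# the newest version retrieved by taking the max timestamp in a cell.
table = {
    "com.google.www": {
        "contents:": {
            3: "<html>v3</html>",
            2: "<html>v2</html>",
            1: "<html>v1</html>",
        },
        "anchor:cnnsi.com": {2: "CNN"},
    },
}

def read_latest(row_key, column):
    versions = table[row_key][column]
    return versions[max(versions)]    # latest timestamp wins

print(read_latest("com.google.www", "contents:"))   # → <html>v3</html>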
Row Key Design — Lexicographic Sorting
Normal: www.google.com
Reversed: com.google.www ← stored this way
Why?
All rows sorted lexicographically by row key.
Reversing domain groups all subdomains together.
Range scan: com.google.*
→ Hits com.google.mail, com.google.maps, com.google.www
→ One sequential read, no scatter-gather
→ Avoids the high latency of random reads across many servers
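A short sketch of the trick, with a plain sorted list standing in for a tablet's sorted key space:

def reverse_domain(hostname):
    """www.google.com → com.google.www"""
    return ".".join(reversed(hostname.split(".")))

rows = sorted(map(reverse_domain, [
    "www.google.com", "mail.google.com", "maps.google.com",
    "www.cnn.com", "www.wikipedia.org",
]))
# → ['com.cnn.www', 'com.google.mail', 'com.google.maps',
#    'com.google.www', 'org.wikipedia.www']

# All google.com pages are now contiguous in the sorted key space:
# one range scan over the prefix replaces a scatter-gather.
print([r for r in rows if r.startswith("com.google.")])
# → ['com.google.mail', 'com.google.maps', 'com.google.www']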
Column Families
Table: WebCrawl
Column family: contents → stores page HTML
Column family: anchor → stores inbound links
Column family: language → stores detected language
Rules:
→ Column families defined at table creation (schema)
→ Columns within a family added dynamically (schema-free)
→ Each family stored together on disk (co-location)
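A toy table class (all names hypothetical) makes the rule concrete: families are fixed at creation, qualifiers are not:

class ToyTable:
    """Toy model: families fixed at creation, qualifiers schema-free."""
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # the only schema there is
        self.rows = {}

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # Any qualifier is accepted without a schema change.
        self.rows.setdefault(row_key, {})[column] = value

crawl = ToyTable("WebCrawl", ["contents", "anchor", "language"])
crawl.put("com.google.www", "anchor:cnn.com", "CNN")      # new column: fine
crawl.put("com.google.www", "anchor:bbc.co.uk", "BBC")    # another: fine
# crawl.put("com.google.www", "images:logo", "...")       # → KeyError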
Tablets — Automatic Sharding
Table rows: A → Z
Tablet 1: A → F → Tablet Server 1
Tablet 2: G → M → Tablet Server 2
Tablet 3: N → Z → Tablet Server 3
When tablet grows too large → splits automatically into two tablets
New tablet servers added → master reassigns tablets for balance
→ Horizontal scale with no downtime
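A sketch of a size-triggered split, assuming a tablet is just a sorted map of rows and using row count as a stand-in for the real threshold (tablets split at roughly 100-200 MB):

SPLIT_THRESHOLD = 4     # rows here; real tablets split by size in MB

class Tablet:
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))   # rows stay key-sorted

    def maybe_split(self):
        """Return (self,) or two half-range tablets if too large."""
        if len(self.rows) <= SPLIT_THRESHOLD:
            return (self,)
        keys = list(self.rows)
        mid = keys[len(keys) // 2]               # split near the middle
        left  = Tablet({k: v for k, v in self.rows.items() if k <  mid})
        right = Tablet({k: v for k, v in self.rows.items() if k >= mid})
        return (left, right)

t = Tablet({k: None for k in "ABCDEF"})
print([list(h.rows) for h in t.maybe_split()])
# → [['A', 'B', 'C'], ['D', 'E', 'F']]: each half can go to its own server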
SSTables — Storage Format
Write path:
  1. Write → commit log (WAL for durability)
  2. Write buffered in MemTable (in-memory sorted map)
  3. MemTable full → flush to immutable SSTable on GFS
Read path:
  1. Check MemTable first (most recent writes)
  2. Check SSTable files on disk
  3. Merge results (latest timestamp wins)
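The whole path fits in a short sketch; the commit log is a plain list and SSTables are frozen sorted dicts, all illustrative stand-ins:

import time

commit_log = []          # stand-in for the WAL (a log file on GFS)
memtable = {}            # in-memory map: key → (timestamp, value)
sstables = []            # immutable flushed segments, oldest first
MEMTABLE_LIMIT = 2       # toy flush threshold

def write(key, value):
    ts = time.time_ns()
    commit_log.append((key, value, ts))   # 1. log first, for durability
    memtable[key] = (ts, value)           # 2. buffer in the MemTable
    if len(memtable) >= MEMTABLE_LIMIT:   # 3. flush when full
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def read(key):
    if key in memtable:                   # 1. newest writes live here
        return memtable[key][1]
    for sst in reversed(sstables):        # 2. then newer SSTables first
        if key in sst:
            return sst[key][1]            # 3. latest version wins
    return None

write("a", 1); write("b", 2)              # second write triggers a flush
write("a", 3)                             # newer "a" sits in the MemTable
print(read("a"))                          # → 3 (MemTable shadows the SSTable)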
Compaction
Over time, multiple SSTables accumulate → compaction merges them
Minor compaction: MemTable → new SSTable on disk
Merging compaction: several SSTables → fewer SSTables
Major compaction: all SSTables → single SSTable
(permanently removes deleted/expired data)
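A merging-compaction sketch in the same toy model: merge by key, keep the latest timestamp, and let only a major compaction drop tombstones (DELETED is a hypothetical deletion marker):

DELETED = object()    # hypothetical tombstone marker

def compact(tables, major=False):
    """Merge several SSTables into one, latest timestamp winning."""
    merged = {}
    for sst in tables:
        for key, (ts, value) in sst.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if major:
        # Only a major compaction may drop tombstones: an SSTable left
        # out of a partial merge might still hold an old value that the
        # tombstone has to keep shadowing.
        merged = {k: v for k, v in merged.items() if v[1] is not DELETED}
    return dict(sorted(merged.items()))

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, DELETED)}                  # "a" was deleted at t=2
print(compact([old, new], major=True))     # → {'b': (1, 'v1')}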
Chubby Lock Service
Bigtable uses Chubby (distributed lock service) for:
→ Storing location of root tablet
→ Tablet server discovery and liveness detection
→ Schema and access control metadata
→ Master election (only one active master at a time)
If Chubby goes down → Bigtable becomes unavailable
→ Chubby is a hard dependency
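Master election then reduces to "whoever holds the lock is master". A toy sketch, with ChubbyCell as a hypothetical stand-in for the real service:

class ChubbyCell:
    """Hypothetical stand-in: an exclusive lock per file path."""
    def __init__(self):
        self._owners = {}

    def try_acquire(self, path, client):
        if self._owners.get(path) is None:
            self._owners[path] = client   # first caller wins the lock
            return True
        return False                      # lock held: caller stays standby

chubby = ChubbyCell()
for candidate in ("master-a", "master-b"):
    if chubby.try_acquire("/bigtable/master-lock", candidate):
        print(candidate, "elected master")   # → master-a elected master
    else:
        print(candidate, "stays standby")    # → master-b stays standby

If the active master's Chubby session expires, its lock is released and a standby can acquire it; the same session mechanism is how dead tablet servers are detected.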
Scale Achieved
From the 2006 paper: more than sixty Google products and projects used Bigtable (web indexing, Google Earth, Google Finance, Google Analytics, Orkut, Personalized Search, among others), across 388 production clusters with a combined total of roughly 24,500 tablet servers.
5 Interview Questions This Blog Unlocks
Q1. How does Bigtable achieve horizontal scalability?
Answer: Data is split into tablets (row ranges) distributed across tablet servers. As data grows, tablets split automatically and new servers absorb them. The master handles assignment — no single machine is a bottleneck.
Q2. Why are row keys reversed in Bigtable?
Answer: Rows are sorted lexicographically. Reversing domain names (com.google.www) groups all subdomains together, enabling efficient range scans per domain in one sequential read instead of random scatter-gather.
Q3. What is the write path in Bigtable?
Answer: Write goes to the commit log (WAL for crash durability), then to the in-memory MemTable. When MemTable is full, it flushes to an immutable SSTable on GFS. Reads merge MemTable and SSTable results, returning the latest timestamp.
Q4. How does Bigtable handle tablet server failures?
Answer: The master detects the failure via Chubby (the server's session lock expires). Its tablets are reassigned to healthy servers, which replay the failed server's commit log from GFS to recover un-flushed MemTable state. SSTables are already replicated at the storage layer, so no data is lost.
Q5. How is Bigtable different from a relational database?
Answer: Bigtable is schema-flexible (columns added dynamically per row), optimized for wide sparse tables, and scales horizontally via tablet sharding. Relational databases use fixed schemas, support joins and ACID transactions, but cannot scale to petabytes without massive complexity.
Key Engineering Lessons