Blog 22 - Google: Bigtable: A Distributed Storage System for Structured Data


Qubits of DPK

April 8, 2026

Core Case Studies

Core Concept: Wide-column distributed database, Sorted key-value storage, Automatic sharding into tablets, SSTables, GFS-based storage layer, High scalability with horizontal partitioning
Why SDE-2 Critical: NoSQL database internals come up in system design rounds at every MAANG company
Status: Draft notes ready
Quick Revision
∙ Problem: Store and serve petabytes of structured data with low-latency reads and writes across thousands of machines.
∙ Core pattern: Rows sorted by key → split into tablets → each tablet served by a tablet server → backed by GFS.
∙ Interview one-liner: Bigtable scales structured storage by sharding sorted rows into tablets distributed across machines, with GFS handling durability.
Architecture Overview
Client sends read/write request
        ↓
Chubby Lock Service → locates root tablet
        ↓
Root Tablet → METADATA Tablets → target Tablet Server
        ↓
Tablet Server (owns specific row range)
├── Check MemTable → return if hit
├── Check SSTables on disk → merge results
└── Write path → commit log → MemTable → SSTable flush
        ↓
GFS stores SSTable files durably
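To make the lookup concrete, here is a minimal Python sketch of the three-level location hierarchy. Everything in it (the Chubby stand-in, the tablet maps, the server names) is illustrative, not Bigtable's real API:

import bisect

# Toy stand-ins for the location hierarchy. Each "tablet" is a sorted
# list of (end-row-key, location) pairs: the tablet covering a row is
# the first one whose end key is >= that row (mirroring how METADATA
# row keys encode a tablet's end row).
CHUBBY_ROOT_LOCATION = "tablet-server-7"   # read from a file in Chubby

ROOT_TABLET = [("g", "meta-tablet-1"), ("~", "meta-tablet-2")]
METADATA_TABLETS = {
    "meta-tablet-1": [("d", "ts-1"), ("g", "ts-2")],
    "meta-tablet-2": [("m", "ts-3"), ("~", "ts-4")],
}

def lookup(ranges, row_key):
    # Binary search for the first tablet whose end key covers row_key.
    end_keys = [end for end, _ in ranges]
    return ranges[bisect.bisect_left(end_keys, row_key)][1]

def locate_tablet_server(row_key):
    # Step 1: Chubby says where the root tablet lives (one file read).
    # (The toy skips contacting that server and indexes the maps directly.)
    meta = lookup(ROOT_TABLET, row_key)              # step 2: root tablet
    return lookup(METADATA_TABLETS[meta], row_key)   # step 3: METADATA

print(locate_tablet_server("com.google.www"))        # → ts-1

In practice clients cache tablet locations, so most requests skip this hierarchy entirely and only re-walk it on a cache miss.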
Core Concepts
Data Model
(row key, column family:column qualifier, timestamp) → value
Example — Web crawl:
Row key: "com.google.www"
Column family: contents:
Timestamps: t3 → "<html>...</html>"
t2 → "<html>...</html>"
t1 → "<html>...</html>"
Multiple versions of the same cell stored by timestamp.
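As a mental model (a sketch of the logical map only, nothing about physical layout), the data model nests naturally in Python:

# (row key, column family:qualifier, timestamp) → value,
# the newest version retrieved by taking the max timestamp in a cell.
table = {
    "com.google.www": {
        "contents:": {
            3: "<html>v3</html>",
            2: "<html>v2</html>",
            1: "<html>v1</html>",
        },
        "anchor:cnnsi.com": {2: "CNN"},
    },
}

def read_latest(row_key, column):
    versions = table[row_key][column]
    return versions[max(versions)]    # latest timestamp wins

print(read_latest("com.google.www", "contents:"))   # → <html>v3</html>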
Row Key Design — Lexicographic Sorting
Normal: www.google.com
Reversed: com.google.www ← stored this way
Why?
All rows sorted lexicographically by row key.
Reversing domain groups all subdomains together.
Range scan: com.google.*
→ Hits com.google.mail, com.google.maps, com.google.www
→ One sequential read, no scatter-gather
→ Avoids the high latency of random reads across many servers
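A short sketch of the trick, with a plain sorted list standing in for a tablet's sorted key space:

def reverse_domain(hostname):
    """www.google.com → com.google.www"""
    return ".".join(reversed(hostname.split(".")))

rows = sorted(map(reverse_domain, [
    "www.google.com", "mail.google.com", "maps.google.com",
    "www.cnn.com", "www.wikipedia.org",
]))
# → ['com.cnn.www', 'com.google.mail', 'com.google.maps',
#    'com.google.www', 'org.wikipedia.www']

# All google.com pages are now contiguous in the sorted key space:
# one range scan over the prefix replaces a scatter-gather.
print([r for r in rows if r.startswith("com.google.")])
# → ['com.google.mail', 'com.google.maps', 'com.google.www']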
Column Families
Table: WebCrawl
Column family: contents → stores page HTML
Column family: anchor → stores inbound links
Column family: language → stores detected language
Rules:
→ Column families defined at table creation (schema)
→ Columns within a family added dynamically (schema-free)
→ Each family stored together on disk (co-location)
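A toy table class (all names hypothetical) makes the rule concrete: families are fixed at creation, qualifiers are not:

class ToyTable:
    """Toy model: families fixed at creation, qualifiers schema-free."""
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # the only schema there is
        self.rows = {}

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # Any qualifier is accepted without a schema change.
        self.rows.setdefault(row_key, {})[column] = value

crawl = ToyTable("WebCrawl", ["contents", "anchor", "language"])
crawl.put("com.google.www", "anchor:cnn.com", "CNN")      # new column: fine
crawl.put("com.google.www", "anchor:bbc.co.uk", "BBC")    # another: fine
# crawl.put("com.google.www", "images:logo", "...")       # → KeyError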
Tablets — Automatic Sharding
Table rows: A → Z
Tablet 1: A → F → Tablet Server 1
Tablet 2: G → M → Tablet Server 2
Tablet 3: N → Z → Tablet Server 3
When tablet grows too large → splits automatically into two tablets
New tablet servers added → master reassigns tablets for balance
→ Horizontal scale with no downtime
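A sketch of a size-triggered split, assuming a tablet is just a sorted map of rows and using row count as a stand-in for the real threshold (tablets split at roughly 100-200 MB):

SPLIT_THRESHOLD = 4     # rows here; real tablets split by size in MB

class Tablet:
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))   # rows stay key-sorted

    def maybe_split(self):
        """Return (self,) or two half-range tablets if too large."""
        if len(self.rows) <= SPLIT_THRESHOLD:
            return (self,)
        keys = list(self.rows)
        mid = keys[len(keys) // 2]               # split near the middle
        left  = Tablet({k: v for k, v in self.rows.items() if k <  mid})
        right = Tablet({k: v for k, v in self.rows.items() if k >= mid})
        return (left, right)

t = Tablet({k: None for k in "ABCDEF"})
print([list(h.rows) for h in t.maybe_split()])
# → [['A', 'B', 'C'], ['D', 'E', 'F']]: each half can go to its own server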
SSTables — Storage Format
Write path:
  1. Write → commit log (WAL for durability)
  2. Write buffered in MemTable (in-memory sorted map)
  3. MemTable full → flush to immutable SSTable on GFS
Read path:
  1. Check MemTable first (most recent writes)
  2. Check SSTable files on disk
  3. Merge results (latest timestamp wins)
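The whole path fits in a short sketch; the commit log is a plain list and SSTables are frozen sorted dicts, all illustrative stand-ins:

import time

commit_log = []          # stand-in for the WAL (a log file on GFS)
memtable = {}            # in-memory map: key → (timestamp, value)
sstables = []            # immutable flushed segments, oldest first
MEMTABLE_LIMIT = 2       # toy flush threshold

def write(key, value):
    ts = time.time_ns()
    commit_log.append((key, value, ts))   # 1. log first, for durability
    memtable[key] = (ts, value)           # 2. buffer in the MemTable
    if len(memtable) >= MEMTABLE_LIMIT:   # 3. flush when full
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def read(key):
    if key in memtable:                   # 1. newest writes live here
        return memtable[key][1]
    for sst in reversed(sstables):        # 2. then newer SSTables first
        if key in sst:
            return sst[key][1]            # 3. latest version wins
    return None

write("a", 1); write("b", 2)              # second write triggers a flush
write("a", 3)                             # newer "a" sits in the MemTable
print(read("a"))                          # → 3 (MemTable shadows the SSTable)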
Compaction
Over time, multiple SSTables accumulate → compaction merges them
Minor compaction: MemTable → new SSTable on disk
Merging compaction: several SSTables → fewer SSTables
Major compaction: all SSTables → single SSTable
(permanently removes deleted/expired data)
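A merging-compaction sketch in the same toy model: merge by key, keep the latest timestamp, and let only a major compaction drop tombstones (DELETED is a hypothetical deletion marker):

DELETED = object()    # hypothetical tombstone marker

def compact(tables, major=False):
    """Merge several SSTables into one, latest timestamp winning."""
    merged = {}
    for sst in tables:
        for key, (ts, value) in sst.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if major:
        # Only a major compaction may drop tombstones: an SSTable left
        # out of a partial merge might still hold an old value that the
        # tombstone has to keep shadowing.
        merged = {k: v for k, v in merged.items() if v[1] is not DELETED}
    return dict(sorted(merged.items()))

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, DELETED)}                  # "a" was deleted at t=2
print(compact([old, new], major=True))     # → {'b': (1, 'v1')}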
Chubby Lock Service
Bigtable uses Chubby (distributed lock service) for:
→ Storing location of root tablet
→ Tablet server discovery and liveness detection
→ Schema and access control metadata
→ Master election (only one active master at a time)
If Chubby goes down → Bigtable becomes unavailable
→ Chubby is a hard dependency
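Master election then reduces to "whoever holds the lock is master". A toy sketch, with ChubbyCell as a hypothetical stand-in for the real service:

class ChubbyCell:
    """Hypothetical stand-in: an exclusive lock per file path."""
    def __init__(self):
        self._owners = {}

    def try_acquire(self, path, client):
        if self._owners.get(path) is None:
            self._owners[path] = client   # first caller wins the lock
            return True
        return False                      # lock held: caller stays standby

chubby = ChubbyCell()
for candidate in ("master-a", "master-b"):
    if chubby.try_acquire("/bigtable/master-lock", candidate):
        print(candidate, "elected master")   # → master-a elected master
    else:
        print(candidate, "stays standby")    # → master-b stays standby

If the active master's Chubby session expires, its lock is released and a standby can acquire it; the same session mechanism is how dead tablet servers are detected.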
Scale Achieved
From the 2006 paper: more than sixty Google products and projects used Bigtable (web indexing, Google Earth, Google Finance, Google Analytics, Orkut, Personalized Search, among others), across 388 production clusters with a combined total of roughly 24,500 tablet servers.
5 Interview Questions This Blog Unlocks
Q1. How does Bigtable achieve horizontal scalability?
Answer: Data is split into tablets (row ranges) distributed across tablet servers. As data grows, tablets split automatically and new servers absorb them. The master handles assignment — no single machine is a bottleneck.
Q2. Why are row keys reversed in Bigtable?
Answer: Rows are sorted lexicographically. Reversing domain names (com.google.www) groups all subdomains together, enabling efficient range scans per domain in one sequential read instead of random scatter-gather.
Q3. What is the write path in Bigtable?
Answer: Write goes to the commit log (WAL for crash durability), then to the in-memory MemTable. When MemTable is full, it flushes to an immutable SSTable on GFS. Reads merge MemTable and SSTable results, returning the latest timestamp.
Q4. How does Bigtable handle tablet server failures?
Answer: The master detects the failure via Chubby (the server's session lock expires). Its tablets are reassigned to healthy servers, which replay the failed server's commit log from GFS to recover un-flushed MemTable state. SSTables are already replicated at the storage layer, so no data is lost.
Q5. How is Bigtable different from a relational database?
Answer: Bigtable is schema-flexible (columns added dynamically per row), optimized for wide sparse tables, and scales horizontally via tablet sharding. Relational databases use fixed schemas, support joins and ACID transactions, but cannot scale to petabytes without massive complexity.
Key Engineering Lessons