Blog 16 — LinkedIn: Building Real-Time Search

C

Qubits of DPK

March 21, 2026

Core Case Studies
Core Concept: Inverted index, Ranking algorithms, Typeahead/autocomplete at scale, Elasticsearch internals
Why SDE-2 Critical: Search exists in every product — Swiggy restaurant search, Amazon product search, LinkedIn people search
Status: Draft notes ready

Quick Revision

  • Problem: Search must feel instant even with huge corpora and constant updates.
  • Core pattern: Inverted index, prefix search, and async indexing pipelines.
  • Interview one-liner: Search is two systems: fast retrieval and smart ranking.

️ Architecture Overview

javascript
QUBITS OF DPK
1User types "Deepak software engineer"
234Typeahead Service (suggestions as you type)
567Search Service
8  ├── Query parsing
9  ├── Inverted index lookup
10  ├── Ranking (relevance + network proximity)
11  └── Return ranked results

Core Concepts

Inverted Index — The Heart of Search

javascript
QUBITS OF DPK
1Forward index (naive):
2  Doc1: "Deepak is a software engineer"
3  Doc2: "Alice is a data engineer"
4  Doc3: "Bob is a software architect"
5
6Inverted index (how search actually works):
7  "deepak"[Doc1]
8  "software"[Doc1, Doc3]
9  "engineer"[Doc1, Doc2]
10  "alice"[Doc2]
11  "architect"[Doc3]
12
13Query: "software engineer"
14  Lookup "software"[Doc1, Doc3]
15  Lookup "engineer"[Doc1, Doc2]
16  Intersection[Doc1]  ← answer in O(1) per term

Typeahead / Autocomplete at Scale

javascript
QUBITS OF DPK
1User types "soft":
2  Option 1: Trie data structure
3    Root → s → so → sof → soft → software*, softball*...
4    Traverse to "soft" node → return all descendants
5    O(prefix_length) lookup
6    Problem: 900M users → trie too large for RAM
7
8  Option 2: Redis prefix search
9    Store all prefixes: "s", "so", "sof", "soft", "softw"...
10    ZRANGEBYLEX to find suggestions
11    Fast but high memory usage
12
13  Option 3: Elasticsearch prefix query
14    Pre-analyzed terms stored in inverted index
15    Most scalable for large user bases
16    LinkedIn's approach

Ranking Algorithm

javascript
QUBITS OF DPK
1Raw search: 10,000 "software engineers" match query
2Ranking factors LinkedIn uses:
3
4  Relevance score (TF-IDF / BM25):
5    How many times does term appear in profile?
6    How rare is the term across all profiles?
7
8  Network proximity:
9    1st degree connections ranked highest
10    2nd degree next
11    Out-of-network last
12
13  Engagement signals:
14    Profile completion %
15    Recent activity
16    InMail response rate
17
18  Final score = weighted combination of all factors

Real-Time Index Updates

javascript
QUBITS OF DPK
1User updates their profile:
2  Write to DB → publish event to Kafka
345  Indexing Service consumes event
678  Update Elasticsearch index
91011  New profile searchable within seconds
12
13Not instantaneous (eventual consistency in search is acceptable)

Scale at LinkedIn

5 Interview Questions This Blog Unlocks

Q1. Design Google Search autocomplete / typeahead

Answer: Trie for small dataset. Elasticsearch prefix queries at scale. Cache top-N suggestions per prefix in Redis (TTL 1 hour). Pre-compute popular prefixes offline. Personalize by adding user's search history weight. Return top 5-10 suggestions ranked by global frequency + personal relevance.

Q2. What is an inverted index and how does search use it?

Answer: Maps each word to the list of documents containing it. Query lookup = find each term's posting list, intersect them, rank results. Enables O(1) per-term lookup vs O(N) full scan. Elasticsearch, Solr, and Lucene are all built on inverted indexes. Building index is expensive but lookup is extremely fast.

Q3. How would you design search for an e-commerce app like Amazon?

Answer: Inverted index for text (product name, description, brand). Faceted search for filters (price range, category, rating). Ranking: text relevance + purchase history + sponsored boost + rating. Real-time index updates via Kafka → Elasticsearch. Typeahead via prefix index. A/B test ranking algorithms continuously.

Q4. What is TF-IDF and why is it used for ranking?

Answer: TF (Term Frequency) = how often term appears in document. IDF (Inverse Document Frequency) = how rare the term is across all documents. TF-IDF = TF * IDF. "Software" appears everywhere → low IDF → low weight. "Kubernetes" is rare → high IDF → high weight. Documents with rare matching terms rank higher.

Q5. How do you keep search results fresh when data changes constantly?

Answer: Event-driven index updates. DB write → Kafka event → indexing consumer updates Elasticsearch. Accept eventual consistency for search (seconds of lag is fine). For critical updates (deleted content), use higher priority queue. For bulk updates, use offline batch reindexing during low-traffic hours.

Key Engineering Lessons