3.3. Vector Databases and Indexing
🪄 Step 1: Intuition & Motivation
Core Idea: Imagine you have 100 million documents, and you want to find the ones most semantically similar to a given question — in less than a second.
That’s the challenge RAG systems face daily. While embedding models give us meaningful vectors, the real trick is finding the right neighbors quickly in this huge vector jungle.
That’s where vector databases and indexing structures come in — they organize these high-dimensional points in a way that balances speed, accuracy, and scalability.
Simple Analogy: Think of embeddings as stars in a galaxy. 🌌 You’re looking for the stars closest to a new one you just placed. Instead of checking the distance to every single star (which would take forever), vector databases use clever maps — indexes — to zoom into the right neighborhood instantly.
🌱 Step 2: Core Concept
Let’s break down the key ideas:
- What vector databases are.
- The indexing methods they use.
- How to tune for recall and latency.
1️⃣ What Are Vector Databases?
A vector database is designed to store and search embeddings (numerical vectors). Unlike traditional databases that use SQL for exact matches, vector databases perform similarity search — finding the nearest neighbors of a query vector.
When a user asks a question:
- Convert it to an embedding (`q_vec`).
- Compare it with the stored document embeddings (`d_vec`).
- Return the top-k most similar vectors based on cosine similarity or dot product.
But doing this comparison naively (a.k.a. brute force search) for millions of vectors is slow — O(n) per query. So databases use indexing to speed this up.
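To make that O(n) baseline concrete, here is a minimal brute-force top-k search sketch in NumPy — random vectors stand in for real embeddings, and the dimension and corpus size are arbitrary assumptions:

```python
import numpy as np

def top_k_cosine(q_vec: np.ndarray, d_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force top-k search: score the query against every stored vector (O(n) per query)."""
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity with every document
    return np.argsort(-scores)[:k]     # indices of the k most similar documents

# Toy corpus of 10k random "document embeddings" (placeholder for real embeddings)
docs = np.random.rand(10_000, 384).astype("float32")
query = np.random.rand(384).astype("float32")
print(top_k_cosine(query, docs, k=3))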
Popular vector stores:
- 🧠 FAISS (Facebook AI Similarity Search) — open-source, fast, reliable.
- ⚡ Milvus — distributed and highly scalable.
- 🧩 Chroma — simple and lightweight for local dev.
- ☁️ Pinecone — managed cloud service.
- 🌐 Weaviate — supports hybrid (text + vector) search and modular pipelines.
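For a sense of how simple the developer workflow can be, here is a minimal local sketch with Chroma — the collection name and documents are made up, and Chroma's default embedding function and API defaults may vary by version:

```python
import chromadb

# In-memory client for local experimentation
client = chromadb.Client()
collection = client.create_collection(name="docs")  # hypothetical collection name

collection.add(
    ids=["d1", "d2"],
    documents=[
        "RAG retrieves relevant context before the LLM generates an answer.",
        "HNSW builds a navigable graph over embeddings for fast approximate search.",
    ],
)

# Chroma embeds the query with its default embedding function and returns the nearest documents
results = collection.query(query_texts=["How does approximate search work?"], n_results=1)
print(results["documents"])
```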
2️⃣ Indexing Structures — How Fast Search Works
Indexes determine how a database organizes embeddings to make searches efficient. Let’s explore the most common types:
| Index Type | Description | Pros | Cons |
|---|---|---|---|
| Flat (Brute-Force) | Computes distance to all vectors. | 100% accuracy. | Very slow for large datasets. |
| IVF (Inverted File Index) | Divides vectors into clusters; searches only the most relevant clusters. | Fast, good recall trade-off. | May skip some close neighbors. |
| HNSW (Hierarchical Navigable Small World) | Builds a graph where similar vectors are connected by edges; uses greedy traversal for search. | Extremely fast; excellent recall. | More memory usage; complex index building. |
| PQ (Product Quantization) | Compresses vectors into compact codes; often combined with IVF. | Saves storage; fast. | Slight loss of precision. |
Practical Combination:
For large-scale RAG systems, use IVF + PQ (efficient storage + speed) or HNSW (low-latency retrieval).
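As a rough sketch of that combination in FAISS — random vectors stand in for real embeddings, and the dimension, `nlist`, and `m` values are assumptions you would tune to your own data:

```python
import faiss
import numpy as np

d = 768                                              # embedding dimension (assumption)
xb = np.random.rand(100_000, d).astype("float32")    # placeholder corpus embeddings

# Option A: IVF + PQ — compressed storage with a good speed/recall trade-off
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 16, 8)  # nlist=1024, m=16 subspaces, 8 bits each
ivfpq.train(xb)                                      # learn cluster centroids and PQ codebooks
ivfpq.add(xb)

# Option B: HNSW — graph-based index for low-latency retrieval (more memory)
hnsw = faiss.IndexHNSWFlat(d, 32)                    # M = 32 edges per node
hnsw.add(xb)

xq = np.random.rand(1, d).astype("float32")          # placeholder query embedding
distances, ids = ivfpq.search(xq, 5)                 # top-5 approximate neighbors
```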
3️⃣ Recall vs. Latency — The Eternal Tug-of-War
There’s no free lunch. You can have perfect recall (find all true neighbors) or instant speed, but rarely both.
| Mode | Recall | Latency | Description |
|---|---|---|---|
| Exact Search (Flat) | 100% | High | Checks all vectors. |
| Approximate Search (ANN) | ~95–99% | Low | Skips less relevant regions. |
In practice, ~95% recall at 10× the speed is usually well worth it in production. After all, your LLM doesn’t need every document — just the most relevant ones, retrieved fast enough to reason over them.
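One way to see where you land on this curve is to measure recall@k of an approximate index against an exact flat index on the same data — a small sketch with random placeholder vectors and illustrative parameter values:

```python
import faiss
import numpy as np

d, nb, nq, k = 128, 50_000, 100, 10
xb = np.random.rand(nb, d).astype("float32")   # placeholder corpus
xq = np.random.rand(nq, d).astype("float32")   # placeholder queries

# Ground truth from exact (flat) search
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, ground_truth = flat.search(xq, k)

# Approximate search with an IVF index
coarse = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(coarse, d, 256)       # nlist = 256
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8
_, approx = ivf.search(xq, k)

# recall@k: fraction of true neighbors the approximate search also returned
recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, ground_truth)])
print(f"recall@{k}: {recall:.3f}")
```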
4️⃣ Key Tuning Parameters
Every indexing method exposes knobs that control the balance between accuracy and speed:
🔹 FAISS (IVF):
- `nlist` → number of clusters (more = finer partitioning).
- `nprobe` → number of clusters to search (higher = more recall, slower).
🔹 HNSW:
- `efConstruction` → how many neighbors to consider during index build (higher = better graph).
- `efSearch` → how many neighbors to explore during query (higher = higher recall, slower).
- `M` → number of edges per node (connectivity).
🔹 PQ (Product Quantization):
- `m` → number of subspaces.
- `k` → number of centroids per subspace.
- Tuning these compresses vectors efficiently but may lose fine detail.
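A minimal sketch of how these knobs map onto the FAISS Python API — the dimension and parameter values are illustrative assumptions, not recommendations:

```python
import faiss

d = 768  # embedding dimension (assumption — match your embedding model)

# IVF knobs: nlist at build time, nprobe at query time
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 4096)  # nlist = 4096 clusters
ivf.nprobe = 16                               # clusters visited per query

# HNSW knobs: M at build time, efConstruction / efSearch on the graph
hnsw = faiss.IndexHNSWFlat(d, 32)             # M = 32 edges per node
hnsw.hnsw.efConstruction = 200                # candidate list size during build
hnsw.hnsw.efSearch = 64                       # candidate list size during query

# PQ knobs: m subspaces, each encoded with nbits (2^nbits centroids per subspace)
pq = faiss.IndexPQ(d, 16, 8)                  # m = 16, 256 centroids per subspace
```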
Example:
For a 100M document corpus at 1000 QPS (queries per second):
- Use HNSW (for speed) or IVF+PQ (for memory balance).
- Tune `efSearch = 64` (HNSW) or `nprobe = 16` (IVF) as starting points.
- Add caching for frequent queries.
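One common way to implement the caching bullet is a small LRU cache keyed on the normalized query text — an illustrative sketch, not tied to any particular vector store:

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for frequent queries, keyed on normalized query text (illustrative)."""

    def __init__(self, max_size: int = 10_000):
        self._store: OrderedDict[str, list] = OrderedDict()
        self.max_size = max_size

    def get(self, query: str):
        key = query.strip().lower()
        if key in self._store:
            self._store.move_to_end(key)       # mark as recently used
            return self._store[key]
        return None

    def put(self, query: str, results: list):
        key = query.strip().lower()
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)    # evict the least recently used entry
```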
📐 Step 3: Mathematical Foundation
Distance & Similarity Metrics
The core of vector search lies in distance metrics.
Given vectors $A$ and $B$:
- Cosine Similarity: $$ \text{sim}(A,B) = \frac{A \cdot B}{||A|| \, ||B||} $$
- Euclidean Distance: $$ d(A,B) = ||A - B|| $$
- Inner Product (Dot Product): $$ A \cdot B = \sum_i A_i B_i $$
Vector databases often transform similarity into distance so that lower = closer. For example, FAISS handles cosine similarity by L2-normalizing vectors and searching with the inner product, since the two are equivalent on unit-length vectors.
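A quick numerical check of that equivalence — on unit-length vectors the inner product and cosine similarity coincide (random vectors used as placeholders):

```python
import numpy as np

a = np.random.rand(768).astype("float32")  # placeholder embeddings
b = np.random.rand(768).astype("float32")

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2-normalization, the inner product equals cosine similarity,
# which is why many vector stores normalize embeddings and search with dot product.
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(cosine, np.dot(a_unit, b_unit))
print(f"cosine similarity: {cosine:.4f}")
```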
🧠 Step 4: Key Ideas & Assumptions
- Embeddings are stored and searched in high-dimensional space.
- Index structures approximate nearest neighbors to reduce latency.
- Retrieval speed and recall must be balanced.
- The same embedding model should be used for documents and queries.
- Caching and hybrid (text + vector) search improve performance in production.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths:
- Enables lightning-fast semantic search across billions of vectors.
- Scalable to high-throughput applications (RAG, QA, search).
- Highly tunable for precision vs. performance.
⚠️ Limitations:
- ANN search can miss some relevant results.
- Memory-heavy (especially for graph-based indexes like HNSW).
- Requires careful tuning to match domain and data distribution.
⚖️ Trade-offs:
- Recall vs. Latency: More recall = slower retrieval.
- Memory vs. Accuracy: Compressed indexes (PQ) save space but lose precision.
- Flexibility vs. Control: Cloud-managed DBs (Pinecone) are easier, but open-source tools (FAISS) offer full tuning.
🚧 Step 6: Common Misunderstandings
- “Vector databases are just regular databases.” → No, they specialize in high-dimensional similarity search.
- “Approximate search means poor accuracy.” → Not true; 95–99% recall is achievable with good tuning.
- “Bigger indexes always perform better.” → Not necessarily — over-indexing can waste memory and slow down updates.
🧩 Step 7: Mini Summary
🧠 What You Learned: Vector databases store embeddings and use indexing algorithms (like IVF or HNSW) to perform lightning-fast similarity searches at scale.
⚙️ How It Works: Instead of brute-force comparing all vectors, the system organizes them into clusters or graphs, searching only the most promising neighborhoods using approximate nearest neighbor algorithms.
🎯 Why It Matters: Efficient vector search is the engine of RAG systems — it determines how fast and accurately your LLM can retrieve relevant context, especially in enterprise-scale knowledge bases.