2.4. Embeddings


πŸ“ Flashcards

⚑ Short Theories

Embeddings are compressed representations of entities that capture semantic similarity in a continuous vector space.

Word2vec learns word embeddings using CBOW (predict word from context) or Skip-gram (predict context from word).

Contextual embeddings (ELMo, BERT) generate different vectors for the same word depending on usage.

Autoencoders can produce image embeddings by compressing input pixels into lower-dimensional latent codes.

Two-tower neural networks map interacting entities (e.g., users and items) into the same vector space for ranking and retrieval.

Transfer learning with embeddings allows reuse across tasks, reducing data requirements.


🎀 Interview Q&A

Q1: What are embeddings and why do we use them in machine learning?

🎯 TL;DR: Embeddings map entities into dense vectors capturing semantic relationships, making ML models more effective and efficient.


🌱 Conceptual Explanation

Embeddings are like compressing a complex object (word, image, user) into a short “summary vector” that captures its meaning. This makes it easier for models to compare entities and learn patterns.

πŸ“ Technical / Math Details

  • Embeddings are vectors $e \in \mathbb{R}^d$, where $d$ is much smaller than the original feature dimension.
  • Learned by neural networks that optimize a similarity-based objective (e.g., softmax over dot products).
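
For intuition, here is a minimal lookup-table sketch in NumPy; the vocabulary, the dimension $d$, and the random vectors are illustrative stand-ins for a learned embedding matrix:

```python
import numpy as np

# Hypothetical 3-word vocabulary and a d-dimensional embedding table (one row per entity).
vocab = {"king": 0, "queen": 1, "apple": 2}
d = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # in practice these rows are learned, not random

def embed(word: str) -> np.ndarray:
    """Look up the dense vector for a word."""
    return E[vocab[word]]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With trained embeddings, semantically related words would score close to 1.
print(cosine(embed("king"), embed("queen")))
```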

βš–οΈ Trade-offs & Production Notes

  • Lower dimensions β†’ faster inference, but risk of information loss.
  • Pre-trained embeddings save time but may not fit domain-specific nuances.

🚨 Common Pitfalls

  • Blindly reusing embeddings without fine-tuning.
  • Overfitting when embedding dimension is too large.

πŸ—£ Interview-ready Answer

“Embeddings are dense vectors that represent entities in a way that captures similarity. They’re widely used because they compress complex inputs into meaningful representations for ML models.”


Q2: Explain CBOW and Skip-gram in Word2vec.

🎯 TL;DR: CBOW predicts a word from context; Skip-gram predicts context words from a target word.


🌱 Conceptual Explanation

  • CBOW: Takes nearby words to guess the missing word (good for small datasets).
  • Skip-gram: Uses one word to predict surrounding words (good for large datasets).

πŸ“ Technical / Math Details

  • CBOW Loss:
    $$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
  • Skip-gram Loss:
    $$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
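
As a quick illustration (an addition, not part of the original math), both modes can be trained with gensim, assuming gensim ≥ 4.0 and a toy corpus; `sg` switches between CBOW and Skip-gram, `window` sets the context size, and `negative` enables negative sampling:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus would be far larger).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict word from context); sg=1 -> Skip-gram (predict context from word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, negative=5)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

# Nearest neighbours of "cat" in the learned vector space.
print(skipgram.wv.most_similar("cat", topn=3))
```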

βš–οΈ Trade-offs & Production Notes

  • CBOW is faster, requires less data.
  • Skip-gram captures rare words better but is slower.

🚨 Common Pitfalls

  • Ignoring window size tuning.
  • Using raw probabilities instead of negative sampling.
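
For reference on that last pitfall: the “raw probability” is a full softmax over the whole vocabulary of size $W$, which is too expensive to compute at every step, so negative sampling replaces it with $k$ binary terms over sampled noise words (notation follows Mikolov et al., 2013, with input vectors $v$ and output vectors $v'$):

Full softmax:
$$ p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} $$
Negative-sampling objective (maximized in its place):
$$ \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] $$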

πŸ—£ Interview-ready Answer

“CBOW predicts the target word from its context, while Skip-gram predicts context words from the target. CBOW is efficient, Skip-gram works better with large datasets.”


Q3: Why are contextual embeddings (ELMo, BERT) better than Word2vec?

🎯 TL;DR: Contextual embeddings adjust word vectors based on surrounding text, unlike static Word2vec.


🌱 Conceptual Explanation

Word2vec assigns a single vector per word regardless of how it is used. Contextual models adapt the vector to the sentence, so “apple” in a sentence about fruit gets a different embedding than “apple” the company.

πŸ“ Technical / Math Details

  • ELMo: Bi-LSTM over entire sequence β†’ word embedding = function of forward + backward states.
  • BERT: Transformer-based, uses attention to consider all positions jointly.
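
A short sketch of pulling contextual vectors from the public bert-base-uncased checkpoint via the Hugging Face transformers library (the library, checkpoint, and sentences are illustrative assumptions, not something prescribed by this text); the same surface word receives different embeddings in different sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence` (assumes a single word-piece)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v_fruit = word_vector("i ate an apple for lunch", "apple")
v_firm = word_vector("apple released a new phone", "apple")
# The two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(v_fruit, v_firm, dim=0))
```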

βš–οΈ Trade-offs & Production Notes

  • More accurate representations.
  • Heavier computation, higher memory.
  • Better for downstream fine-tuning.

🚨 Common Pitfalls

  • Forgetting embeddings change per input.
  • Underestimating computational cost in production.

πŸ—£ Interview-ready Answer

“Contextual embeddings like BERT adjust word vectors based on surrounding words, unlike static Word2vec, making them more accurate for NLP tasks.”


Q4: How do autoencoders generate visual embeddings?

🎯 TL;DR: Autoencoders compress images into dense vectors via an encoder-decoder architecture.


🌱 Conceptual Explanation

They squeeze high-dimensional pixel data into a smaller latent code (embedding), then reconstruct the original image to ensure the vector contains key features.

πŸ“ Technical / Math Details

  • Encoder: $x \in \mathbb{R}^n \to z \in \mathbb{R}^d$
  • Decoder: $z \to \hat{x}$
  • Loss: $L = ||x - \hat{x}||^2$
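
A minimal PyTorch sketch, with illustrative layer sizes, of an autoencoder whose bottleneck $z$ serves as the image embedding:

```python
import torch
import torch.nn as nn

# Fully connected autoencoder for flattened 28x28 images; the 32-d bottleneck z is the embedding.
class AutoEncoder(nn.Module):
    def __init__(self, n_pixels: int = 28 * 28, d: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU(), nn.Linear(256, d))
        self.decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_pixels))

    def forward(self, x):
        z = self.encoder(x)           # x in R^n  ->  z in R^d (the embedding)
        return self.decoder(z), z     # reconstruction x_hat and the latent code

model = AutoEncoder()
x = torch.rand(16, 28 * 28)                    # dummy batch standing in for real images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)        # L = ||x - x_hat||^2 (averaged over the batch)
loss.backward()                                # an optimizer step would follow in training
```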

βš–οΈ Trade-offs & Production Notes

  • Good unsupervised feature learning.
  • Embeddings can be reused for classification/search.
  • Quality depends on bottleneck size.

🚨 Common Pitfalls

  • Too small latent size β†’ loss of detail.
  • Overcomplete autoencoders β†’ trivial identity mapping.

πŸ—£ Interview-ready Answer

“Autoencoders compress images into embeddings via the encoder, trained to minimize reconstruction loss, producing useful dense visual representations.”


Q5: What is a two-tower embedding model and where is it used?

🎯 TL;DR: Two-tower models learn embeddings for two entity types (e.g., users, items) so their dot product reflects similarity.


🌱 Conceptual Explanation

Imagine two separate networks: one encodes users, the other encodes items. If a user-item pair is positive (clicked, watched), their vectors should align closely.

πŸ“ Technical / Math Details

  • User encoder: $u = f(x_u)$
  • Item encoder: $v = g(x_v)$
  • Loss (hinge-style, so minimizing it pushes positive-pair scores above negative-pair scores):
    $$ L = \max\left(0, \sum_{(u,v)\notin A} u \cdot v - \sum_{(u,v)\in A} u \cdot v\right) $$
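
A minimal PyTorch sketch of the two towers under these definitions; the feature sizes, the sampled negatives, and the margin of 1 are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Two separate encoders mapping user and item features into the same 32-d space.
d = 32
user_tower = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, d))   # u = f(x_u)
item_tower = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, d))    # v = g(x_v)

x_user = torch.rand(256, 100)    # batch of user feature vectors
x_pos = torch.rand(256, 80)      # items the users interacted with (positives)
x_neg = torch.rand(256, 80)      # randomly sampled items (negatives)

u = user_tower(x_user)
v_pos, v_neg = item_tower(x_pos), item_tower(x_neg)

pos_score = (u * v_pos).sum(dim=1)   # u . v for positive pairs
neg_score = (u * v_neg).sum(dim=1)   # u . v for sampled negatives
# Hinge loss with margin 1: positives must outscore negatives by at least the margin.
loss = torch.clamp(1.0 - pos_score + neg_score, min=0).mean()
loss.backward()
```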

βš–οΈ Trade-offs & Production Notes

  • Efficient retrieval with vector search.
  • Scales well for large catalogs.
  • Requires good negative sampling.
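
To make the retrieval bullet concrete: item embeddings are typically indexed offline and searched by (approximate) nearest neighbours at request time. A minimal sketch with FAISS, assuming the faiss-cpu package and random vectors standing in for tower outputs:

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d = 32
item_vectors = np.random.rand(10_000, d).astype("float32")   # stand-in for item-tower outputs
user_vector = np.random.rand(1, d).astype("float32")         # stand-in for one user-tower output

index = faiss.IndexFlatIP(d)       # exact max-inner-product search; ANN indexes scale further
index.add(item_vectors)            # built offline over the whole catalog
scores, item_ids = index.search(user_vector, 10)   # top-10 candidate items at request time
print(item_ids[0])
```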

🚨 Common Pitfalls

  • Poor negative sampling → embedding collapse (all vectors end up similar).
  • Embeddings drift without regular retraining.

πŸ—£ Interview-ready Answer

“A two-tower model separately embeds users and items, training so positive pairs have higher similarity. It’s common in ranking, retrieval, and recommendations.”


πŸ“ Key Formulas

CBOW Loss Function
$$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
  • $w_t$: target word
  • $w_{t-n}, \dots, w_{t+n}$: context words
    Interpretation: Optimize embeddings so that context predicts the center word.
Skip-gram Loss Function
$$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
  • $w_t$: target word
  • Context: surrounding words
    Interpretation: Optimize embeddings so that one word predicts its neighbors.
Autoencoder Loss
$$ L = ||x - \hat{x}||^2 $$
  • $x$: input
  • $\hat{x}$: reconstructed input
    Interpretation: Embedding must retain enough info to reconstruct the original input.
Two-Tower Inner Product Loss
$$ L = \max\left(0, \sum_{(u,v)\notin A} u \cdot v - \sum_{(u,v)\in A} u \cdot v\right) $$
  • $u$: user embedding
  • $v$: item embedding
  • $A$: set of positive interaction pairs
    Interpretation: Positive pairs should have higher similarity than negatives.

βœ… Cheatsheet

  • Embeddings: Dense vectors capturing semantics.
  • Word2vec: CBOW (predict word from context), Skip-gram (predict context from word).
  • Contextual: ELMo (BiLSTM), BERT (Transformers + attention).
  • Visual: Autoencoders (latent codes), CNN penultimate layers (e.g., VGG16).
  • Specialized: Task-specific embeddings during training.
  • Two-Tower: User/item embedding for retrieval, ranking.
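
The “CNN penultimate layers” entry refers to reusing a pre-trained classifier as a feature extractor. A minimal torchvision sketch (assuming torchvision ≥ 0.13 for the weights API; the dummy tensor only stands in for a normalised 224×224 RGB image):

```python
import torch
from torchvision import models

# Load VGG16 and drop its final classification layer, keeping the 4096-d penultimate output.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

x = torch.rand(1, 3, 224, 224)     # dummy tensor standing in for a pre-processed image
with torch.no_grad():
    embedding = vgg(x)             # shape: (1, 4096) image embedding
print(embedding.shape)
```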