2.4. Embeddings
📚 Flashcards
⚡ Short Theories
Embeddings are compressed representations of entities that capture semantic similarity in a continuous vector space.
Word2vec learns word embeddings using CBOW (predict word from context) or Skip-gram (predict context from word).
Contextual embeddings (ELMo, BERT) generate different vectors for the same word depending on usage.
Autoencoders can produce image embeddings by compressing input pixels into lower-dimensional latent codes.
Two-tower neural networks map interacting entities (e.g., users and items) into the same vector space for ranking and retrieval.
Transfer learning with embeddings allows reuse across tasks, reducing data requirements.
🤖 Interview Q&A
Q1: What are embeddings and why do we use them in machine learning?
🎯 TL;DR: Embeddings map entities into dense vectors capturing semantic relationships, making ML models more effective and efficient.
🌱 Conceptual Explanation
Embeddings are like compressing a complex object (word, image, user) into a short “summary vector” that captures its meaning. This makes it easier for models to compare entities and learn patterns.
📐 Technical / Math Details
- Embeddings are vectors $e \in \mathbb{R}^d$, with $d$ much smaller than the original feature dimensionality.
- Learned by neural networks optimizing a similarity-based loss (e.g., dot products fed into a softmax); see the sketch below.
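A minimal sketch of the mechanics (NumPy, with a random toy table standing in for learned weights): an embedding is just a row lookup in an $N \times d$ table, and similarity is read off with a dot product or cosine.
```python
import numpy as np

# Toy embedding table: 5 entities, d = 4 (values are random stand-ins, not learned).
np.random.seed(0)
embedding_table = np.random.randn(5, 4).astype(np.float32)

def embed(entity_id: int) -> np.ndarray:
    """Look up the dense vector for an entity id."""
    return embedding_table[entity_id]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in embedding space; higher means 'more related'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed(0), embed(1)))
```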
⚖️ Trade-offs & Production Notes
- Lower dimensions → faster inference, but risk of information loss.
- Pre-trained embeddings save time but may not fit domain-specific nuances.
🚨 Common Pitfalls
- Blindly reusing embeddings without fine-tuning.
- Overfitting when embedding dimension is too large.
🗣 Interview-ready Answer
“Embeddings are dense vectors that represent entities in a way that captures similarity. They’re widely used because they compress complex inputs into meaningful representations for ML models.”
Q2: Explain CBOW and Skip-gram in Word2vec.
🎯 TL;DR: CBOW predicts a word from context; Skip-gram predicts context words from a target word.
🌱 Conceptual Explanation
- CBOW: Takes nearby words to guess the missing word (good for small datasets).
- Skip-gram: Uses one word to predict surrounding words (good for large datasets).
📐 Technical / Math Details
- CBOW Loss:
$$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
- Skip-gram Loss:
$$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
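A small training sketch contrasting the two objectives, assuming the gensim library (not named above); the toy corpus and hyperparameters are illustrative only.
```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW: predict the center word from its context window.
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, negative=5, min_count=1)

# sg=1 -> Skip-gram: predict context words from the center word.
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, negative=5, min_count=1)

print(cbow.wv["cat"].shape)                      # (50,) dense word vector
print(skipgram.wv.most_similar("cat", topn=2))   # nearest neighbors in embedding space
```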
⚖️ Trade-offs & Production Notes
- CBOW is faster, requires less data.
- Skip-gram captures rare words better but is slower.
🚨 Common Pitfalls
- Ignoring window size tuning.
- Computing the full softmax over the vocabulary instead of using negative sampling (or hierarchical softmax).
🗣 Interview-ready Answer
“CBOW predicts the target word from its context, while Skip-gram predicts context words from the target. CBOW is efficient, Skip-gram works better with large datasets.”
Q3: Why are contextual embeddings (ELMo, BERT) better than Word2vec?
🎯 TL;DR: Contextual embeddings adjust word vectors based on surrounding text, unlike static Word2vec.
🌱 Conceptual Explanation
Word2vec gives one vector per word regardless of meaning. Contextual models adapt vectors dynamically, so “apple” in “fruit” ≠ “apple” in “company.”
📐 Technical / Math Details
- ELMo: Bi-LSTM over entire sequence → word embedding = function of forward + backward states.
- BERT: Transformer-based, uses attention to consider all positions jointly.
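A sketch of how the same word gets different vectors depending on context, using the Hugging Face transformers library as one possible toolkit (an assumption, not prescribed above):
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Contextual vector of the first sub-token matching `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_fruit = word_vector("i ate an apple for lunch", "apple")
v_company = word_vector("apple released a new phone", "apple")
# Same word, different vectors: cosine similarity is well below 1.
print(torch.cosine_similarity(v_fruit, v_company, dim=0).item())
```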
⚖️ Trade-offs & Production Notes
- More accurate representations.
- Heavier computation, higher memory.
- Better for downstream fine-tuning.
🚨 Common Pitfalls
- Forgetting embeddings change per input.
- Underestimating computational cost in production.
🗣 Interview-ready Answer
“Contextual embeddings like BERT adjust word vectors based on surrounding words, unlike static Word2vec, making them more accurate for NLP tasks.”
Q4: How do autoencoders generate visual embeddings?
🎯 TL;DR: Autoencoders compress images into dense vectors via an encoder-decoder architecture.
🌱 Conceptual Explanation
They squeeze high-dimensional pixel data into a smaller latent code (embedding), then reconstruct the original image to ensure the vector contains key features.
📐 Technical / Math Details
- Encoder: $x \in \mathbb{R}^n \to z \in \mathbb{R}^d$
- Decoder: $z \to \hat{x}$
- Loss: $L = ||x - \hat{x}||^2$
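A minimal PyTorch sketch of this encoder-decoder setup; the layer sizes (784 → 32) and the random batch are illustrative assumptions, not values from the text.
```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_in: int = 784, d: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, d))
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n_in))

    def forward(self, x):
        z = self.encoder(x)           # z is the embedding (latent code)
        return self.decoder(z), z

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)               # stand-in batch of flattened images

for _ in range(5):                    # a few toy training steps
    x_hat, z = model(x)
    loss = ((x - x_hat) ** 2).mean()  # reconstruction loss L = ||x - x_hat||^2
    opt.zero_grad()
    loss.backward()
    opt.step()

embeddings = model.encoder(x)         # reusable 32-d visual embeddings
print(embeddings.shape)               # torch.Size([64, 32])
```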
⚖️ Trade-offs & Production Notes
- Good unsupervised feature learning.
- Embeddings can be reused for classification/search.
- Quality depends on bottleneck size.
🚨 Common Pitfalls
- Too small latent size → loss of detail.
- Overcomplete autoencoders → trivial identity mapping.
🗣 Interview-ready Answer
“Autoencoders compress images into embeddings via the encoder, trained to minimize reconstruction loss, producing useful dense visual representations.”
Q5: What is a two-tower embedding model and where is it used?
🎯 TL;DR: Two-tower models learn embeddings for two entity types (e.g., users, items) so their dot product reflects similarity.
🌱 Conceptual Explanation
Imagine two separate networks: one encodes users, the other encodes items. If a user-item pair is positive (clicked, watched), their vectors should align closely.
📐 Technical / Math Details
- User encoder: $u = f(x_u)$
- Item encoder: $v = g(x_v)$
- Hinge loss over positive pairs $(u, v^+) \in A$ and sampled negatives $v^-$, with margin $m$:
$$ L = \sum_{(u, v^+) \in A} \max(0,\ m - u \cdot v^+ + u \cdot v^-) $$
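A minimal PyTorch sketch of the two towers trained with the hinge loss above; the feature sizes, margin, and random data are illustrative assumptions.
```python
import torch
import torch.nn as nn

def tower(n_in: int, d: int = 32) -> nn.Module:
    """A small MLP that maps raw features to a d-dimensional embedding."""
    return nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, d))

user_tower, item_tower = tower(n_in=20), tower(n_in=50)

users = torch.rand(256, 20)           # user features x_u
pos_items = torch.rand(256, 50)       # items with observed (positive) interactions
neg_items = torch.rand(256, 50)       # sampled negatives

u = user_tower(users)                 # u = f(x_u)
v_pos = item_tower(pos_items)         # v+ = g(x_v)
v_neg = item_tower(neg_items)         # v- = g(x_v)

margin = 1.0
pos_score = (u * v_pos).sum(dim=1)    # u . v+
neg_score = (u * v_neg).sum(dim=1)    # u . v-
loss = torch.clamp(margin - pos_score + neg_score, min=0).mean()  # hinge loss
loss.backward()                       # gradients flow into both towers
print(loss.item())
```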
⚖️ Trade-offs & Production Notes
- Efficient retrieval with vector search (see the index sketch below).
- Scales well for large catalogs.
- Requires good negative sampling.
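At serving time, item embeddings are typically placed in a nearest-neighbor index; a sketch using FAISS as one possible vector-search library (an assumption, with made-up dimensions and random vectors):
```python
import faiss
import numpy as np

d = 32
item_vectors = np.random.rand(10_000, d).astype("float32")  # precomputed item-tower outputs

index = faiss.IndexFlatIP(d)       # exact inner-product search; swap for IVF/HNSW at scale
index.add(item_vectors)

user_vector = np.random.rand(1, d).astype("float32")        # user-tower output at request time
scores, item_ids = index.search(user_vector, 10)            # top-10 candidate items
print(item_ids[0])
```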
🚨 Common Pitfalls
- Poor negative sampling → embedding collapse.
- Embeddings drift without regular retraining.
🗣 Interview-ready Answer
“A two-tower model separately embeds users and items, training so positive pairs have higher similarity. It’s common in ranking, retrieval, and recommendations.”
📐 Key Formulas
CBOW Loss Function
$$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
- $w_t$: target word
- $w_{t-n}, \dots, w_{t+n}$: context words
Interpretation: Optimize embeddings so that context predicts the center word.
Skip-gram Loss Function
$$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
- $w_t$: target word
- Context: surrounding words
Interpretation: Optimize embeddings so that one word predicts its neighbors.
Autoencoder Loss
$$ L = ||x - \hat{x}||^2 $$
- $x$: input
- $\hat{x}$: reconstructed input
Interpretation: Embedding must retain enough info to reconstruct the original input.
Two-Tower Inner Product Loss
$$ L = \sum_{(u, v^+) \in A} \max(0,\ m - u \cdot v^+ + u \cdot v^-) $$
- $u$: user embedding
- $v^+$: embedding of a positively interacted item; $v^-$: a sampled negative item
- $A$: set of positive interaction pairs; $m$: margin
Interpretation: Positive pairs should have higher similarity than negatives.
✅ Cheatsheet
- Embeddings: Dense vectors capturing semantics.
- Word2vec: CBOW (predict word from context), Skip-gram (predict context from word).
- Contextual: ELMo (BiLSTM), BERT (Transformers + attention).
- Visual: Autoencoders (latent codes), CNN penultimate layers (e.g., VGG16).
- Specialized: Task-specific embeddings during training.
- Two-Tower: User/item embedding for retrieval, ranking.