2.4. Embeddings


πŸ“ Flashcards

⚑ Short Theories

Embeddings are compressed representations of entities that capture semantic similarity in a continuous vector space.

Word2vec learns word embeddings using CBOW (predict word from context) or Skip-gram (predict context from word).

Contextual embeddings (ELMo, BERT) generate different vectors for the same word depending on usage.

Autoencoders can produce image embeddings by compressing input pixels into lower-dimensional latent codes.

Two-tower neural networks map interacting entities (e.g., users and items) into the same vector space for ranking and retrieval.

Transfer learning with embeddings allows reuse across tasks, reducing data requirements.


🎀 Interview Q&A

Q1: What are embeddings and why do we use them in machine learning?

🎯 TL;DR: Embeddings map entities into dense vectors capturing semantic relationships, making ML models more effective and efficient.


🌱 Conceptual Explanation

Embeddings are like compressing a complex object (word, image, user) into a short “summary vector” that captures its meaning. This makes it easier for models to compare entities and learn patterns.

πŸ“ Technical / Math Details

  • Embeddings are vectors $e \in \mathbb{R}^d$, where $d$ is much smaller than the original feature dimension.
  • Learned by neural networks that optimize a similarity-based objective (e.g., softmax over dot products).
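
For intuition, here is a minimal lookup-table sketch in NumPy; the vocabulary, the dimension $d$, and the random vectors are illustrative stand-ins for a learned embedding matrix:

```python
import numpy as np

# Hypothetical 3-word vocabulary and a d-dimensional embedding table (one row per entity).
vocab = {"king": 0, "queen": 1, "apple": 2}
d = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # in practice these rows are learned, not random

def embed(word: str) -> np.ndarray:
    """Look up the dense vector for a word."""
    return E[vocab[word]]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With trained embeddings, semantically related words would score close to 1.
print(cosine(embed("king"), embed("queen")))
```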

βš–οΈ Trade-offs & Production Notes

  • Lower dimensions β†’ faster inference, but risk of information loss.
  • Pre-trained embeddings save time but may not fit domain-specific nuances.

🚨 Common Pitfalls

  • Blindly reusing embeddings without fine-tuning.
  • Overfitting when embedding dimension is too large.

πŸ—£ Interview-ready Answer

“Embeddings are dense vectors that represent entities in a way that captures similarity. They’re widely used because they compress complex inputs into meaningful representations for ML models.”


Q2: Explain CBOW and Skip-gram in Word2vec.

🎯 TL;DR: CBOW predicts a word from context; Skip-gram predicts context words from a target word.


🌱 Conceptual Explanation

  • CBOW: Takes nearby words to guess the missing word (good for small datasets).
  • Skip-gram: Uses one word to predict surrounding words (good for large datasets).

πŸ“ Technical / Math Details

  • CBOW Loss:
    $$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
  • Skip-gram Loss:
    $$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
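
As a quick illustration (an addition, not part of the original math), both modes can be trained with gensim, assuming gensim ≥ 4.0 and a toy corpus; `sg` switches between CBOW and Skip-gram, `window` sets the context size, and `negative` enables negative sampling:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus would be far larger).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict word from context); sg=1 -> Skip-gram (predict context from word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, negative=5)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

# Nearest neighbours of "cat" in the learned vector space.
print(skipgram.wv.most_similar("cat", topn=3))
```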

βš–οΈ Trade-offs & Production Notes

  • CBOW is faster, requires less data.
  • Skip-gram captures rare words better but is slower.

🚨 Common Pitfalls

  • Ignoring window size tuning.
  • Using raw probabilities instead of negative sampling.
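
For reference on that last pitfall: the “raw probability” is a full softmax over the whole vocabulary of size $W$, which is too expensive to compute at every step, so negative sampling replaces it with $k$ binary terms over sampled noise words (notation follows Mikolov et al., 2013, with input vectors $v$ and output vectors $v'$):

Full softmax:
$$ p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} $$
Negative-sampling objective (maximized in its place):
$$ \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right] $$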

πŸ—£ Interview-ready Answer

“CBOW predicts the target word from its context, while Skip-gram predicts context words from the target. CBOW is efficient, Skip-gram works better with large datasets.”


Q3: Why are contextual embeddings (ELMo, BERT) better than Word2vec?

🎯 TL;DR: Contextual embeddings adjust word vectors based on surrounding text, unlike static Word2vec.


🌱 Conceptual Explanation

Word2vec assigns a single vector per word regardless of how it is used. Contextual models adapt the vector to the sentence, so “apple” in a sentence about fruit gets a different embedding than “apple” the company.

πŸ“ Technical / Math Details

  • ELMo: Bi-LSTM over entire sequence β†’ word embedding = function of forward + backward states.
  • BERT: Transformer-based, uses attention to consider all positions jointly.
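
A short sketch of pulling contextual vectors from the public bert-base-uncased checkpoint via the Hugging Face transformers library (the library, checkpoint, and sentences are illustrative assumptions, not something prescribed by this text); the same surface word receives different embeddings in different sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence` (assumes a single word-piece)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v_fruit = word_vector("i ate an apple for lunch", "apple")
v_firm = word_vector("apple released a new phone", "apple")
# The two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(v_fruit, v_firm, dim=0))
```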

βš–οΈ Trade-offs & Production Notes

  • More accurate representations.
  • Heavier computation, higher memory.
  • Better for downstream fine-tuning.

🚨 Common Pitfalls

  • Forgetting embeddings change per input.
  • Underestimating computational cost in production.

πŸ—£ Interview-ready Answer

“Contextual embeddings like BERT adjust word vectors based on surrounding words, unlike static Word2vec, making them more accurate for NLP tasks.”


Q4: How do autoencoders generate visual embeddings?

🎯 TL;DR: Autoencoders compress images into dense vectors via an encoder-decoder architecture.


🌱 Conceptual Explanation

They squeeze high-dimensional pixel data into a smaller latent code (embedding), then reconstruct the original image to ensure the vector contains key features.

πŸ“ Technical / Math Details

  • Encoder: $x \in \mathbb{R}^n \to z \in \mathbb{R}^d$
  • Decoder: $z \to \hat{x}$
  • Loss: $L = ||x - \hat{x}||^2$
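
A minimal PyTorch sketch, with illustrative layer sizes, of an autoencoder whose bottleneck $z$ serves as the image embedding:

```python
import torch
import torch.nn as nn

# Fully connected autoencoder for flattened 28x28 images; the 32-d bottleneck z is the embedding.
class AutoEncoder(nn.Module):
    def __init__(self, n_pixels: int = 28 * 28, d: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU(), nn.Linear(256, d))
        self.decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_pixels))

    def forward(self, x):
        z = self.encoder(x)           # x in R^n  ->  z in R^d (the embedding)
        return self.decoder(z), z     # reconstruction x_hat and the latent code

model = AutoEncoder()
x = torch.rand(16, 28 * 28)                    # dummy batch standing in for real images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)        # L = ||x - x_hat||^2 (averaged over the batch)
loss.backward()                                # an optimizer step would follow in training
```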

βš–οΈ Trade-offs & Production Notes

  • Good unsupervised feature learning.
  • Embeddings can be reused for classification/search.
  • Quality depends on bottleneck size.

🚨 Common Pitfalls

  • Too small latent size β†’ loss of detail.
  • Overcomplete autoencoders β†’ trivial identity mapping.

πŸ—£ Interview-ready Answer

“Autoencoders compress images into embeddings via the encoder, trained to minimize reconstruction loss, producing useful dense visual representations.”


Q5: What is a two-tower embedding model and where is it used?

🎯 TL;DR: Two-tower models learn embeddings for two entity types (e.g., users, items) so their dot product reflects similarity.


🌱 Conceptual Explanation

Imagine two separate networks: one encodes users, the other encodes items. If a user-item pair is positive (clicked, watched), their vectors should align closely.

πŸ“ Technical / Math Details

  • User encoder: $u = f(x_u)$
  • Item encoder: $v = g(x_v)$
  • Loss (hinge-style, so minimizing it pushes positive-pair scores above negative-pair scores):
    $$ L = \max\left(0, \sum_{(u,v)\notin A} u \cdot v - \sum_{(u,v)\in A} u \cdot v\right) $$
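
A minimal PyTorch sketch of the two towers under these definitions; the feature sizes, the sampled negatives, and the margin of 1 are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Two separate encoders mapping user and item features into the same 32-d space.
d = 32
user_tower = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, d))   # u = f(x_u)
item_tower = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, d))    # v = g(x_v)

x_user = torch.rand(256, 100)    # batch of user feature vectors
x_pos = torch.rand(256, 80)      # items the users interacted with (positives)
x_neg = torch.rand(256, 80)      # randomly sampled items (negatives)

u = user_tower(x_user)
v_pos, v_neg = item_tower(x_pos), item_tower(x_neg)

pos_score = (u * v_pos).sum(dim=1)   # u . v for positive pairs
neg_score = (u * v_neg).sum(dim=1)   # u . v for sampled negatives
# Hinge loss with margin 1: positives must outscore negatives by at least the margin.
loss = torch.clamp(1.0 - pos_score + neg_score, min=0).mean()
loss.backward()
```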

βš–οΈ Trade-offs & Production Notes

  • Efficient retrieval with vector search.
  • Scales well for large catalogs.
  • Requires good negative sampling.
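
To make the retrieval bullet concrete: item embeddings are typically indexed offline and searched by (approximate) nearest neighbours at request time. A minimal sketch with FAISS, assuming the faiss-cpu package and random vectors standing in for tower outputs:

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d = 32
item_vectors = np.random.rand(10_000, d).astype("float32")   # stand-in for item-tower outputs
user_vector = np.random.rand(1, d).astype("float32")         # stand-in for one user-tower output

index = faiss.IndexFlatIP(d)       # exact max-inner-product search; ANN indexes scale further
index.add(item_vectors)            # built offline over the whole catalog
scores, item_ids = index.search(user_vector, 10)   # top-10 candidate items at request time
print(item_ids[0])
```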

🚨 Common Pitfalls

  • Poor negative sampling → embedding collapse (all vectors end up similar).
  • Embeddings drift without regular retraining.

πŸ—£ Interview-ready Answer

“A two-tower model separately embeds users and items, training so positive pairs have higher similarity. It’s common in ranking, retrieval, and recommendations.”


πŸ“ Key Formulas

CBOW Loss Function
$$ L = -\log p(w_t | w_{t-n}, \dots, w_{t+n}) $$
  • $w_t$: target word
  • $w_{t-n}, \dots, w_{t+n}$: context words
    Interpretation: Optimize embeddings so that context predicts the center word.
Skip-gram Loss Function
$$ L = -\log p(w_{t-n}, \dots, w_{t+n} | w_t) $$
  • $w_t$: target word
  • Context: surrounding words
    Interpretation: Optimize embeddings so that one word predicts its neighbors.
Autoencoder Loss
$$ L = ||x - \hat{x}||^2 $$
  • $x$: input
  • $\hat{x}$: reconstructed input
    Interpretation: Embedding must retain enough info to reconstruct the original input.
Two-Tower Inner Product Loss
$$ L = \max\left(0, \sum_{(u,v)\notin A} u \cdot v - \sum_{(u,v)\in A} u \cdot v\right) $$
  • $u$: user embedding
  • $v$: item embedding
  • $A$: set of positive interaction pairs
    Interpretation: Positive pairs should have higher similarity than negatives.

βœ… Cheatsheet

  • Embeddings: Dense vectors capturing semantics.
  • Word2vec: CBOW (predict word from context), Skip-gram (predict context from word).
  • Contextual: ELMo (BiLSTM), BERT (Transformers + attention).
  • Visual: Autoencoders (latent codes), CNN penultimate layers (e.g., VGG16).
  • Specialized: Task-specific embeddings during training.
  • Two-Tower: User/item embedding for retrieval, ranking.
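
The “CNN penultimate layers” entry refers to reusing a pre-trained classifier as a feature extractor. A minimal torchvision sketch (assuming torchvision ≥ 0.13 for the weights API; the dummy tensor only stands in for a normalised 224×224 RGB image):

```python
import torch
from torchvision import models

# Load VGG16 and drop its final classification layer, keeping the 4096-d penultimate output.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

x = torch.rand(1, 3, 224, 224)     # dummy tensor standing in for a pre-processed image
with torch.no_grad():
    embedding = vgg(x)             # shape: (1, 4096) image embedding
print(embedding.shape)
```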