1.3. Embeddings — The Language Geometry

🪄 Step 1: Intuition & Motivation

  • Core Idea: Tokenization gives us numbers for words — but those numbers (like 1823 or 5099) carry no meaning on their own. Embeddings fix this by turning each token ID into a vector — a point in a high-dimensional space ($\mathbb{R}^d$) — where tokens with similar meanings end up close together.

  • Simple Analogy: Think of a huge 3D galaxy where each star represents a word. Words with similar meanings (like dog, cat, puppy) form constellations near each other, while unrelated words (dog vs. keyboard) sit in distant corners of the galaxy. Embeddings are the coordinates of these stars — giving language a geometric shape that models can reason about.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When a model receives token IDs from the tokenizer (like [27, 1032, 8814]), it looks them up in an embedding matrix — basically a big table of vectors:

| Token ID | Embedding Vector (simplified) |
|---|---|
| 27 (“I”) | [0.12, 0.53, -0.41, …] |
| 1032 (“love”) | [-0.88, 0.34, 0.05, …] |
| 8814 (“Trans”) | [0.61, -0.77, 0.02, …] |

Each row corresponds to a token, and each column represents one “dimension” of meaning.

So when you input "I love Transformers", the model doesn’t see text — it sees a sequence of vectors. These vectors encode semantic relationships that mathematical operations can manipulate.
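
In code, this lookup is just an indexed read into a matrix. Below is a minimal PyTorch sketch; the vocabulary size, dimension, and resulting values here are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 50,000-token vocabulary, 768 dimensions per vector.
vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # the "big table of vectors"

token_ids = torch.tensor([27, 1032, 8814])      # IDs produced by the tokenizer
vectors = embedding(token_ids)                  # row lookup, one vector per token
print(vectors.shape)                            # torch.Size([3, 768])
```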

Why It Works This Way

Plain numbers (like 100 for “apple” and 101 for “banana”) tell the model nothing about meaning. But by mapping each word to a dense vector, embeddings let the model measure similarity with geometric distance — how close two points lie in space. That’s why "king" - "man" + "woman" ≈ "queen" holds (approximately) in classic word-vector spaces: relationships between words become vector operations.
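
You can check this analogy yourself with pretrained static embeddings. A minimal sketch using gensim's downloader (assumes gensim is installed; the GloVe model is fetched on first run, and exact neighbors and scores vary by model):

```python
import gensim.downloader as api

# Pretrained 100-dimensional GloVe word vectors, downloaded on first use.
glove = api.load("glove-wiki-gigaword-100")

# "king" - "man" + "woman" ≈ ?  The top neighbor is typically "queen".
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```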

How It Fits in ML Thinking

Embeddings are the foundation of understanding for any NLP model. They act like a “semantic memory” — capturing not just word identity but also how words relate to each other. Without embeddings, a neural network couldn’t recognize that “happy” and “joyful” are similar, or that “run” and “ran” are forms of the same verb.

📐 Step 3: Mathematical Foundation

Embedding Representation

Each word or token is represented as a vector in $d$-dimensional space:

$$ E(w_i) = \mathbf{v_i} \in \mathbb{R}^d $$

Here:

  • $E(w_i)$ — embedding of token $w_i$
  • $\mathbf{v_i}$ — its vector representation
  • $d$ — embedding dimension (e.g., 768 for BERT-base, 12288 for GPT-3)

The model learns these embeddings jointly with other parameters during training — they’re not fixed dictionaries, but evolving representations optimized for language understanding.

Think of embeddings as the model’s “mental map”. Each word gets a position in a conceptual space, and distance reflects meaning: words close together are semantically or syntactically related; words far apart are not.
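
“Learned jointly” just means the embedding table is an ordinary trainable parameter. A toy sketch, with sizes and architecture invented purely for illustration:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Toy model: the embedding table is trained together with everything else."""
    def __init__(self, vocab_size=1000, d=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)   # E, shape (vocab_size, d)
        self.head = nn.Linear(d, num_classes)

    def forward(self, token_ids):
        vectors = self.embed(token_ids)            # (batch, seq_len, d)
        pooled = vectors.mean(dim=1)               # crude sentence vector
        return self.head(pooled)

model = TinyClassifier()
# The embedding rows appear in model.parameters() like any other weight,
# so the optimizer updates them during training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```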

Cosine Similarity — Measuring Word Relationships

To measure how similar two embeddings are, we use cosine similarity:

$$ \text{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} $$

  • The numerator $u \cdot v$ measures how much the two vectors “align”.
  • The denominator normalizes their lengths, so only direction matters, not magnitude.
  • The result ranges from $-1$ (opposite directions) to $1$ (same direction).

Example:

  • sim(“cat”, “dog”) ≈ 0.8 (similar)
  • sim(“cat”, “car”) ≈ 0.1 (unrelated)

We normalize because the absolute length of a vector doesn’t represent meaning — direction does. Normalizing makes comparison fair across words.
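
The formula translates directly into a few lines of NumPy. The vectors below are made-up toy embeddings, included only to show the mechanics:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """sim(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings", invented for illustration.
cat = np.array([0.8, 0.6, 0.1])
dog = np.array([0.7, 0.7, 0.2])
car = np.array([-0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))   # high: the vectors point in a similar direction
print(cosine_similarity(cat, car))   # low: the vectors point in different directions
```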

🧠 Step 4: Assumptions or Key Ideas

  • Similar meanings correspond to nearby vectors in embedding space.
  • Vector directions encode semantic relations (like gender or tense).
  • Context determines meaning — static embeddings (Word2Vec, GloVe) can’t adjust per sentence, but contextual ones (BERT, GPT) can.
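
To make that last contrast concrete, here is a hedged sketch using the Hugging Face transformers library (assumes it is installed; bert-base-uncased is downloaded on first run, and the snippet assumes the target word survives tokenization as a single wordpiece):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                            # assumes one wordpiece

v_river = word_vector("she sat on the bank of the river", "bank")
v_money = word_vector("he deposited cash at the bank", "bank")

# Same token, two different vectors: the similarity is clearly below 1.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```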

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Captures nuanced relationships between words mathematically.
  • Enables transfer learning across tasks (via pretrained embeddings).
  • Provides a universal “language of meaning” for downstream models.

⚠️ Limitations

  • Static embeddings can’t disambiguate polysemous words (“bank” = river or finance?).
  • High-dimensional embeddings are computationally expensive.
  • Embedding tables grow linearly with vocabulary size, so large vocabularies inflate the parameter count.

⚖️ Trade-offs

  • Static vs. Contextual: static ones are lightweight but less flexible; contextual ones are powerful but computationally heavy.
  • Dimensionality: higher $d$ means richer representation but greater memory and slower inference.
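
A quick back-of-the-envelope calculation, with illustrative numbers, shows why dimensionality is a real cost:

```python
# Rough memory cost of the embedding table alone (illustrative numbers).
vocab_size = 50_000        # tokens in the vocabulary
d = 768                    # embedding dimension
bytes_per_param = 4        # float32

params = vocab_size * d                        # 38,400,000 parameters
print(params * bytes_per_param / 1e6, "MB")    # ~153.6 MB just for the table
```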

🚧 Step 6: Common Misunderstandings

  • “Embeddings are just word lookups.” ❌ They’re learned, evolving parameters that encode semantic structure.
  • “Bigger embedding dimensions always mean better performance.” ❌ Not true — dimensions that are too large can lead to overfitting or redundant features.
  • “Cosine similarity means identical words.” ❌ It measures directional closeness, not perfect equality.

🧩 Step 7: Mini Summary

🧠 What You Learned: Embeddings convert symbolic tokens into meaningful vectors in a high-dimensional space.

⚙️ How It Works: Each token is mapped to a vector that reflects its semantic and syntactic relationships with others.

🎯 Why It Matters: Embeddings are the foundation of understanding — they let LLMs reason, relate, and generalize across language.
