2.3. Positional Encoding
🪄 Step 1: Intuition & Motivation
- Core Idea: Transformers are attention-based, and attention doesn’t care about order by default. It just sees a set of tokens. So if you feed it “I love pizza” or “Pizza love I,” it treats them the same — both just bags of words.
That’s a problem, because meaning depends on order.
To fix this, Transformers inject a sense of position into token representations. This process — giving the model a map of where each token lives in the sequence — is called Positional Encoding.
- Simple Analogy: Imagine a playlist with no track numbers. You have the songs but no idea which comes first or last. Adding positional encoding is like labeling songs “Track 1, Track 2, Track 3” — now the model knows the order of things.
🌱 Step 2: Core Concept
Transformers need a way to understand sequence order without using recurrence (like RNNs) or convolutions. So, we add position information directly to embeddings — giving each token a sense of where it is in the sequence.
What’s Happening Under the Hood?
The Transformer takes a sequence of token embeddings (from words, subwords, etc.), each a vector of dimension $d_{\text{model}}$.
To add position awareness, we create a positional encoding vector of the same dimension for each position (0, 1, 2, …).
Then we simply add the positional encoding to the token embedding:
$$ \text{Input}_{\text{with position}} = \text{Embedding}(x_i) + \text{PositionalEncoding}(i) $$
Now, each input vector carries both semantic meaning (the word itself) and positional information (where it sits in the sentence).
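To make the shapes concrete, here is a minimal NumPy sketch of that addition; the arrays are random stand-ins, and only the matching shapes and the element-wise sum matter:

```python
import numpy as np

seq_len, d_model = 4, 8          # toy sizes for illustration
rng = np.random.default_rng(0)

# Stand-ins for the real lookups: token embeddings from an embedding table,
# positional encodings from a sinusoidal formula or a learned table.
token_embeddings = rng.normal(size=(seq_len, d_model))
positional_encoding = rng.normal(size=(seq_len, d_model))

# The injection itself is just element-wise addition, one position vector per token.
inputs_with_position = token_embeddings + positional_encoding
print(inputs_with_position.shape)  # (4, 8): same shape, now position-aware
```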
Why It Works This Way
Attention treats input tokens as a set — no inherent sense of “before” or “after.” By adding position vectors, we give the model coordinates in “sequence space,” letting it know relationships like “this token is 3 steps after that one.”
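You can check this order-blindness directly: a bare attention layer is permutation-equivariant, so shuffling the input tokens merely shuffles the output rows. A minimal NumPy sketch, assuming identity query/key/value projections (no learned weights):

```python
import numpy as np

def attention(x):
    # Plain scaled dot-product self-attention with identity projections,
    # enough to show that the mechanism itself is order-blind.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, d_model = 8
perm = rng.permutation(5)        # scramble the token order

out = attention(x)
out_shuffled = attention(x[perm])

# Each token's output is identical; only its row moved.
print(np.allclose(out[perm], out_shuffled))  # True: attention never saw the order
```

The `True` at the end is exactly the problem positional encoding solves: without it, no output feature can depend on where a token sits.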
The sinusoidal pattern (we’ll see soon) is designed so that these relationships are easy for the model to learn through simple arithmetic — because sine and cosine have smooth, periodic relationships.
How It Fits in ML Thinking
Positional encoding is the bridge between the set-based attention world and the sequential world of language or time-series data.
Without it, Transformers would lose order — they’d be great at recognizing which words appear, but clueless about the order they appear in. With it, they can model syntax, rhythm, and time-dependent structure without recurrence.
📐 Step 3: Mathematical Foundation
Let’s understand the math gently — not to memorize formulas, but to feel why they look this way.
Sinusoidal Positional Encoding
For a position $pos$ (like 0, 1, 2, …) and dimension $i$ (from 0 to $d_{\text{model}}-1$):
$$ \text{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
$$ \text{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
Each position gets a unique, smoothly varying pattern across dimensions.
- Even dimensions use sine.
- Odd dimensions use cosine.
Why such a design? Because sine and cosine with different frequencies let the model express relative positions simply — via linear combinations.
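A short NumPy sketch of the two formulas above (the helper name `sinusoidal_positional_encoding` is just for illustration, and it assumes an even $d_{\text{model}}$):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) matrix defined by the sine/cosine formulas above."""
    positions = np.arange(seq_len)[:, None]                 # pos = 0, 1, 2, ...
    even_dims = np.arange(0, d_model, 2)[None, :]           # the 2i indices
    angles = positions / (10000 ** (even_dims / d_model))   # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))       # assumes d_model is even, for simplicity
    pe[:, 0::2] = np.sin(angles)            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```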
Periodicity & Smooth Transitions
The reason for using different frequencies (via $10000^{2i/d_{\text{model}}}$) is to allow the encoding to represent both short-range and long-range relationships.
Low-index dimensions (small denominators) vary rapidly → they capture local order. High-index dimensions vary slowly → they capture global order.
Thus, positions 5 and 6 look nearly identical in the slowly varying (low-frequency) dimensions but clearly different in the rapidly varying (high-frequency) ones — giving the model multi-scale awareness of order.
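A small self-contained check of this multi-scale behavior, comparing the encodings of positions 5 and 6 dimension by dimension (sizes here are illustrative):

```python
import numpy as np

d_model, positions = 16, np.arange(10)[:, None]
angles = positions / (10000 ** (np.arange(0, d_model, 2)[None, :] / d_model))

pe = np.zeros((10, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# Element-wise gap between the encodings of positions 5 and 6.
diff = np.abs(pe[5] - pe[6])
print(diff[:4])    # low-index, rapidly varying dimensions: clearly different
print(diff[-4:])   # high-index, slowly varying dimensions: nearly identical
```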
Learned Positional Embeddings
Instead of fixed sinusoids, we can make position vectors trainable — letting the model learn whatever encoding best fits the data.
This is done just like word embeddings:
$$ \text{PE}[pos] = W_{\text{pos}}[pos] $$
where $W_{\text{pos}}$ is a trainable matrix of size (maximum sequence length × embedding dimension).
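A minimal sketch, assuming PyTorch and illustrative sizes: the positional table is simply a second `nn.Embedding`, indexed by position rather than by token id.

```python
import torch
import torch.nn as nn

d_model, max_len, vocab_size = 512, 1024, 32000   # illustrative sizes

token_embed = nn.Embedding(vocab_size, d_model)   # word embeddings
pos_embed = nn.Embedding(max_len, d_model)        # trainable W_pos, one row per position

token_ids = torch.tensor([[12, 481, 7, 2044]])             # (batch=1, seq_len=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

# Same recipe as before: token embedding + position vector, but both are learned.
x = token_embed(token_ids) + pos_embed(positions)
print(x.shape)  # torch.Size([1, 4, 512])
```

Because the table has exactly `max_len` rows, this scheme has no natural encoding for positions at or beyond `max_len`, which is the extrapolation limitation noted below.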
Empirically:
- Learned embeddings can adapt better to domain-specific order (like text syntax).
- Sinusoidal encodings generalize better to longer unseen sequences (since they follow a continuous formula).
🧠 Step 4: Key Ideas
- Self-attention alone is order-agnostic — it doesn’t know token sequence.
- Positional encoding injects order using trigonometric or learned patterns.
- Sine–cosine encodings allow relative position comparison through simple arithmetic (see the identity just after this list).
- Learned encodings adapt to data but may not generalize beyond training length.
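The “arithmetic” in the third point is just the angle-addition identities. Writing $\omega_i = 1/10000^{2i/d_{\text{model}}}$ for the frequency of the dimension pair $(2i, 2i+1)$, the encoding at position $pos + k$ is a rotation of the encoding at position $pos$, and the rotation depends only on the offset $k$:

$$ \begin{pmatrix} \text{PE}_{(pos+k,\,2i)} \\ \text{PE}_{(pos+k,\,2i+1)} \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \text{PE}_{(pos,\,2i)} \\ \text{PE}_{(pos,\,2i+1)} \end{pmatrix} $$

So “k steps later” corresponds to one fixed linear transformation per frequency, regardless of the absolute position.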
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Gives Transformers sequence awareness.
- Sinusoidal encodings generalize well to longer inputs.
- Adds no recurrence overhead — remains parallelizable.

Limitations:
- Fixed encodings are inflexible for specific patterns.
- Learned encodings can’t extrapolate to unseen positions.
- Adds complexity when mixing with architectures like relative attention.
🚧 Step 6: Common Misunderstandings
- “Self-attention already knows order because it looks at all tokens.” No — it sees all tokens at once, but nothing in the attention computation tells it which token comes before which.
- “Sine–cosine is arbitrary math.” Not arbitrary — it gives the model a smooth, continuous coordinate system in which relative offsets are easy to recover.
- “Learned encodings always outperform sinusoidal ones.” Only when test sequences resemble training ones; otherwise, fixed encodings generalize better.
🧩 Step 7: Mini Summary
🧠 What You Learned: Positional encoding gives Transformers a sense of word order that attention alone lacks.
⚙️ How It Works: Adds a position-dependent vector (sinusoidal or learned) to each token embedding.
🎯 Why It Matters: Without it, Transformers would behave like bag-of-words models — powerful but order-blind.