2.3. Positional Encoding

🪄 Step 1: Intuition & Motivation

  • Core Idea: Transformers are attention-based, and attention doesn’t care about order by default. It just sees a set of tokens. So if you feed it “I love pizza” or “Pizza love I,” it treats them the same — both just bags of words.

That’s a problem, because meaning depends on order.

To fix this, Transformers inject a sense of position into token representations. This process — giving the model a map of where each token lives in the sequence — is called Positional Encoding.


  • Simple Analogy: Imagine a playlist with no track numbers. You have the songs but no idea which comes first or last. Adding positional encoding is like labeling songs “Track 1, Track 2, Track 3” — now the model knows the order of things.

🌱 Step 2: Core Concept

Transformers need a way to understand sequence order without using recurrence (like RNNs) or convolutions. So, we add position information directly to embeddings — giving each token a sense of where it is in the sequence.


What’s Happening Under the Hood?

The Transformer takes a sequence of token embeddings (from words, subwords, etc.), each a vector of dimension $d_{\text{model}}$.

To add position awareness, we create a positional encoding vector of the same dimension for each position (0, 1, 2, …).

Then we simply add the positional encoding to the token embedding:

$$ \text{Input}_\text{with position} = \text{Embedding}(x_i) + \text{PositionalEncoding}(i) $$

Now, each input vector carries both semantic meaning (the word itself) and positional information (where it sits in the sequence).
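
As a minimal sketch of that addition (assuming PyTorch, with made-up sizes and token ids that aren't from the text), `pos_encoding` here is just a placeholder for whichever scheme fills it — sinusoidal or learned, both covered below:

```python
import torch

d_model, seq_len, vocab_size = 512, 4, 10000          # illustrative sizes, not from the text

token_ids = torch.tensor([7, 42, 311, 9])             # e.g. a 4-token sentence, made-up ids
embedding = torch.nn.Embedding(vocab_size, d_model)   # semantic lookup table

pos_encoding = torch.zeros(seq_len, d_model)          # placeholder; filled by sin/cos or a learned table

x = embedding(token_ids) + pos_encoding               # element-wise add, shape (seq_len, d_model)
print(x.shape)                                        # torch.Size([4, 512])
```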


Why It Works This Way

Attention treats input tokens as a set — no inherent sense of “before” or “after.” By adding position vectors, we give the model coordinates in “sequence space,” letting it know relationships like “this token is 3 steps after that one.”

The sinusoidal pattern (we’ll see soon) is designed so that these relationships are easy for the model to learn through simple arithmetic — because sine and cosine have smooth, periodic relationships.


How It Fits in ML Thinking

Positional encoding is the bridge between the set-based attention world and the sequential world of language or time-series data.

Without it, Transformers would lose order: they’d know which words appear, but not the order they appear in. With it, they can model syntax, rhythm, and time-dependent structure without recurrence.


📐 Step 3: Mathematical Foundation

Let’s understand the math gently — not to memorize formulas, but to feel why they look this way.


Sinusoidal Positional Encoding

For a position $pos$ (like 0, 1, 2, …) and dimension $i$ (from 0 to $d_{\text{model}}-1$):

$$ \text{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

$$ \text{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

Each position gets a unique, smoothly varying pattern across dimensions.

  • Even dimensions use sine.
  • Odd dimensions use cosine.

Why such a design? Because sine and cosine at different frequencies make relative positions easy to express: for any fixed offset $k$, $\text{PE}(pos+k)$ can be written as a linear transformation (a rotation) of $\text{PE}(pos)$.
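
To make that concrete, here is a tiny numeric check of my own (arbitrary frequency, position, and offset, not values from the text): for one sine/cosine pair, a rotation whose angle depends only on the offset $k$ maps $\text{PE}(pos)$ onto $\text{PE}(pos+k)$.

```python
import math
import torch

omega = 1.0 / 10000 ** (0 / 512)   # frequency of the first sin/cos pair (2i = 0, d_model = 512)
pos, k = 5.0, 3.0                  # a position and a fixed offset

pe_pos  = torch.tensor([math.sin(pos * omega),       math.cos(pos * omega)])
pe_posk = torch.tensor([math.sin((pos + k) * omega), math.cos((pos + k) * omega)])

# Rotation by angle k*omega; note the angle depends only on the offset k, not on pos
rot = torch.tensor([[math.cos(k * omega),  math.sin(k * omega)],
                    [-math.sin(k * omega), math.cos(k * omega)]])

print(rot @ pe_pos)   # ≈ pe_posk: PE(pos + k) is a linear function of PE(pos)
print(pe_posk)
```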

Imagine each dimension as a musical note with a different frequency. Together, they create a “melody” unique to each position — and the model learns to recognize patterns in those melodies.
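
Here is a small sketch of the formula in code (assuming PyTorch and an even $d_{\text{model}}$; the function name is just illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the (seq_len, d_model) table of sin/cos positional encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)          # the 2i values: 0, 2, 4, ...
    angles = positions / torch.pow(10000.0, even_dims / d_model)          # pos / 10000^(2i/d_model)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # torch.Size([50, 512])
```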

Periodicity & Smooth Transitions

The reason for using different frequencies (via $10000^{2i/d_{\text{model}}}$) is to allow the encoding to represent both short-range and long-range relationships.

Early dimensions (small denominators, high frequency) vary rapidly → capture local order. Later dimensions (large denominators, low frequency) vary slowly → capture global order.

Thus, positions 5 and 6 look nearly identical in the low-frequency bands but clearly different in the high-frequency bands, giving the model a multi-scale sense of order, as the quick check below illustrates.
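
A quick check, reusing the `sinusoidal_positional_encoding` sketch above (the dimensions chosen and the approximate values are my own illustration):

```python
pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)
diff = (pe[5] - pe[6]).abs()

print(diff[:2])    # first (high-frequency) dimensions: clearly different, around 0.68
print(diff[-2:])   # last (low-frequency) dimensions: nearly identical, about 1e-4 or smaller
```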


Learned Positional Embeddings

Instead of fixed sinusoids, we can make position vectors trainable — letting the model learn whatever encoding best fits the data.

This is done just like word embeddings:

$$ \text{PE}[pos] = W_{\text{pos}}[pos] $$

where $W_{\text{pos}}$ is a trainable matrix of size (maximum sequence length × $d_{\text{model}}$).
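
A minimal sketch of the learned variant, assuming PyTorch; `max_len` and the other names are illustrative, not from the text:

```python
import torch

max_len, d_model = 512, 64                             # assumed maximum length and width
pos_embedding = torch.nn.Embedding(max_len, d_model)   # trainable W_pos, shape (max_len, d_model)

positions = torch.arange(20)                           # positions 0..19 for a 20-token input
pe = pos_embedding(positions)                          # looked up exactly like word embeddings

# pe is added to the token embeddings, and W_pos is updated by backpropagation,
# so the model is free to learn whatever positional pattern fits the data.
print(pe.shape)                                        # torch.Size([20, 64])
```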

Empirically:

  • Learned embeddings can adapt better to domain-specific order (like text syntax).
  • Sinusoidal encodings generalize better to longer unseen sequences (since they follow a continuous formula).

Fixed encodings = mathematical compass 🧭 (can extrapolate far). Learned encodings = custom GPS 🗺️ (works best where trained).

🧠 Step 4: Key Ideas

  • Self-attention alone is order-agnostic — it doesn’t know token sequence.
  • Positional encoding injects order using trigonometric or learned patterns.
  • Sine–cosine encodings allow relative position comparison through arithmetic.
  • Learned encodings adapt to data but may not generalize beyond training length.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Gives Transformers sequence awareness.
  • Sinusoidal encodings generalize well to longer inputs.
  • Adds no recurrence overhead — remains parallelizable.

Limitations:

  • Fixed encodings are inflexible for specific patterns.
  • Learned encodings can’t extrapolate to unseen positions.
  • Adds complexity when mixing with architectures like relative attention.

Fixed vs. Learned → Generalization vs. Adaptation. Fixed is like a universal clock; learned is like a local calendar — choose depending on whether you expect to “travel” beyond known sequence lengths.

🚧 Step 6: Common Misunderstandings

  • “Self-attention already knows order because it looks at all tokens.” No — it looks at all tokens equally, without sequence awareness.
  • “Sine–cosine is arbitrary math.” Not arbitrary — it gives the model a continuous, distance-preserving coordinate system.
  • “Learned encodings always outperform sinusoidal ones.” Only when test sequences resemble training ones; otherwise, fixed encodings generalize better.

🧩 Step 7: Mini Summary

🧠 What You Learned: Positional encoding gives Transformers a sense of word order that attention alone lacks.

⚙️ How It Works: Adds a position-dependent vector (sinusoidal or learned) to each token embedding.

🎯 Why It Matters: Without it, Transformers would behave like bag-of-words models — powerful but order-blind.
