2.3. Positional Encoding

🪄 Step 1: Intuition & Motivation

  • Core Idea: Transformers are attention-based, and attention doesn’t care about order by default. It just sees a set of tokens. So if you feed it “I love pizza” or “Pizza love I,” it treats them the same — both just bags of words.

That’s a problem, because meaning depends on order.

To fix this, Transformers inject a sense of position into token representations. This process — giving the model a map of where each token lives in the sequence — is called Positional Encoding.


  • Simple Analogy: Imagine a playlist with no track numbers. You have the songs but no idea which comes first or last. Adding positional encoding is like labeling songs “Track 1, Track 2, Track 3” — now the model knows the order of things.

🌱 Step 2: Core Concept

Transformers need a way to understand sequence order without using recurrence (like RNNs) or convolutions. So, we add position information directly to embeddings — giving each token a sense of where it is in the sequence.


What’s Happening Under the Hood?

The Transformer takes a sequence of token embeddings (from words, subwords, etc.), each a vector of dimension $d_{\text{model}}$.

To add position awareness, we create a positional encoding vector of the same dimension for each position (0, 1, 2, …).

Then we simply add the positional encoding to the token embedding:

$$ \text{Input}_\text{with position} = \text{Embedding}(x_i) + \text{PositionalEncoding}(i) $$

Now, each input vector carries both semantic meaning (the word itself) and positional information (where it sits in the sequence).
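
As a minimal sketch of that addition (assuming PyTorch, with made-up sizes and token ids that aren't from the text), `pos_encoding` here is just a placeholder for whichever scheme fills it — sinusoidal or learned, both covered below:

```python
import torch

d_model, seq_len, vocab_size = 512, 4, 10000          # illustrative sizes, not from the text

token_ids = torch.tensor([7, 42, 311, 9])             # e.g. a 4-token sentence, made-up ids
embedding = torch.nn.Embedding(vocab_size, d_model)   # semantic lookup table

pos_encoding = torch.zeros(seq_len, d_model)          # placeholder; filled by sin/cos or a learned table

x = embedding(token_ids) + pos_encoding               # element-wise add, shape (seq_len, d_model)
print(x.shape)                                        # torch.Size([4, 512])
```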


Why It Works This Way

Attention treats input tokens as a set — no inherent sense of “before” or “after.” By adding position vectors, we give the model coordinates in “sequence space,” letting it know relationships like “this token is 3 steps after that one.”

The sinusoidal pattern (we’ll see soon) is designed so that these relationships are easy for the model to learn through simple arithmetic — because sine and cosine have smooth, periodic relationships.


How It Fits in ML Thinking

Positional encoding is the bridge between the set-based attention world and the sequential world of language or time-series data.

Without it, Transformers would lose order: they’d know which words appear, but not the order they appear in. With it, they can model syntax, rhythm, and time-dependent structure without recurrence.


📐 Step 3: Mathematical Foundation

Let’s understand the math gently — not to memorize formulas, but to feel why they look this way.


Sinusoidal Positional Encoding

For a position $pos$ (like 0, 1, 2, …) and dimension $i$ (from 0 to $d_{\text{model}}-1$):

$$ \text{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

$$ \text{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

Each position gets a unique, smoothly varying pattern across dimensions.

  • Even dimensions use sine.
  • Odd dimensions use cosine.

Why such a design? Because sine and cosine at different frequencies make relative positions easy to express: for any fixed offset $k$, $\text{PE}(pos+k)$ can be written as a linear transformation (a rotation) of $\text{PE}(pos)$.
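
To make that concrete, here is a tiny numeric check of my own (arbitrary frequency, position, and offset, not values from the text): for one sine/cosine pair, a rotation whose angle depends only on the offset $k$ maps $\text{PE}(pos)$ onto $\text{PE}(pos+k)$.

```python
import math
import torch

omega = 1.0 / 10000 ** (0 / 512)   # frequency of the first sin/cos pair (2i = 0, d_model = 512)
pos, k = 5.0, 3.0                  # a position and a fixed offset

pe_pos  = torch.tensor([math.sin(pos * omega),       math.cos(pos * omega)])
pe_posk = torch.tensor([math.sin((pos + k) * omega), math.cos((pos + k) * omega)])

# Rotation by angle k*omega; note the angle depends only on the offset k, not on pos
rot = torch.tensor([[math.cos(k * omega),  math.sin(k * omega)],
                    [-math.sin(k * omega), math.cos(k * omega)]])

print(rot @ pe_pos)   # ≈ pe_posk: PE(pos + k) is a linear function of PE(pos)
print(pe_posk)
```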

Imagine each dimension as a musical note with a different frequency. Together, they create a “melody” unique to each position — and the model learns to recognize patterns in those melodies.
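
Here is a small sketch of the formula in code (assuming PyTorch and an even $d_{\text{model}}$; the function name is just illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the (seq_len, d_model) table of sin/cos positional encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)          # the 2i values: 0, 2, 4, ...
    angles = positions / torch.pow(10000.0, even_dims / d_model)          # pos / 10000^(2i/d_model)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # torch.Size([50, 512])
```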

Periodicity & Smooth Transitions

The reason for using different frequencies (via $10000^{2i/d_{\text{model}}}$) is to allow the encoding to represent both short-range and long-range relationships.

Early dimensions (small denominators, high frequency) vary rapidly → capture local order. Later dimensions (large denominators, low frequency) vary slowly → capture global order.

Thus, positions 5 and 6 look nearly identical in the low-frequency bands but clearly different in the high-frequency bands, giving the model a multi-scale sense of order, as the quick check below illustrates.
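
A quick check, reusing the `sinusoidal_positional_encoding` sketch above (the dimensions chosen and the approximate values are my own illustration):

```python
pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)
diff = (pe[5] - pe[6]).abs()

print(diff[:2])    # first (high-frequency) dimensions: clearly different, around 0.68
print(diff[-2:])   # last (low-frequency) dimensions: nearly identical, about 1e-4 or smaller
```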


Learned Positional Embeddings

Instead of fixed sinusoids, we can make position vectors trainable — letting the model learn whatever encoding best fits the data.

This is done just like word embeddings:

$$ \text{PE}[pos] = W_{\text{pos}}[pos] $$

where $W_{\text{pos}}$ is a trainable matrix of size (maximum sequence length × $d_{\text{model}}$).
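
A minimal sketch of the learned variant, assuming PyTorch; `max_len` and the other names are illustrative, not from the text:

```python
import torch

max_len, d_model = 512, 64                             # assumed maximum length and width
pos_embedding = torch.nn.Embedding(max_len, d_model)   # trainable W_pos, shape (max_len, d_model)

positions = torch.arange(20)                           # positions 0..19 for a 20-token input
pe = pos_embedding(positions)                          # looked up exactly like word embeddings

# pe is added to the token embeddings, and W_pos is updated by backpropagation,
# so the model is free to learn whatever positional pattern fits the data.
print(pe.shape)                                        # torch.Size([20, 64])
```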

Empirically:

  • Learned embeddings can adapt better to domain-specific order (like text syntax).
  • Sinusoidal encodings generalize better to longer unseen sequences (since they follow a continuous formula).

Fixed encodings = mathematical compass 🧭 (can extrapolate far). Learned encodings = custom GPS 🗺️ (works best where trained).

🧠 Step 4: Key Ideas

  • Self-attention alone is order-agnostic — it doesn’t know token sequence.
  • Positional encoding injects order using trigonometric or learned patterns.
  • Sine–cosine encodings allow relative position comparison through arithmetic.
  • Learned encodings adapt to data but may not generalize beyond training length.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Gives Transformers sequence awareness.
  • Sinusoidal encodings generalize well to longer inputs.
  • Adds no recurrence overhead — remains parallelizable.

Limitations:

  • Fixed encodings are inflexible for specific patterns.
  • Learned encodings can’t extrapolate to unseen positions.
  • Adds complexity when mixing with architectures like relative attention.

Fixed vs. Learned → Generalization vs. Adaptation. Fixed is like a universal clock; learned is like a local calendar — choose depending on whether you expect to “travel” beyond known sequence lengths.

🚧 Step 6: Common Misunderstandings

  • “Self-attention already knows order because it looks at all tokens.” No — it looks at all tokens equally, without sequence awareness.
  • “Sine–cosine is arbitrary math.” Not arbitrary — it gives the model a continuous, distance-preserving coordinate system.
  • “Learned encodings always outperform sinusoidal ones.” Only when test sequences resemble training ones; otherwise, fixed encodings generalize better.

🧩 Step 7: Mini Summary

🧠 What You Learned: Positional encoding gives Transformers a sense of word order that attention alone lacks.

⚙️ How It Works: Adds a position-dependent vector (sinusoidal or learned) to each token embedding.

🎯 Why It Matters: Without it, Transformers would behave like bag-of-words models — powerful but order-blind.
