5.2. Transition to Attention and Transformers
🪄 Step 1: Intuition & Motivation
Core Idea: RNNs were revolutionary — they gave machines the ability to process sequences instead of static data. But as we’ve seen, their sequential nature makes them slow and forgetful. The world needed a model that could:
- See the entire sequence at once, not step-by-step.
- Focus selectively on what matters most — whether that’s a nearby word or something from far back in the past.
That’s the birth of Attention, and later, Transformers — architectures that replaced recurrence with direct, parallelized communication across all time steps.
Simple Analogy: Imagine reading an entire page of a book instead of one word at a time. When you spot the word “he”, your brain instantly recalls who “he” refers to — without sequentially rereading every previous word. That’s attention in action — selective recall instead of step-by-step memory traversal.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Let’s contrast how RNNs and Transformers handle sequences:
| Concept | RNN | Transformer |
|---|---|---|
| Memory Type | Hidden state ($h_t$) summarizing past | Attention weights linking all tokens directly |
| Processing | Sequential (one step at a time) | Fully parallel (all tokens at once) |
| Context Range | Local — decays over time | Global — any token can attend to any other |
| Order Awareness | Implicit via recurrence | Explicit via positional encodings |
| Scalability | O(T × H²) sequentially | O(T² × H) but parallelizable |
In essence:
RNNs pass a single thread of memory through time, while Transformers build a web of connections across every token in the sequence.
So, while RNNs propagate memory forward, Transformers broadcast memory everywhere simultaneously.
Why It Works This Way
The key idea behind attention is selectivity.
Each token doesn’t need to remember the entire past — only the relevant parts. The attention mechanism computes a weighted average of all other tokens in the sequence, assigning higher weights to the most contextually important ones.
Mathematically, attention computes:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Here:
- $Q$ = Query (the current token’s request for context)
- $K$ = Key (the identity of other tokens)
- $V$ = Value (the content being shared)
The model learns which parts of the sequence matter most for predicting the next token — without being forced to step through every intermediate position.
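To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, array shapes, and random toy inputs are illustrative assumptions made for this example, not part of any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence.

    Q, K, V: arrays of shape (T, d_k), one row per token (toy shapes).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys: rows sum to 1
    return weights @ V                               # weighted average of values

# Toy example: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): every token receives a context-weighted summary
```

Each output row is exactly the weighted average described above: the softmax row decides how much of every other token's value flows into the current token's representation.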
How It Fits in ML Thinking
This shift from recurrence to attention is one of the most important paradigm changes in machine learning.
- RNNs → process sequences step-by-step; memory is sequential.
- Transformers → process sequences all at once; memory is relational.
Instead of a single evolving state ($h_t$), the Transformer builds direct relationships between all words simultaneously. This sidesteps the vanishing-gradient problem that plagues long recurrent chains and unlocks parallel training, a crucial advantage for scaling to billions of parameters.
Transformers are not “anti-RNN”; they’re the next logical step — retaining the idea of memory but redesigning how it flows.
📐 Step 3: Mathematical Foundation
Attention as Parallelized Recurrence
An RNN’s recurrence relation:
$$ h_t = f(W_{xh}x_t + W_{hh}h_{t-1}) $$

Each $h_t$ depends on the previous state $h_{t-1}$; this is a chain-like dependency.
A Transformer’s attention mechanism instead computes:
$$ h_t = \sum_j \alpha_{t,j} V_j, \quad \text{where } \alpha_{t,j} = \text{softmax}_j\left(\frac{Q_t K_j^T}{\sqrt{d_k}}\right) $$

Now $h_t$ depends directly on all previous (and possibly future) tokens, not just on $h_{t-1}$.
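The sketch below contrasts the two dependency structures. All weights, shapes, and inputs are toy assumptions chosen for illustration: the RNN states must be computed one after another in a loop, while the attention outputs for every position come out of a single batch of matrix operations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                      # toy sequence length and hidden size
x = rng.normal(size=(T, d))      # toy input tokens

# --- RNN: a chain of dependencies; h_t cannot be computed before h_{t-1} ---
W_xh = rng.normal(size=(d, d)) * 0.1
W_hh = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(T):               # inherently sequential over t
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
    rnn_states.append(h)

# --- Attention: every h_t reads from all tokens directly ---
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                        # (T, T), computed at once
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)            # softmax over keys
attn_states = weights @ V                            # all positions in parallel
```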
Positional Encoding
Since Transformers remove recurrence, they lose the natural sense of order. To restore sequence awareness, they inject positional encodings — special vectors added to input embeddings.
These encodings are sinusoidal or learned patterns that tell the model where each token appears in the sequence:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

This allows Transformers to differentiate between “the dog chased the cat” and “the cat chased the dog.”
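A short NumPy sketch of these sinusoidal encodings follows; the function name and the toy dimensions are assumptions made here for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]                 # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]              # index of each (sin, cos) pair
    angles = pos / np.power(10000, 2 * i / d_model)   # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# These vectors are added to the token embeddings so attention can tell positions apart.
pe = sinusoidal_positional_encoding(max_len=16, d_model=8)
print(pe.shape)  # (16, 8)
```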
🧠 Step 4: Assumptions or Key Ideas
- Attention = selective memory: lets the model focus on the most relevant context.
- Transformers = parallel sequence learners: compute dependencies globally instead of stepwise.
- Positional encodings reintroduce order without recurrence.
- The architecture scales efficiently on GPUs because all tokens are processed simultaneously.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Largely solves the long-range dependency problem: any token can attend to any other in a single step.
- Fully parallelizable across tokens, making training drastically faster than with recurrent models.
- Captures complex global relationships between sequence elements.
- Forms the foundation of large-scale models (GPT, BERT, etc.).
⚠️ Limitations
- Quadratic complexity O(T²) in sequence length due to pairwise attention.
- Requires massive computational resources and data.
- Can lose local structure if not trained or regularized properly.
🚧 Step 6: Common Misunderstandings
- “Transformers are completely unrelated to RNNs.” → False — Transformers evolved from RNN limitations, keeping the idea of temporal dependencies but redesigning how they’re handled.
- “Transformers don’t need memory.” → They have distributed memory — attention weights store contextual relationships across tokens.
- “Positional encodings are optional.” → Not true — without them, Transformers lose all sense of order.
🧩 Step 7: Mini Summary
🧠 What You Learned: You explored how the concept of attention replaced recurrence, allowing models to access all past information directly instead of step-by-step.
⚙️ How It Works: Transformers use attention weights to connect all tokens simultaneously, with positional encodings restoring temporal order.
🎯 Why It Matters: This transition marked the leap from sequential, memory-based learners (RNNs) to parallel, context-rich architectures — laying the foundation for today’s large language models.