5.2. Transition to Attention and Transformers
🪄 Step 1: Intuition & Motivation
Core Idea: RNNs were revolutionary — they gave machines the ability to process sequences instead of static data. But as we’ve seen, their sequential nature makes them slow and forgetful. The world needed a model that could:
- See the entire sequence at once, not step-by-step.
- Focus selectively on what matters most — whether that’s a nearby word or something from far back in the past.
That’s the birth of Attention, and later, Transformers — architectures that replaced recurrence with direct, parallelized communication across all time steps.
Simple Analogy: Imagine reading an entire page of a book instead of one word at a time. When you spot the word “he”, your brain instantly recalls who “he” refers to — without sequentially rereading every previous word. That’s attention in action — selective recall instead of step-by-step memory traversal.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Let’s contrast how RNNs and Transformers handle sequences:
| Concept | RNN | Transformer |
|---|---|---|
| Memory Type | Hidden state ($h_t$) summarizing past | Attention weights linking all tokens directly |
| Processing | Sequential (one step at a time) | Fully parallel (all tokens at once) |
| Context Range | Local — decays over time | Global — any token can attend to any other |
| Order Awareness | Implicit via recurrence | Explicit via positional encodings |
| Scalability | O(T × H²) sequentially | O(T² × H) but parallelizable |
In essence:
RNNs pass a single thread of memory through time, while Transformers build a web of connections across every token in the sequence.
So, while RNNs propagate memory forward, Transformers broadcast memory everywhere simultaneously.
Why It Works This Way
The key idea behind attention is selectivity.
Each token doesn’t need to remember the entire past — only the relevant parts. The attention mechanism computes a weighted average of all other tokens in the sequence, assigning higher weights to the most contextually important ones.
Mathematically, attention computes:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Here:
- $Q$ = Query (the current token’s request for context)
- $K$ = Key (the identity of other tokens)
- $V$ = Value (the content being shared)
The model learns which parts of the sequence matter most for predicting the next token — without being forced to step through every intermediate position.
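To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, array shapes, and random toy inputs are illustrative assumptions made for this example, not part of any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence.

    Q, K, V: arrays of shape (T, d_k), one row per token (toy shapes).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys: rows sum to 1
    return weights @ V                               # weighted average of values

# Toy example: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): every token receives a context-weighted summary
```

Each output row is exactly the weighted average described above: the softmax row decides how much of every other token's value flows into the current token's representation.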
How It Fits in ML Thinking
This shift from recurrence to attention is one of the most important paradigm changes in machine learning.
- RNNs → process sequences step-by-step; memory is sequential.
- Transformers → process sequences all at once; memory is relational.
Instead of a single evolving state ($h_t$), the Transformer builds direct relationships between all words simultaneously. This sidesteps the vanishing-gradient problem that plagues long recurrent chains and unlocks parallel training, a crucial advantage for scaling to billions of parameters.
Transformers are not “anti-RNN”; they’re the next logical step — retaining the idea of memory but redesigning how it flows.
📐 Step 3: Mathematical Foundation
Attention as Parallelized Recurrence
An RNN’s recurrence relation:
$$ h_t = f(W_{xh}x_t + W_{hh}h_{t-1}) $$

Each $h_t$ depends on the previous state $h_{t-1}$; this is a chain-like dependency.
A Transformer’s attention mechanism instead computes:
$$ h_t = \sum_j \alpha_{t,j} V_j, \quad \text{where } \alpha_{t,j} = \text{softmax}_j\left(\frac{Q_t K_j^T}{\sqrt{d_k}}\right) $$

Now $h_t$ depends directly on all previous (and possibly future) tokens, not just on $h_{t-1}$.
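The sketch below contrasts the two dependency structures. All weights, shapes, and inputs are toy assumptions chosen for illustration: the RNN states must be computed one after another in a loop, while the attention outputs for every position come out of a single batch of matrix operations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                      # toy sequence length and hidden size
x = rng.normal(size=(T, d))      # toy input tokens

# --- RNN: a chain of dependencies; h_t cannot be computed before h_{t-1} ---
W_xh = rng.normal(size=(d, d)) * 0.1
W_hh = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(T):               # inherently sequential over t
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
    rnn_states.append(h)

# --- Attention: every h_t reads from all tokens directly ---
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                        # (T, T), computed at once
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)            # softmax over keys
attn_states = weights @ V                            # all positions in parallel
```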
Positional Encoding
Since Transformers remove recurrence, they lose the natural sense of order. To restore sequence awareness, they inject positional encodings — special vectors added to input embeddings.
These encodings are sinusoidal or learned patterns that tell the model where each token appears in the sequence:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

This allows Transformers to differentiate between “the dog chased the cat” and “the cat chased the dog.”
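A short NumPy sketch of these sinusoidal encodings follows; the function name and the toy dimensions are assumptions made here for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]                 # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]              # index of each (sin, cos) pair
    angles = pos / np.power(10000, 2 * i / d_model)   # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# These vectors are added to the token embeddings so attention can tell positions apart.
pe = sinusoidal_positional_encoding(max_len=16, d_model=8)
print(pe.shape)  # (16, 8)
```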
🧠 Step 4: Assumptions or Key Ideas
- Attention = selective memory: lets the model focus on the most relevant context.
- Transformers = parallel sequence learners: compute dependencies globally instead of stepwise.
- Positional encodings reintroduce order without recurrence.
- The architecture scales efficiently on GPUs because all tokens are processed simultaneously.
⚖️ Step 5: Strengths, Limitations & Trade-offs
✅ Strengths
- Largely solves the long-range dependency problem: any token can attend to any other in a single step.
- Fully parallelizable across tokens, making training drastically faster than with recurrent models.
- Captures complex global relationships between sequence elements.
- Forms the foundation of large-scale models (GPT, BERT, etc.).
⚠️ Limitations
- Quadratic complexity O(T²) in sequence length due to pairwise attention.
- Requires massive computational resources and data.
- Can lose local structure if not trained or regularized properly.
🚧 Step 6: Common Misunderstandings
- “Transformers are completely unrelated to RNNs.” → False — Transformers evolved from RNN limitations, keeping the idea of temporal dependencies but redesigning how they’re handled.
- “Transformers don’t need memory.” → They have distributed memory — attention weights store contextual relationships across tokens.
- “Positional encodings are optional.” → Not true — without them, Transformers lose all sense of order.
🧩 Step 7: Mini Summary
🧠 What You Learned: You explored how the concept of attention replaced recurrence, allowing models to access all past information directly instead of step-by-step.
⚙️ How It Works: Transformers use attention weights to connect all tokens simultaneously, with positional encodings restoring temporal order.
🎯 Why It Matters: This transition marked the leap from sequential, memory-based learners (RNNs) to parallel, context-rich architectures — laying the foundation for today’s large language models.