2.5. Encoder vs. Decoder Architecture
🪄 Step 1: Intuition & Motivation
- Core Idea: The Transformer isn’t just one model — it’s a framework that can be shaped into different architectures depending on what we need: understanding, generating, or translating.
The key difference between Encoder, Decoder, and Encoder-Decoder designs is how they use attention — particularly, whether they can see the whole sequence (bidirectional) or only the past (causal masking).
In other words, it’s all about who gets to peek at what.
- Simple Analogy: Imagine three kinds of students reading a passage:
- Encoder student (BERT) — reads the whole text at once to deeply understand it.
- Decoder student (GPT) — reads one word at a time, predicting the next word without peeking ahead.
- Encoder-Decoder student (T5) — a two-student team: one reads and summarizes the passage (encoder), the other writes out the answer (decoder).
Each serves a different purpose, and that’s what makes these architectures powerful for different tasks.
🌱 Step 2: Core Concept
Let’s understand what these three forms of the Transformer architecture do — and how masking determines who sees what.
1️⃣ Encoder-Only Architecture — The Reader (BERT)
- Reads the entire sequence at once.
- Uses bidirectional self-attention, meaning every token can attend to every other token (before and after).
- Great for understanding tasks — classification, sentiment analysis, question answering (where the goal is comprehension, not generation).
Example: BERT can understand the relationship between “bank” and “river” in
“The man sat by the river bank.” because it sees the whole sentence together — before and after context.
Masking: No causal mask here — all tokens see each other freely.
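To make “no mask” concrete, here is a minimal PyTorch sketch (toy sizes, random tensors, illustrative names) of bidirectional self-attention: because nothing is masked, every token ends up with a non-zero attention weight on every other token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 3, 4                      # toy values: 3 tokens, 4-dim queries/keys
Q = torch.randn(seq_len, d_k)            # queries
K = torch.randn(seq_len, d_k)            # keys
V = torch.randn(seq_len, d_k)            # values

scores = Q @ K.T / d_k**0.5              # raw attention scores, shape (3, 3)
weights = F.softmax(scores, dim=-1)      # no mask: every token attends to every token
output = weights @ V

print(weights)                           # all 9 entries are > 0: full bidirectional context
```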
2️⃣ Decoder-Only Architecture — The Storyteller (GPT)
- Generates text from left to right.
- Uses causal (unidirectional) masking, meaning each token can only attend to previous tokens — never future ones.
- Ideal for generation tasks — text completion, dialogue, summarization, and code generation.
Why Causal Masking? If the model could see the future tokens, it’d be cheating during training — it would already know what it’s supposed to predict next.
Conceptually, a causal mask zeroes out the attention weights assigned to future tokens (their scores are pushed to $-\infty$ before the softmax):
| Token (row) attends to → | [I] | [love] | [pizza] |
|---|---|---|---|
| [I] | ✓ | X | X |
| [love] | ✓ | ✓ | X |
| [pizza] | ✓ | ✓ | ✓ |

Each row shows what that token is allowed to see (✓ = visible, X = masked).
So “pizza” can see “I” and “love,” but “I” can’t see “love” or “pizza.”
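As a rough sketch of how this visibility pattern is typically built (PyTorch here; names are illustrative), a lower-triangular boolean matrix gives exactly the “past and present only” rule above:

```python
import torch

seq_len = 3  # [I] [love] [pizza]
# Row i marks which positions token i may attend to (True = visible, False = masked).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False],    <- "I"     sees only itself
#         [ True,  True, False],    <- "love"  sees "I" and itself
#         [ True,  True,  True]])   <- "pizza" sees the whole prefix
```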
3️⃣ Encoder–Decoder Architecture — The Translator (T5, BART)
Combines both worlds:
- Encoder: Reads the full input (bidirectional attention).
- Decoder: Generates output one token at a time (causal attention).
The decoder also attends to the encoder’s outputs — a cross-attention layer connects the two.
This setup is perfect for sequence-to-sequence tasks, such as translation or summarization:
Input (Encoder): “Translate English to French: I love pizza.”
Output (Decoder): “J’adore la pizza.”
The encoder fully understands the input, and the decoder uses that understanding to generate the target sequence step by step.
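A minimal sketch of that cross-attention step (plain PyTorch tensors, illustrative names and sizes, not any particular library’s API): the only difference from self-attention is that the queries come from the decoder, while the keys and values come from the encoder output.

```python
import torch
import torch.nn.functional as F

d_model = 8
enc_len, dec_len = 5, 3                        # toy lengths: source input and target so far

encoder_out = torch.randn(enc_len, d_model)    # full, bidirectionally-encoded input
decoder_state = torch.randn(dec_len, d_model)  # decoder hidden states (causally masked earlier)

# Illustrative projection weights; a real model learns these.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = decoder_state @ W_q                 # queries come from the DECODER
K = encoder_out @ W_k                   # keys come from the ENCODER
V = encoder_out @ W_v                   # values come from the ENCODER

scores = Q @ K.T / d_model**0.5         # shape (dec_len, enc_len)
weights = F.softmax(scores, dim=-1)     # no causal mask here: the full input is visible
context = weights @ V                   # what the decoder "reads" from the source
print(context.shape)                    # torch.Size([3, 8])
```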
📐 Step 3: Mathematical Foundation
Causal Masking Equation
In self-attention, we compute attention scores as
$$ S = \frac{QK^T}{\sqrt{d_k}} $$

To enforce causality, we apply a mask $M$ where

$$ M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases} $$

Then apply softmax:

$$ \text{AttentionWeights} = \text{softmax}(S + M) $$

- The $-\infty$ ensures softmax assigns zero probability to future tokens.
- Thus, token $i$ can only attend to tokens $j \le i$ (the past and present).
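The same three equations in a few lines of PyTorch (a sketch with toy sizes, not a full attention layer): adding $-\infty$ above the diagonal and applying softmax drives every future position’s weight to exactly zero.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 3, 4
Q, K = torch.randn(seq_len, d_k), torch.randn(seq_len, d_k)

S = Q @ K.T / d_k**0.5                                               # S = QK^T / sqrt(d_k)
M = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)  # 0 where j <= i, -inf where j > i
weights = F.softmax(S + M, dim=-1)                                   # softmax(S + M)

print(weights)
# Entries above the diagonal are exactly 0: token i attends only to tokens j <= i.
```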
Bidirectional vs. Unidirectional Context
| Architecture Type | Attention Type | Can Look Ahead? | Typical Use |
|---|---|---|---|
| Encoder-only (BERT) | Bidirectional | ✅ Yes | Understanding / Classification |
| Decoder-only (GPT) | Unidirectional (Causal) | ❌ No | Generation / Completion |
| Encoder–Decoder (T5) | Bidirectional (encoder) + Causal (decoder), linked by cross-attention | Encoder: ✅ / Decoder: ❌ | Translation / Summarization |
🧠 Step 4: Key Ideas
- The difference lies in the visibility mask — who can attend to whom.
- Encoder-only models = complete context (great for understanding).
- Decoder-only models = causal context (great for generation).
- Encoder–Decoder models = two-phase design (read → write).
- Causal masking prevents data leakage and ensures valid autoregressive generation.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Encoder-Only (BERT):
- Full context → deep understanding.
- Ideal for comprehension-based tasks.
- Processes the whole input in one parallel pass, even at inference (no token-by-token loop).
Decoder-Only (GPT):
- Natural for generation.
- Simple architecture.
- No bidirectional context → weaker understanding for ambiguous sentences.
Encoder–Decoder (T5):
- Flexible (input → output mapping).
- Slightly more complex due to cross-attention.
- Heavier compute footprint but extremely powerful for multi-task learning.
🚧 Step 6: Common Misunderstandings
- “Transformers always use bidirectional attention.” No — it depends on architecture and masking. GPT’s attention is strictly unidirectional.
- “Masking is just for padding.” Causal masking enforces temporal logic (no peeking ahead), not just sequence length consistency.
- “Encoder–decoder means double computation.” Not quite: the encoder runs once over the input; the decoder then generates one token at a time, reusing that encoder output through cross-attention rather than re-encoding the input at every step.
🧩 Step 7: Mini Summary
🧠 What You Learned: Different Transformer variants handle context differently depending on whether they must understand, generate, or translate.
⚙️ How It Works: Encoders use bidirectional attention; decoders use causal attention to avoid future leakage; encoder–decoder hybrids combine both.
🎯 Why It Matters: Understanding the role of masking and attention direction helps explain why models like BERT, GPT, and T5 behave so differently — despite sharing the same Transformer backbone.