2.5. Encoder vs. Decoder Architecture
🪄 Step 1: Intuition & Motivation
- Core Idea: The Transformer isn’t just one model — it’s a framework that can be shaped into different architectures depending on what we need: understanding, generating, or translating.
The key difference between Encoder, Decoder, and Encoder-Decoder designs is how they use attention — particularly, whether they can see the whole sequence (bidirectional) or only the past (causal masking).
In other words, it’s all about who gets to peek at what.
- Simple Analogy: Imagine three kinds of students reading a passage:
- Encoder student (BERT) — reads the whole text at once to deeply understand it.
- Decoder student (GPT) — reads one word at a time, predicting the next word without peeking ahead.
- Encoder-Decoder student (T5) — a two-student team: one reads and summarizes the passage (encoder), the other writes out the answer (decoder).
Each serves a different purpose, and that’s what makes these architectures powerful for different tasks.
🌱 Step 2: Core Concept
Let’s understand what these three forms of the Transformer architecture do — and how masking determines who sees what.
1️⃣ Encoder-Only Architecture — The Reader (BERT)
- Reads the entire sequence at once.
- Uses bidirectional self-attention, meaning every token can attend to every other token (before and after).
- Great for understanding tasks — classification, sentiment analysis, question answering (where the goal is comprehension, not generation).
Example: BERT can understand the relationship between “bank” and “river” in
“The man sat by the river bank.” because it sees the whole sentence together — before and after context.
Masking: No causal mask here — all tokens see each other freely.
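To make “no mask” concrete, here is a minimal PyTorch sketch (toy sizes, random tensors, illustrative names) of bidirectional self-attention: because nothing is masked, every token ends up with a non-zero attention weight on every other token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 3, 4                      # toy values: 3 tokens, 4-dim queries/keys
Q = torch.randn(seq_len, d_k)            # queries
K = torch.randn(seq_len, d_k)            # keys
V = torch.randn(seq_len, d_k)            # values

scores = Q @ K.T / d_k**0.5              # raw attention scores, shape (3, 3)
weights = F.softmax(scores, dim=-1)      # no mask: every token attends to every token
output = weights @ V

print(weights)                           # all 9 entries are > 0: full bidirectional context
```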
2️⃣ Decoder-Only Architecture — The Storyteller (GPT)
- Generates text from left to right.
- Uses causal (unidirectional) masking, meaning each token can only attend to previous tokens — never future ones.
- Ideal for generation tasks — text completion, dialogue, summarization, and code generation.
Why Causal Masking? If the model could see the future tokens, it’d be cheating during training — it would already know what it’s supposed to predict next.
Conceptually, a causal mask zeroes out the attention weights assigned to future tokens (their scores are pushed to $-\infty$ before the softmax):
| Token (row) attends to → | [I] | [love] | [pizza] |
|---|---|---|---|
| [I] | ✓ | X | X |
| [love] | ✓ | ✓ | X |
| [pizza] | ✓ | ✓ | ✓ |

Each row shows what that token is allowed to see (✓ = visible, X = masked).
So “pizza” can see “I” and “love,” but “I” can’t see “love” or “pizza.”
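As a rough sketch of how this visibility pattern is typically built (PyTorch here; names are illustrative), a lower-triangular boolean matrix gives exactly the “past and present only” rule above:

```python
import torch

seq_len = 3  # [I] [love] [pizza]
# Row i marks which positions token i may attend to (True = visible, False = masked).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False],    <- "I"     sees only itself
#         [ True,  True, False],    <- "love"  sees "I" and itself
#         [ True,  True,  True]])   <- "pizza" sees the whole prefix
```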
3️⃣ Encoder–Decoder Architecture — The Translator (T5, BART)
Combines both worlds:
- Encoder: Reads the full input (bidirectional attention).
- Decoder: Generates output one token at a time (causal attention).
The decoder also attends to the encoder’s outputs — a cross-attention layer connects the two.
This setup is perfect for sequence-to-sequence tasks, such as translation or summarization:
Input (Encoder): “Translate English to French: I love pizza.”
Output (Decoder): “J’adore la pizza.”
The encoder fully understands the input, and the decoder uses that understanding to generate the target sequence step by step.
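A minimal sketch of that cross-attention step (plain PyTorch tensors, illustrative names and sizes, not any particular library’s API): the only difference from self-attention is that the queries come from the decoder, while the keys and values come from the encoder output.

```python
import torch
import torch.nn.functional as F

d_model = 8
enc_len, dec_len = 5, 3                        # toy lengths: source input and target so far

encoder_out = torch.randn(enc_len, d_model)    # full, bidirectionally-encoded input
decoder_state = torch.randn(dec_len, d_model)  # decoder hidden states (causally masked earlier)

# Illustrative projection weights; a real model learns these.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = decoder_state @ W_q                 # queries come from the DECODER
K = encoder_out @ W_k                   # keys come from the ENCODER
V = encoder_out @ W_v                   # values come from the ENCODER

scores = Q @ K.T / d_model**0.5         # shape (dec_len, enc_len)
weights = F.softmax(scores, dim=-1)     # no causal mask here: the full input is visible
context = weights @ V                   # what the decoder "reads" from the source
print(context.shape)                    # torch.Size([3, 8])
```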
📐 Step 3: Mathematical Foundation
Causal Masking Equation
In self-attention, we compute attention scores as
$$ S = \frac{QK^T}{\sqrt{d_k}} $$

To enforce causality, we apply a mask $M$ where

$$ M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases} $$

Then apply softmax:

$$ \text{AttentionWeights} = \text{softmax}(S + M) $$

- The $-\infty$ ensures softmax assigns zero probability to future tokens.
- Thus, token $i$ can only attend to tokens $j \le i$ (the past and present).
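The same three equations in a few lines of PyTorch (a sketch with toy sizes, not a full attention layer): adding $-\infty$ above the diagonal and applying softmax drives every future position’s weight to exactly zero.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 3, 4
Q, K = torch.randn(seq_len, d_k), torch.randn(seq_len, d_k)

S = Q @ K.T / d_k**0.5                                               # S = QK^T / sqrt(d_k)
M = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)  # 0 where j <= i, -inf where j > i
weights = F.softmax(S + M, dim=-1)                                   # softmax(S + M)

print(weights)
# Entries above the diagonal are exactly 0: token i attends only to tokens j <= i.
```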
Bidirectional vs. Unidirectional Context
| Architecture Type | Attention Type | Can Look Ahead? | Typical Use |
|---|---|---|---|
| Encoder-only (BERT) | Bidirectional | ✅ Yes | Understanding / Classification |
| Decoder-only (GPT) | Unidirectional (Causal) | ❌ No | Generation / Completion |
| Encoder–Decoder (T5) | Bidirectional (encoder) + Causal (decoder), linked by cross-attention | Encoder: ✅ / Decoder: ❌ | Translation / Summarization |
🧠 Step 4: Key Ideas
- The difference lies in the visibility mask — who can attend to whom.
- Encoder-only models = complete context (great for understanding).
- Decoder-only models = causal context (great for generation).
- Encoder–Decoder models = two-phase design (read → write).
- Causal masking prevents data leakage and ensures valid autoregressive generation.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Encoder-Only (BERT):
- Full context → deep understanding.
- Ideal for comprehension-based tasks.
- Processes the whole input in one parallel pass, even at inference (no token-by-token loop).
Decoder-Only (GPT):
- Natural for generation.
- Simple architecture.
- No bidirectional context → weaker understanding for ambiguous sentences.
Encoder–Decoder (T5):
- Flexible (input → output mapping).
- Slightly more complex due to cross-attention.
- Heavier compute footprint but extremely powerful for multi-task learning.
🚧 Step 6: Common Misunderstandings
- “Transformers always use bidirectional attention.” No — it depends on architecture and masking. GPT’s attention is strictly unidirectional.
- “Masking is just for padding.” Causal masking enforces temporal logic (no peeking ahead), not just sequence length consistency.
- “Encoder–decoder means double computation.” Not quite: the encoder runs once over the input; the decoder then generates one token at a time, reusing that encoder output through cross-attention rather than re-encoding the input at every step.
🧩 Step 7: Mini Summary
🧠 What You Learned: Different Transformer variants handle context differently depending on whether they must understand, generate, or translate.
⚙️ How It Works: Encoders use bidirectional attention; decoders use causal attention to avoid future leakage; encoder–decoder hybrids combine both.
🎯 Why It Matters: Understanding the role of masking and attention direction helps explain why models like BERT, GPT, and T5 behave so differently — despite sharing the same Transformer backbone.