2.5. Encoder vs. Decoder Architecture


🪄 Step 1: Intuition & Motivation

  • Core Idea: The Transformer isn’t just one model — it’s a framework that can be shaped into different architectures depending on what we need: understanding, generating, or translating.

The key difference between Encoder, Decoder, and Encoder-Decoder designs is how they use attention — particularly, whether they can see the whole sequence (bidirectional) or only the past (causal masking).

In other words, it’s all about who gets to peek at what.


  • Simple Analogy: Imagine three kinds of students reading a passage:
  1. Encoder student (BERT) — reads the whole text at once to deeply understand it.
  2. Decoder student (GPT) — reads one word at a time, predicting the next word without peeking ahead.
  3. Encoder-Decoder student (T5) — one group reads and summarizes (encoder), another writes down the answer (decoder).

Each serves a different purpose, and that’s what makes these architectures powerful for different tasks.


🌱 Step 2: Core Concept

Let’s understand what these three forms of the Transformer architecture do — and how masking determines who sees what.


1️⃣ Encoder-Only Architecture — The Reader (BERT)
  • Reads the entire sequence at once.
  • Uses bidirectional self-attention, meaning every token can attend to every other token (before and after).
  • Great for understanding tasks — classification, sentiment analysis, question answering (where the goal is comprehension, not generation).

Example: BERT can understand the relationship between “bank” and “river” in “The man sat by the river bank.” because it sees the whole sentence together, with context from both before and after each word.

Masking: No causal mask here — all tokens see each other freely.
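
To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, both chosen purely for illustration) that lets BERT fill in a masked word using context on both sides:

```python
from transformers import pipeline

# BERT's fill-mask head relies on bidirectional self-attention: the prediction
# for [MASK] is conditioned on the words both before and after it.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("The man sat by the river [MASK] and watched the water.")
for pred in predictions:
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```

Because the model sees “river” on the left and “watched the water” on the right, a word like “bank” should rank highly; a strictly left-to-right model could not use the right-hand context when making this prediction.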


2️⃣ Decoder-Only Architecture — The Storyteller (GPT)
  • Generates text from left to right.
  • Uses causal (unidirectional) masking, meaning each token can only attend to previous tokens — never future ones.
  • Ideal for generation tasks — text completion, dialogue, summarization, and code generation.

Why Causal Masking? If the model could see the future tokens, it’d be cheating during training — it would already know what it’s supposed to predict next.

Implementation, conceptually: A causal mask forces the attention weights on future tokens to zero, so each token can only attend to itself and earlier positions:

Attends to:  [I]  [love]  [pizza]
[I]           ✓     X       X
[love]        ✓     ✓       X
[pizza]       ✓     ✓       ✓

Each row shows what that token is allowed to attend to (✓ = visible, X = masked).

So “pizza” can see “I” and “love,” but “I” can’t see “love” or “pizza.”
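
Below is a minimal PyTorch sketch (names are my own, purely for illustration) that builds this exact visibility pattern as a lower-triangular matrix:

```python
import torch

tokens = ["I", "love", "pizza"]
T = len(tokens)

# Row i marks which positions token i is allowed to attend to:
# True on and below the diagonal (past + present), False above it (future).
visible = torch.tril(torch.ones(T, T)).bool()

for i, tok in enumerate(tokens):
    row = "  ".join("✓" if visible[i, j] else "X" for j in range(T))
    print(f"{tok:>6} : {row}")
```

Printing the rows reproduces the triangle above: “I” sees only itself, while “pizza” sees all three tokens.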


3️⃣ Encoder–Decoder Architecture — The Translator (T5, BART)
  • Combines both worlds:

    • Encoder: Reads the full input (bidirectional attention).
    • Decoder: Generates output one token at a time (causal attention).
  • Decoder also attends to encoder outputs — a cross-attention layer connects them.

This setup is perfect for sequence-to-sequence tasks, such as translation or summarization:

Input (Encoder): “Translate English to French: I love pizza.”
Output (Decoder): “J’adore la pizza.”

The encoder fully understands the input, and the decoder uses that understanding to generate the target sequence step by step.
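
As a hedged, concrete sketch of this flow (assuming the Hugging Face transformers library and the public t5-small checkpoint, neither of which the text above prescribes), the translation example could be run like this:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the entire prompt with bidirectional attention.
inputs = tokenizer("translate English to French: I love pizza.", return_tensors="pt")

# The decoder then generates the French text one token at a time,
# attending to the encoder's outputs through cross-attention.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The exact French wording depends on the checkpoint; the point is the division of labor: encode the input once, then decode autoregressively.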


📐 Step 3: Mathematical Foundation

Causal Masking Equation

In self-attention, we compute attention scores as

$$ S = \frac{QK^T}{\sqrt{d_k}} $$

To enforce causality, we apply a mask $M$ where

$$ M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases} $$

Then apply softmax:

$$ \text{AttentionWeights} = \text{softmax}(S + M) $$
  • The $-\infty$ ensures softmax assigns zero probability to future tokens.
  • Thus, token $i$ can only attend to tokens $j \le i$ (the past and present).
Causal masking is like covering future words with sticky notes — the model only sees the past context and must guess what comes next.
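
A tiny numerical sketch (random scores, illustrative variable names) shows how the $-\infty$ entries become exact zeros after the softmax:

```python
import torch

T = 3  # sequence length
scores = torch.randn(T, T)  # stands in for Q @ K.T / sqrt(d_k)

# M: 0 on and below the diagonal (j <= i), -inf strictly above it (j > i).
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)
print(weights)  # upper triangle is exactly 0; each row still sums to 1
```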

Bidirectional vs. Unidirectional Context

| Architecture Type | Attention Type | Can Look Ahead? | Typical Use |
| --- | --- | --- | --- |
| Encoder-only (BERT) | Bidirectional | ✅ Yes | Understanding / Classification |
| Decoder-only (GPT) | Unidirectional (causal) | ❌ No | Generation / Completion |
| Encoder–Decoder (T5) | Bi + Uni (cross-attention) | Encoder: ✅ / Decoder: ❌ | Translation / Summarization |

Think of BERT as reading the whole book, GPT as writing one page at a time, and T5 as reading one book and writing its summary.

🧠 Step 4: Key Ideas

  • The difference lies in the visibility mask — who can attend to whom.
  • Encoder-only models = complete context (great for understanding).
  • Decoder-only models = causal context (great for generation).
  • Encoder–Decoder models = two-phase design (read → write).
  • Causal masking prevents data leakage and ensures valid autoregressive generation.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Encoder-Only (BERT):

  • Full context → deep understanding.
  • Ideal for comprehension-based tasks.
  • Parallelizable during training.

Decoder-Only (GPT):

  • Natural for generation.
  • Simple architecture.
  • No bidirectional context → weaker understanding for ambiguous sentences.

Encoder–Decoder (T5):

  • Flexible (input → output mapping).
  • Slightly more complex due to cross-attention.
  • Heavier compute footprint but extremely powerful for multi-task learning.

🚧 Step 6: Common Misunderstandings

  • “Transformers always use bidirectional attention.” No — it depends on architecture and masking. GPT’s attention is strictly unidirectional.
  • “Masking is just for padding.” Causal masking enforces temporal logic (no peeking ahead), not just sequence length consistency.
  • “Encoder–decoder means double computation.” No: the encoder runs over the input only once; the decoder then generates the output one token at a time, reusing the encoder’s outputs through cross-attention.

🧩 Step 7: Mini Summary

🧠 What You Learned: Different Transformer variants handle context differently depending on whether they must understand, generate, or translate.

⚙️ How It Works: Encoders use bidirectional attention; decoders use causal attention to avoid future leakage; encoder–decoder hybrids combine both.

🎯 Why It Matters: Understanding the role of masking and attention direction helps explain why models like BERT, GPT, and T5 behave so differently — despite sharing the same Transformer backbone.
