2.4. Feed-Forward Layers and Residual Connections

🪄 Step 1: Intuition & Motivation

  • Core Idea: After each attention block, the Transformer doesn’t stop — it needs to further process and transform the attended information. That’s where the Feed-Forward Network (FFN) comes in: it’s like the “thinking” step after “listening.”

But here’s the problem — stacking many layers can make gradients unstable and slow down learning. So, the Transformer adds residual connections (shortcuts for information flow) and Layer Normalization (to keep activations stable).

Together, they form what’s often called the FFN Sandwich:

Attention → Add & Norm → Feed-Forward → Add & Norm


  • Simple Analogy: Think of a Transformer layer like a team discussion.
      • Attention is the listening round: everyone hears from everyone else.
      • FFN is the reflection round: each person privately processes what they heard.
      • Residual connections are like note-passing between meetings: no one forgets the earlier context.
      • LayerNorm keeps everyone speaking at the same “volume,” so no one dominates or disappears.

🌱 Step 2: Core Concept

Let’s unpack the three building blocks that make this layer work like a finely tuned orchestra.


1️⃣ The Feed-Forward Network (FFN)

After attention mixes information across tokens, the FFN operates independently on each token’s representation.

It’s a small, two-layer neural network:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
  • $x$: the token’s embedding after attention
  • $W_1$: expands dimensionality (e.g., from 512 → 2048)
  • $W_2$: projects it back down (2048 → 512)
  • ReLU (or GELU) adds non-linearity

So, the FFN helps each token refine its internal representation — like “digesting” the context it just received from attention.

Attention mixes across tokens (horizontal thinking). FFN transforms within tokens (vertical deepening).
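
Here is a minimal PyTorch sketch of this position-wise FFN (the class name and default sizes are illustrative, not tied to any particular library):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same two-layer MLP is applied to every token."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # expand: 512 -> 2048
        self.linear2 = nn.Linear(d_ff, d_model)  # project back: 2048 -> 512
        self.activation = nn.ReLU()              # GELU is common in modern models

    def forward(self, x):
        # x has shape (batch, seq_len, d_model); tokens never mix here
        return self.linear2(self.activation(self.linear1(x)))
```

Notice that nothing in `forward` looks at other positions; all cross-token mixing already happened in attention.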

2️⃣ Residual Connections — The Shortcut Highway

Residuals add a shortcut from the layer’s input to its output:

$$ y = x + F(x) $$

This means the network doesn’t need to relearn identity mappings — it can add improvements instead of starting from scratch each layer.

Benefits:

  • Prevents vanishing gradients — gradients can flow directly through the skip path.
  • Speeds up convergence — layers can refine representations incrementally.
  • Makes deep models more stable.

It’s like giving the model a safety net:

“If you can’t improve much, at least don’t ruin what’s already working.”
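
In code, the residual idea is a single addition. A tiny sketch (the helper name is ours, purely for illustration):

```python
def add_residual(x, sublayer):
    # y = x + F(x): the sublayer only has to learn a correction to x,
    # and gradients can flow straight through the "+ x" path.
    return x + sublayer(x)

# e.g., y = add_residual(x, FeedForward(d_model=512))  # using the FFN sketch above
```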


3️⃣ Layer Normalization — Keeping Signals Balanced

Deep layers can make activations blow up or vanish due to compounding transformations. LayerNorm fixes that by normalizing across features:

$$ \text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta $$
  • $\mu$: mean of all features for one token
  • $\sigma$: standard deviation
  • $\gamma$, $\beta$: learned scaling and shifting

This ensures every token’s feature vector has a consistent scale and distribution.

It’s like adjusting microphone levels so everyone speaks clearly — not too loud, not too soft.
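
A from-scratch sketch of this formula in PyTorch (real implementations such as `nn.LayerNorm` also add a small epsilon for numerical stability, as done here):

```python
import torch
import torch.nn as nn

class SimpleLayerNorm(nn.Module):
    """Illustrative LayerNorm: normalize each token's feature vector."""
    def __init__(self, d_model=512, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learned scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learned shift
        self.eps = eps

    def forward(self, x):
        # mean and variance are taken per token, across its features (last dim)
        mu = x.mean(dim=-1, keepdim=True)
        var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
        return (x - mu) / torch.sqrt(var + self.eps) * self.gamma + self.beta
```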


4️⃣ The Complete Transformer Block — The FFN Sandwich

Putting it all together:

  1. Input goes into Multi-Head Attention (MHA).
  2. Output is added back to input via a residual connection, then LayerNorm is applied.
  3. That normalized output is fed into the Feed-Forward Network (FFN).
  4. Another residual + LayerNorm follows.

So, the structure looks like:

x → [MHA] → Add & Norm → [FFN] → Add & Norm → output

Each step stabilizes the next — balancing learning and preventing signal explosion.
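
Putting the four steps into one (post-norm) block, here is a sketch using PyTorch's built-in `nn.MultiheadAttention`; the hyperparameters are illustrative:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """One encoder-style Transformer block in the original post-norm layout."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Steps 1-2: multi-head self-attention, then Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Steps 3-4: feed-forward network, then Add & Norm
        x = self.norm2(x + self.ffn(x))
        return x
```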


📐 Step 3: Mathematical Foundation

Residual Learning Equation
$$ \begin{aligned} \tilde{x} &= \text{LayerNorm}(x + \text{MHA}(x)) \\ y &= \text{LayerNorm}(\tilde{x} + \text{FFN}(\tilde{x})) \end{aligned} $$
  • The model learns residual functions: $\text{MHA}(x)$ and $\text{FFN}(\tilde{x})$.
  • These residuals modify the input just enough to improve it, without overwriting stable knowledge.

Each layer says, “Here’s a small useful tweak to the current understanding.” Residuals ensure learning is evolutionary, not destructive.

Pre-Norm vs. Post-Norm Debate

Two major architectures handle LayerNorm placement differently:

  1. Post-Norm (Original Transformer): Apply LayerNorm after residual addition.

    y = LayerNorm(x + SubLayer(x))
    • More straightforward
    • Works fine for shallow networks
  2. Pre-Norm (Transformer-XL, GPT-2 and later): Apply LayerNorm before the sublayer.

    y = x + SubLayer(LayerNorm(x))
    • More stable for deep Transformers (helps gradient flow)
    • Avoids exploding activations

Why the difference? In deeper networks, Post-Norm can cause gradients to vanish through many stacked norms. Pre-Norm ensures gradients flow more directly from top to bottom.

  • Use Post-Norm for shallow encoders (like BERT-base).
  • Use Pre-Norm for large decoders (like GPT-style LLMs).
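
The difference fits in two lines. A minimal sketch, assuming `sublayer` is either the attention or FFN module and `norm` is a LayerNorm instance:

```python
# Post-Norm (original Transformer): normalize after the residual addition
def post_norm_step(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-Norm (GPT-2-style): normalize before the sublayer, keep the skip path clean
def pre_norm_step(x, sublayer, norm):
    return x + sublayer(norm(x))
```

Because Pre-Norm leaves the skip path untouched by normalization, gradients reach lower layers more directly, which is why very deep stacks tend to prefer it.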

🧠 Step 4: Key Ideas

  • Residuals = gradient highways 🚗
  • LayerNorm = activation stabilizer ⚖️
  • FFN = nonlinear transformation within each token 🧩
  • Pre-Norm helps very deep models converge faster and stay numerically stable.

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Strengths:
      • Prevents vanishing/exploding gradients.
      • Enables extremely deep Transformers (100+ layers).
      • Maintains stability through normalization.
  • Limitations:
      • Adds computational overhead (two norms per layer).
      • Residual paths can slow adaptation: the model may rely too much on shortcuts.
      • Choosing pre- vs. post-norm affects convergence and training time.

Residuals make training smoother but require careful norm placement. It’s like having both a highway and local roads: the highway moves fast, but you still need well-designed exits (LayerNorm) to keep things controlled.

🚧 Step 6: Common Misunderstandings

  • “Residuals just copy the input.” Not quite — they preserve information flow but combine it with learned refinements.
  • “LayerNorm is for regularization.” Its main role is stabilization, not preventing overfitting.
  • “Pre-Norm always beats Post-Norm.” Depends on depth — Pre-Norm helps deep models, but Post-Norm can perform slightly better on small ones.

🧩 Step 7: Mini Summary

🧠 What You Learned: Feed-Forward Networks refine token-wise representations, while residuals and LayerNorm ensure smooth and stable learning across deep layers.

⚙️ How It Works: The Transformer layer forms an “FFN sandwich”: Attention → Add & Norm → FFN → Add & Norm, maintaining balance between depth and stability.

🎯 Why It Matters: Without residuals and normalization, large Transformers would crumble under unstable gradients or slow convergence.
