3.3. Initialization and Training Stability
🪄 Step 1: Intuition & Motivation
- Core Idea: Every neural network — including Transformers — starts its journey from random weights. But “random” doesn’t mean careless. If we initialize too large, the network’s activations explode. If we initialize too small, they vanish before training even begins.
In very deep architectures like Transformers (often dozens or even hundreds of layers), these tiny imbalances amplify exponentially as signals propagate forward and backward.
So, initialization is not a side detail — it’s the foundation of stable learning. Without it, your Transformer would be like trying to build a skyscraper on a sand dune.
- Simple Analogy: Imagine a microphone and a speaker in the same room. If the volume (weights) is too high, you get feedback noise (exploding gradients). If it’s too low, you can’t hear anything (vanishing gradients). Proper initialization sets the “volume” just right so the sound (signal) travels clearly through every layer.
🌱 Step 2: Core Concept
Initialization, normalization, and gradient flow form a delicate balance in deep networks. Let’s break this down into three pillars:
- Smart Initialization (Xavier & Kaiming)
- Normalization as a Stability Partner (LayerNorm)
- What Happens Without It (Collapse & Chaos)
1️⃣ Xavier and Kaiming Initialization — Keeping Variance Balanced
When signals pass through many layers, their variance can grow or shrink depending on how weights are initialized. Good initialization keeps the variance of activations and gradients consistent layer-to-layer.
Xavier Initialization (Glorot): Used mainly with tanh or sigmoid activations. It balances forward and backward signal variance:
$$ Var(W) = \frac{2}{n_{in} + n_{out}} $$
Kaiming Initialization (He): Designed for ReLU (and GELU) activations. Since ReLU zeroes out negative activations (roughly halving the variance of the signal), Kaiming compensates with a larger scale:
$$ Var(W) = \frac{2}{n_{in}} $$
Here:
- $n_{in}$ = number of inputs to the neuron
- $n_{out}$ = number of outputs
This ensures:
- Forward pass: activations neither vanish nor blow up.
- Backward pass: gradients stay stable and well-scaled.
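As a concrete illustration, here is a minimal PyTorch sketch of how one might apply the two schemes to a model's linear layers. The `init_weights` helper, module sizes, and activation choices are illustrative assumptions, not a fixed recipe:

```python
import torch
import torch.nn as nn

def init_weights(module: nn.Module, activation: str = "relu") -> None:
    """Hypothetical helper: Xavier for tanh/sigmoid, Kaiming for ReLU-like activations."""
    if isinstance(module, nn.Linear):
        if activation in ("tanh", "sigmoid"):
            # Xavier/Glorot: Var(W) = 2 / (n_in + n_out)
            nn.init.xavier_normal_(module.weight)
        else:
            # Kaiming/He: Var(W) = 2 / n_in (fan_in mode, suited to ReLU-family activations)
            nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Example: a small feed-forward block initialized with the Kaiming branch
mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
mlp.apply(lambda m: init_weights(m, activation="relu"))
```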
2️⃣ Layer Normalization — The Safety Net Against Instability
Even with careful initialization, deep Transformers face activation drift over time — certain neurons can dominate, skewing distributions.
That’s where Layer Normalization (LayerNorm) comes in: it rescales activations at every layer to maintain a stable mean and variance.
Formula:
$$ \text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta $$
where:
- $\mu$ = mean of activations across features
- $\sigma$ = standard deviation of activations across features (in practice, a small $\epsilon$ is added for numerical stability)
- $\gamma$, $\beta$ = learned scale and shift parameters
Together with good initialization, LayerNorm ensures:
- Smooth activation distributions
- Controlled gradient flow
- Better convergence
Interaction Insight: LayerNorm complements initialization by constantly “re-centering” activations — like rebalancing a spinning top after every turn.
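To make the formula concrete, here is a from-scratch sketch of LayerNorm checked against PyTorch's built-in `nn.LayerNorm`; the tensor shapes and epsilon value are arbitrary choices for the demo:

```python
import torch
import torch.nn as nn

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Normalize across the feature dimension (last axis), independently per token
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

d_model = 8
x = torch.randn(2, 4, d_model) * 3.0 + 1.5           # deliberately shifted and scaled activations
gamma, beta = torch.ones(d_model), torch.zeros(d_model)

ours = layer_norm(x, gamma, beta)
ref = nn.LayerNorm(d_model)(x)                        # built-in, default gamma=1, beta=0
print(torch.allclose(ours, ref, atol=1e-5))           # True
print(ours.mean(dim=-1)[0, 0].item())                 # ~0: re-centered
print(ours.std(dim=-1, unbiased=False)[0, 0].item())  # ~1: re-scaled
```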
3️⃣ Without Proper Initialization — A Recipe for Collapse
Let’s imagine three hypothetical Transformer training runs:
| Initialization | Observation | Outcome |
|---|---|---|
| Too Small Weights | Activations and gradients shrink each layer | Model learns nothing (flat loss) |
| Too Large Weights | Gradients explode | Model oscillates or diverges |
| No LayerNorm | Activation variance drifts uncontrollably | Loss becomes unstable |
You can visualize this as a chain of amplifiers: If one amp in the chain doubles the signal slightly, by layer 100, the signal grows $2^{100}$ times — a numerical disaster. If it halves the signal slightly, by layer 100, everything fades to zero.
Proper initialization keeps the signal gain ≈ 1. LayerNorm continuously corrects it when it drifts.
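Here is a quick NumPy sketch of that amplifier-chain effect, pushing a signal through 100 random linear layers at three different weight scales; the depth, width, and scale factors are arbitrary illustrative choices:

```python
import numpy as np

def signal_std_after(depth: int, width: int, scale: float, seed: int = 0) -> float:
    """Push a unit-variance signal through `depth` random linear layers and return its std."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    for _ in range(depth):
        # Baseline variance 1/width keeps the per-layer gain ~1; `scale` perturbs it
        W = rng.standard_normal((width, width)) * np.sqrt(1.0 / width) * scale
        x = W @ x
    return float(x.std())

for scale, label in [(0.5, "too small"), (1.0, "balanced"), (1.5, "too large")]:
    print(f"{label:9s} (scale={scale}): final std ≈ {signal_std_after(100, 256, scale):.3e}")
# too small → fades toward zero; balanced → stays O(1); too large → blows up
```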
📐 Step 3: Mathematical Foundation
Variance Preservation Through Layers
Consider a layer:
$$ h = W x $$
To maintain stability, we want:
$$ Var(h) = Var(x) $$
Assuming zero-mean, independent weights and inputs, each component of $h$ has variance $n_{in} \, Var(W) \, Var(x)$. If weights are initialized with $Var(W) = 1/n_{in}$, then:
$$ Var(h) = n_{in} \, Var(W) \, Var(x) = Var(x) $$
This ensures activations retain their scale as they propagate. Too large → $Var(h)$ grows exponentially with depth. Too small → $Var(h)$ shrinks exponentially.
Hence the choice of Xavier or Kaiming depending on activation function type.
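A short empirical check of this identity (the sizes are arbitrary, and since the check is statistical, expect small deviations):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out, batch = 1024, 1024, 4096

x = rng.standard_normal((batch, n_in)) * 2.0                   # Var(x) = 4
W = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)   # Var(W) = 1 / n_in

h = x @ W
print(f"Var(x) = {x.var():.3f}, Var(h) = {h.var():.3f}")       # both ≈ 4: variance preserved
```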
Gradient Stability and Eigenvalues
When backpropagating, gradients are multiplied by weight matrices. Their stability depends on the eigenvalues of those matrices.
- If eigenvalues ≫ 1 → exploding gradients
- If eigenvalues ≪ 1 → vanishing gradients
Good initialization ensures eigenvalues stay close to 1, meaning gradients flow evenly.
In Transformers, since each layer has multiple linear projections ($Q, K, V, FFN$), maintaining eigenvalue balance across all these layers is essential to prevent training collapse.
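A rough PyTorch check of this backward-pass picture: measure the gradient norm at the input of a purely linear stack (no activations, to isolate the repeated matrix multiplications) under undersized, balanced, and oversized initialization. The depth, width, and scale factors here are illustrative assumptions:

```python
import torch
import torch.nn as nn

def input_grad_norm(depth: int = 50, width: int = 256, scale: float = 1.0) -> float:
    torch.manual_seed(0)
    layers = [nn.Linear(width, width, bias=False) for _ in range(depth)]
    for layer in layers:
        # Baseline std sqrt(1/width) keeps the backward gain ~1; `scale` perturbs it
        nn.init.normal_(layer.weight, std=scale * (1.0 / width) ** 0.5)
    net = nn.Sequential(*layers)

    x = torch.randn(1, width, requires_grad=True)
    net(x).sum().backward()          # backprop multiplies the gradient by each layer's weights
    return x.grad.norm().item()

for scale in (0.7, 1.0, 1.5):
    print(f"scale={scale}: input grad norm ≈ {input_grad_norm(scale=scale):.3e}")
# undersized → gradient vanishes by the input; balanced → stays stable; oversized → explodes
```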
🧠 Step 4: Key Ideas
- Initialization controls signal and gradient variance across layers.
- Xavier and Kaiming methods preserve activation scale depending on activation type.
- LayerNorm dynamically stabilizes activations during training.
- Poor initialization or lack of normalization → numerical instability, gradient explosion, or total collapse.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Strengths:
  - Stable signal propagation → faster convergence.
  - Prevents vanishing/exploding gradients.
  - Enables deeper, larger models (hundreds of layers).
- Limitations & Trade-offs:
  - Slightly increases computational cost (LayerNorm at every layer).
  - Initialization choices are sensitive to architecture details (activation function, depth).
  - Over-normalization may dampen expressivity in shallow models.
🚧 Step 6: Common Misunderstandings
- “Initialization only affects early training.” Wrong — poor initialization causes permanent instability that normalization can’t fully fix.
- “LayerNorm replaces good initialization.” No — it complements it. They work best together.
- “Deep Transformers automatically stabilize.” They require precise variance control at every layer to prevent cumulative imbalance.
🧩 Step 7: Mini Summary
🧠 What You Learned: Initialization sets the foundation for stable learning by preserving signal and gradient scales across layers.
⚙️ How It Works: Xavier and Kaiming maintain activation variance; LayerNorm corrects drift during training; without them, deep Transformers collapse.
🎯 Why It Matters: Initialization stability enables Transformers to train at scale — without it, no amount of data or compute can save a model from numerical chaos.