3.3. Initialization and Training Stability

🪄 Step 1: Intuition & Motivation

  • Core Idea: Every neural network — including Transformers — starts its journey from random weights. But “random” doesn’t mean careless. If we initialize too large, the network’s activations explode. If we initialize too small, they vanish before training even begins.

In very deep architectures like Transformers (often dozens or even hundreds of layers), these small imbalances compound exponentially as signals propagate forward and backward.

So, initialization is not a side detail — it’s the foundation of stable learning. Without it, your Transformer would be like trying to build a skyscraper on a sand dune.


  • Simple Analogy: Imagine a microphone and a speaker in the same room. If the volume (weights) is too high, you get feedback noise (exploding gradients). If it’s too low, you can’t hear anything (vanishing gradients). Proper initialization sets the “volume” just right so the sound (signal) travels clearly through every layer.

🌱 Step 2: Core Concept

Initialization, normalization, and gradient flow form a delicate balance in deep networks. Let’s break this down into three pillars:

  1. Smart Initialization (Xavier & Kaiming)
  2. Normalization as a Stability Partner (LayerNorm)
  3. What Happens Without It (Collapse & Chaos)

1️⃣ Xavier and Kaiming Initialization — Keeping Variance Balanced

When signals pass through many layers, their variance can grow or shrink depending on how weights are initialized. Good initialization keeps the variance of activations and gradients consistent layer-to-layer.

Xavier Initialization (Glorot): Used mainly with tanh or sigmoid activations. It balances forward and backward signal variance:

$$ Var(W) = \frac{2}{n_{in} + n_{out}} $$

Kaiming Initialization (He): Designed for ReLU (and GELU) activations. Since ReLU zeroes out negative activations (roughly halving the signal's variance), Kaiming compensates with a factor of two:

$$ Var(W) = \frac{2}{n_{in}} $$

Here:

  • $n_{in}$ = number of inputs to the neuron
  • $n_{out}$ = number of outputs

This ensures:

  • Forward pass: activations neither vanish nor blow up.
  • Backward pass: gradients stay stable and well-scaled.

Xavier and Kaiming are like thermostats — they set the ideal temperature (variance) so neither the signal nor the gradients overheat or freeze as they travel through layers.
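
To make the two recipes concrete, here is a minimal sketch using PyTorch's built-in initializers (the framework, layer width, and batch size are illustrative assumptions, not part of the text above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 512, 512

# Xavier (Glorot): Var(W) = 2 / (n_in + n_out), suited to tanh/sigmoid.
xavier_layer = nn.Linear(d_in, d_out, bias=False)
nn.init.xavier_normal_(xavier_layer.weight)

# Kaiming (He): Var(W) = 2 / n_in, suited to ReLU-family activations.
kaiming_layer = nn.Linear(d_in, d_out, bias=False)
nn.init.kaiming_normal_(kaiming_layer.weight, nonlinearity='relu')

x = torch.randn(1024, d_in)  # unit-variance inputs
with torch.no_grad():
    h_xavier = torch.tanh(xavier_layer(x))
    h_kaiming = torch.relu(kaiming_layer(x))

print(f"Var(W) Xavier : {xavier_layer.weight.var().item():.5f} (target {2 / (d_in + d_out):.5f})")
print(f"Var(W) Kaiming: {kaiming_layer.weight.var().item():.5f} (target {2 / d_in:.5f})")
print(f"Var(h) after tanh + Xavier : {h_xavier.var().item():.3f}")
print(f"Var(h) after ReLU + Kaiming: {h_kaiming.var().item():.3f}")
```

Running it shows the sampled weight variances matching their targets, and the post-activation variances staying of order one rather than collapsing or blowing up.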

2️⃣ Layer Normalization — The Safety Net Against Instability

Even with careful initialization, deep Transformers face activation drift over time — certain neurons can dominate, skewing distributions.

That’s where Layer Normalization (LayerNorm) comes in: it rescales activations at every layer to maintain a stable mean and variance.

Formula:

$$ \text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta $$

where:

  • $\mu$ = mean of activations across features
  • $\sigma$ = standard deviation
  • $\gamma$, $\beta$ = learned scale and shift parameters

Together with good initialization, LayerNorm ensures:

  • Smooth activation distributions
  • Controlled gradient flow
  • Better convergence

Interaction Insight: LayerNorm complements initialization by constantly “re-centering” activations — like rebalancing a spinning top after every turn.

If initialization sets the stage for balance, LayerNorm is the ongoing act of keeping it upright — catching the model every time it leans too far toward instability.
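
As a sanity check of the formula, here is a small NumPy sketch (the tiny eps added to the denominator for numerical stability is an implementation detail assumed here, not part of the formula above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example across its features, then apply the learned scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)     # per-example mean over features
    sigma = x.std(axis=-1, keepdims=True)   # per-example standard deviation over features
    return (x - mu) / (sigma + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(4, 8))  # activations that have drifted off-center
gamma, beta = np.ones(8), np.zeros(8)             # learned parameters, identity at initialization

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # ≈ 0 for every example
print(y.std(axis=-1))   # ≈ 1 for every example
```

However far the inputs drift, the normalized outputs land back at zero mean and unit variance, which is exactly the "re-centering" role described above.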

3️⃣ Without Proper Initialization — A Recipe for Collapse

Let’s imagine three hypothetical Transformer training runs:

| Initialization | Observation | Outcome |
|---|---|---|
| Too small weights | Activations and gradients shrink at each layer | Model learns nothing (flat loss) |
| Too large weights | Gradients explode | Model oscillates or diverges |
| No LayerNorm | Activation variance drifts uncontrollably | Loss becomes unstable |

You can visualize this as a chain of amplifiers: if each amp in the chain doubles the signal, by layer 100 the signal has grown by a factor of $2^{100}$ — a numerical disaster. If each one halves it, by layer 100 everything has faded to essentially zero.

Proper initialization keeps the signal gain ≈ 1. LayerNorm continuously corrects it when it drifts.
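
A few lines of Python make the amplifier picture tangible; the per-layer gain below is the only knob, and the depths and gains chosen are purely illustrative:

```python
# Compound a per-layer gain over 100 layers: anything away from 1.0 escalates quickly.
for gain in (2.0, 1.05, 1.0, 0.95, 0.5):
    signal = 1.0
    for _ in range(100):
        signal *= gain
    print(f"gain = {gain:<4}: signal after 100 layers ≈ {signal:.3e}")
# Roughly: 2.0 → 1.3e+30, 1.05 → 1.3e+02, 1.0 → 1.0e+00, 0.95 → 5.9e-03, 0.5 → 7.9e-31
```

Even a 5% per-layer imbalance changes the signal by about two orders of magnitude over 100 layers, which is why the gain has to sit essentially at 1.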

Training without good initialization is like whispering through 50 rooms — the message either gets lost or turns into noise before it reaches the last listener.

📐 Step 3: Mathematical Foundation

Variance Preservation Through Layers

Consider a layer:

$$ h = W x $$

To maintain stability, we want:

$$ Var(h) = Var(x) $$

If weights are initialized with $Var(W) = 1/n_{in}$ (assuming zero-mean, mutually independent weights and inputs), then:

$$ Var(h) = n_{in} Var(W) Var(x) = Var(x) $$

This ensures activations retain their scale as they propagate. Too large a $Var(W)$ → $Var(h)$ grows exponentially with depth. Too small → $Var(h)$ shrinks exponentially.

Hence the choice of Xavier or Kaiming depending on activation function type.
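
The derivation can also be verified numerically. The sketch below (pure NumPy, linear layers only, so it isolates the variance argument; the width, depth, and scale factors are arbitrary choices) pushes a unit-variance batch through 50 layers under three weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50
x0 = rng.normal(size=(1024, n))  # unit-variance input batch

for label, var_w in [("too small (0.1/n_in)", 0.1 / n),
                     ("balanced  (1.0/n_in)", 1.0 / n),
                     ("too large (4.0/n_in)", 4.0 / n)]:
    x = x0
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(var_w), size=(n, n))
        x = x @ W  # h = Wx, applied layer after layer
    print(f"{label}: Var after {depth} layers ≈ {x.var():.3e}")
```

The balanced scale keeps the output variance of order one, while the other two shrink or grow it by dozens of orders of magnitude, mirroring the exponential behaviour described above.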


Gradient Stability and Eigenvalues

When backpropagating, gradients are multiplied by weight matrices. Their stability depends on the eigenvalues of those matrices.

If eigenvalues ≫ 1 → exploding gradients. If eigenvalues ≪ 1 → vanishing gradients.

Good initialization ensures eigenvalues stay close to 1, meaning gradients flow evenly.

In Transformers, each layer contains several linear maps (the $Q$, $K$, $V$ projections and the FFN weights), so maintaining this eigenvalue balance across all of them is essential to prevent training collapse.
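
The prose above speaks of eigenvalues; for the generally non-symmetric weight matrices used in practice, the analogous quantity is the singular value spectrum, which is what this rough NumPy sketch inspects (the dimension and the two scales are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

def top_singular_value(var_w):
    """Largest singular value of a d_model x d_model Gaussian matrix with entry variance var_w."""
    W = rng.normal(scale=np.sqrt(var_w), size=(d_model, d_model))
    return np.linalg.svd(W, compute_uv=False).max()

# With a well-scaled init the spectrum stays bounded (top value near 2 for this scale),
# while an unscaled init lets it grow roughly with sqrt(d_model) -- the gradient-amplification risk.
print(f"Var(W) = 1/d_model: top singular value ≈ {top_singular_value(1.0 / d_model):.2f}")
print(f"Var(W) = 1        : top singular value ≈ {top_singular_value(1.0):.2f}")
```

A gradient vector passing backward through a layer can be stretched by at most the top singular value of that layer's weight matrix, so keeping this number bounded independently of width and depth is what prevents the product over dozens of layers from exploding.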


🧠 Step 4: Key Ideas

  • Initialization controls signal and gradient variance across layers.
  • Xavier and Kaiming methods preserve activation scale depending on activation type.
  • LayerNorm dynamically stabilizes activations during training.
  • Poor initialization or lack of normalization → numerical instability, gradient explosion, or total collapse.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Stable signal propagation → faster convergence.
  • Prevents vanishing/exploding gradients.
  • Enables deeper, larger models (dozens to hundreds of layers).

Limitations & trade-offs:

  • Slightly increases computational cost (LayerNorm at every layer).
  • Initialization choices are sensitive to architecture details (activation type, depth).
  • Over-normalization may dampen expressivity in shallow models.

Initialization = foundation, LayerNorm = ongoing correction. You can think of it like launching a rocket: initialization ensures a smooth liftoff, and LayerNorm keeps the flight path stable all the way to orbit.

🚧 Step 6: Common Misunderstandings

  • “Initialization only affects early training.” Wrong — poor initialization causes permanent instability that normalization can’t fully fix.
  • “LayerNorm replaces good initialization.” No — it complements it. They work best together.
  • “Deep Transformers automatically stabilize.” They require precise variance control at every layer to prevent cumulative imbalance.

🧩 Step 7: Mini Summary

🧠 What You Learned: Initialization sets the foundation for stable learning by preserving signal and gradient scales across layers.

⚙️ How It Works: Xavier and Kaiming maintain activation variance; LayerNorm corrects drift during training; without them, deep Transformers collapse.

🎯 Why It Matters: Initialization stability enables Transformers to train at scale — without it, no amount of data or compute can save a model from numerical chaos.
