2.1. Derive BPTT Mathematically


🪄 Step 1: Intuition & Motivation

  • Core Idea: When training an RNN, we don’t just learn from a single input–output pair — we learn from the entire sequence. But here’s the tricky part: since each time step depends on the previous one, the gradient at the current step also depends on all the past steps. To handle this, RNNs use a special form of backpropagation called Backpropagation Through Time (BPTT) — it unfolds the RNN across time and computes gradients step by step, reaching backward through the sequence.

  • Simple Analogy: Imagine watching a movie, and at the end, you decide whether it was good or bad. If you want to figure out why you felt that way, you mentally rewind and recall key scenes that influenced your opinion. BPTT works the same way — it “rewinds through time” to assign credit or blame to earlier time steps for the final prediction error.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When training a standard feedforward network, we compute gradients layer by layer, from the output back to the input. For an RNN, however, the layers are spread across time — each time step is a layer in the unrolled network.

During the forward pass:

  1. Each $x_t$ (input at time $t$) produces a hidden state $h_t$ and output $y_t$.
  2. The total loss $L$ is the sum of the losses from all time steps, as in the sketch below: $$ L = \sum_t L_t(y_t, \hat{y}_t) $$
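
As a concrete illustration of this forward pass, here is a minimal NumPy sketch of a vanilla tanh RNN with per-step squared-error losses. The dimensions, variable names, and loss choice are assumptions made only for this example, not part of the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, chosen only for this sketch
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 10

# Shared parameters, reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

xs = rng.normal(size=(T, input_dim))         # inputs x_1 .. x_T
targets = rng.normal(size=(T, output_dim))   # targets for each step

h = np.zeros(hidden_dim)                     # initial hidden state h_0
hs, ys, total_loss = [h], [], 0.0

for t in range(T):
    a = W_xh @ xs[t] + W_hh @ h              # pre-activation a_t
    h = np.tanh(a)                           # hidden state h_t
    y = W_hy @ h                             # output y_t
    total_loss += 0.5 * np.sum((y - targets[t]) ** 2)  # L_t added into L
    hs.append(h)                             # hs[t + 1] stores h_t
    ys.append(y)
```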

During the backward pass (BPTT):

  • We compute $\frac{\partial L}{\partial W_{xh}}, \frac{\partial L}{\partial W_{hh}}, \frac{\partial L}{\partial W_{hy}}$ — gradients for each parameter — by moving back through time.
  • Each hidden state $h_t$ affects not just the current output $y_t$, but all future outputs $y_{t+1}, y_{t+2}, \dots$
  • Hence, the gradient for $W_{hh}$ accumulates contributions from all time steps.

That’s why it’s called Backpropagation Through Time: the network revisits its own history.
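Continuing the sketch above, the backward pass visits the same time steps in reverse and accumulates gradients into the shared matrices. The hidden-state gradient `dh_next` is exactly the signal that "revisits history." This is an illustrative hand-written BPTT, assuming the variables defined in the forward-pass sketch:

```python
# Backward pass (BPTT), reusing W_xh, W_hh, W_hy, xs, targets, hs, ys from above.
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dW_hy = np.zeros_like(W_hy)
dh_next = np.zeros(hidden_dim)          # gradient arriving from step t + 1

for t in reversed(range(T)):
    dy = ys[t] - targets[t]             # dL_t / dy_t for squared error
    dW_hy += np.outer(dy, hs[t + 1])    # hs[t + 1] is h_t
    dh = W_hy.T @ dy + dh_next          # this step's loss + all future steps
    da = (1.0 - hs[t + 1] ** 2) * dh    # tanh'(a_t) = 1 - h_t^2
    dW_xh += np.outer(da, xs[t])
    dW_hh += np.outer(da, hs[t])        # hs[t] is h_{t-1}
    dh_next = W_hh.T @ da               # send the gradient back through time
```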


Why It Works This Way

An RNN “remembers” through recurrence — but that also means its gradients have to flow through those same recurrent links.

At each step, the error signal passes through the chain of dependencies:

$$ \frac{\partial L}{\partial W_{hh}} = \sum_t \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} $$

But $h_t$ depends on $W_{hh}$ in two ways: directly, because $W_{hh}$ appears in $h_t$'s own update, and indirectly, through $h_{t-1}$. So we need to expand that dependency backward:

$$ \frac{\partial h_t}{\partial W_{hh}} = \frac{\partial^{+} h_t}{\partial W_{hh}} + \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}} $$

Here $\frac{\partial^{+} h_t}{\partial W_{hh}}$ denotes the "immediate" derivative, taken with $h_{t-1}$ held fixed.

This creates a recursive chain of derivatives across time steps.

Each link in that chain multiplies by the Jacobian matrix $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(f'(a_t))\, W_{hh}$, where $a_t$ is the pre-activation at step $t$. When we multiply many such matrices (one for each time step), factors with norms below 1 shrink the gradient rapidly (vanishing gradients), while factors with norms above 1 grow it exponentially (exploding gradients).


How It Fits in ML Thinking

Backpropagation Through Time is how RNNs learn dependencies across time. It’s the bridge that connects the temporal chain of computation (the RNN’s structure) to gradient-based optimization (the training process).

Understanding BPTT means understanding how learning flows backward — and why long-term dependencies are hard for RNNs to capture. This concept is what inspired LSTMs and GRUs — models that modify the RNN structure to stabilize gradient flow.


📐 Step 3: Mathematical Foundation

Loss Function Over Time
$$ L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t) $$

Each $L_t$ is the error at time step $t$. During training, we want to minimize this total loss with respect to all parameters.

Think of each $L_t$ as a tiny report card for that time step. The final grade ($L$) is the sum of all the report cards. The model learns by adjusting parameters so all report cards improve simultaneously.

Recursive Gradient Accumulation

The gradient of the total loss with respect to recurrent weights is:

$$ \frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} $$

Because each $h_t$ depends on $W_{hh}$ both directly and through $h_{t-1}$:

$$ \frac{\partial h_t}{\partial W_{hh}} = \frac{\partial^{+} h_t}{\partial W_{hh}} + \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial W_{hh}} $$

where $\frac{\partial^{+} h_t}{\partial W_{hh}}$ is the immediate derivative at step $t$ (with $h_{t-1}$ treated as a constant). Expanding this recursion through time yields:

$$ \frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial^{+} h_k}{\partial W_{hh}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}(f'(a_i))\, W_{hh} $$

This repeated product of Jacobians is what causes gradient instability.
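Because BPTT is nothing more than the chain rule, the accumulated gradient can be verified numerically. Continuing the same illustrative sketch from Step 2, a central finite difference on one entry of $W_{hh}$ should closely match the analytic `dW_hh` computed there:

```python
# Finite-difference check of one entry of dW_hh (continues the sketch above).
def total_loss_for(W_hh_candidate):
    h, loss = np.zeros(hidden_dim), 0.0
    for t in range(T):
        h = np.tanh(W_xh @ xs[t] + W_hh_candidate @ h)
        loss += 0.5 * np.sum((W_hy @ h - targets[t]) ** 2)
    return loss

eps, i, j = 1e-5, 0, 0
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
numeric = (total_loss_for(W_plus) - total_loss_for(W_minus)) / (2 * eps)
print(numeric, dW_hh[i, j])   # the two values should agree to several decimals
```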


Gradient Explosion / Vanishing

The term

$$ \frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \mathrm{diag}(f'(a_i))\, W_{hh} $$

acts as a time-propagated multiplier between step $k$ and step $t$.

  • If the largest singular value of $W_{hh}$ is below 1 (and $|f'| \le 1$, as with tanh), the product shrinks exponentially with the time gap → vanishing gradient.
  • If the spectral radius of $W_{hh}$ is above 1, the product can grow exponentially → exploding gradient.

Think of it like a game of telephone: if the message (gradient) is slightly distorted each time it is passed along, after many steps it becomes either a faint whisper (vanishing) or a chaotic shout (exploding).
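A quick scalar intuition: $0.9^{50} \approx 0.005$, while $1.1^{50} \approx 117$. The same effect can be seen with matrices in a small, self-contained experiment (illustrative only; the sizes, scales, and the random stand-in for $f'$ are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, steps = 5, 50

def gradient_norm_through_time(weight_scale):
    """Norm of a product of Jacobian-like factors diag(f'(a_i)) @ W_hh."""
    W_hh = weight_scale * rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
    product = np.eye(hidden_dim)
    for _ in range(steps):
        fprime = rng.uniform(0.1, 1.0, size=hidden_dim)   # stand-in for tanh'(a_i) in (0, 1]
        product = np.diag(fprime) @ W_hh @ product
    return np.linalg.norm(product)

print("small recurrent weights:", gradient_norm_through_time(0.5))  # typically shrinks toward 0
print("large recurrent weights:", gradient_norm_through_time(3.0))  # typically grows very large
```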

🧠 Step 4: Assumptions or Key Ideas

  • The RNN is unrolled over time, meaning we view each time step as a layer in a deep network.
  • All time steps share weights — hence gradients accumulate across steps.
  • BPTT assumes we can compute gradients across the entire sequence, which becomes computationally expensive for long inputs.
  • In practice, we often use Truncated BPTT — we unroll only for a limited number of steps (say, 10–20) to balance accuracy and efficiency.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths

  • Enables learning from sequential dependencies through gradient-based optimization.
  • Mathematically principled — directly derived from chain rule.
  • Forms the training foundation for all recurrent models (RNN, LSTM, GRU).

⚠️ Limitations

  • Prone to vanishing/exploding gradients, making it hard to learn long-term dependencies.
  • Requires high computational cost — especially for long sequences.
  • Sensitive to initialization of $W_{hh}$ and choice of activation function.
⚖️ Trade-offs

BPTT gives precision (exact gradients) but sacrifices efficiency. Many modern models use approximations like truncated BPTT or architectures like LSTMs that reduce gradient instability while keeping the same learning principle.
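As a sketch of how truncated BPTT looks in practice, the loop below detaches the hidden state every $k$ steps so gradients never flow across the window boundary. Using PyTorch here, along with the sizes and window length, is an assumption of this example rather than something prescribed by the text:

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim, T, k = 3, 5, 200, 20   # illustrative sizes and window length

cell = torch.nn.RNNCell(input_dim, hidden_dim)
readout = torch.nn.Linear(hidden_dim, 1)
optimizer = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)

xs = torch.randn(T, 1, input_dim)       # a long input sequence (batch size 1)
targets = torch.randn(T, 1, 1)

h = torch.zeros(1, hidden_dim)
for start in range(0, T, k):
    h = h.detach()                      # cut the graph: no gradient flows past this boundary
    loss = 0.0
    for t in range(start, min(start + k, T)):
        h = cell(xs[t], h)
        loss = loss + torch.nn.functional.mse_loss(readout(h), targets[t])
    optimizer.zero_grad()
    loss.backward()                     # backprop through at most k time steps
    optimizer.step()
```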

🚧 Step 6: Common Misunderstandings

  • “BPTT is a new algorithm.” → It’s not separate from backpropagation — it’s just backprop applied across time.
  • “Vanishing gradients mean the network isn’t learning.” → It’s still learning short-term dependencies — it just can’t retain long-term memory.
  • “You can avoid vanishing gradients by using ReLU.” → Not always. The recurrent weight matrix $W_{hh}$ itself causes instability, not just the activation.

🧩 Step 7: Mini Summary

🧠 What You Learned: You explored how gradients are computed across time using Backpropagation Through Time (BPTT).

⚙️ How It Works: The loss is summed across time steps, and gradients flow backward through recurrent links — multiplying Jacobians along the way.

🎯 Why It Matters: Understanding BPTT is essential to grasp why RNNs struggle with long-term dependencies and how LSTMs and GRUs were designed to fix it.
