2.3. Chain Rule & Backpropagation


🪄 Step 1: Intuition & Motivation

  • Core Idea: Backpropagation is just the chain rule wearing a hard hat. We compute how a tiny nudge to an input or a weight ripples forward through the model, then send the blame backward to update each weight appropriately.

  • Simple Analogy: Imagine a factory assembly line that builds a toy in stages. If the final toy has a flaw, you trace back through the stations to see who contributed what amount to the mistake. That trace-back is backpropagation; the math that allows it is the chain rule.


🌱 Step 2: Core Concept

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">What’s Happening Under the Hood?</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <ol>
<li><strong>Forward pass</strong>: Inputs move through layers to produce an output and a loss, while we quietly store the “intermediate outputs” each station produced.</li>
<li><strong>Backward pass</strong>: Starting from the loss, we use the <strong>chain rule</strong> to pass responsibility (gradients) backward: each node/layer receives a gradient from the next layer and multiplies it by its own local derivative.</li>
<li><strong>Update</strong>: We change each parameter a tiny bit opposite the gradient to reduce the loss next time.</li>
</ol>

  </div>
</details>
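To make these three steps concrete, here is a minimal sketch for a single weight with a squared-error loss (the values of `w`, `x`, `target`, and `lr` are arbitrary illustrations):

```python
# Forward pass: compute prediction and loss, caching the intermediate y.
w, x, target, lr = 0.5, 2.0, 3.0, 0.1

y = w * x                    # cached "station output"
loss = (y - target) ** 2     # scalar loss

# Backward pass: chain rule, from the loss back to the weight.
dL_dy = 2 * (y - target)     # local derivative of the loss w.r.t. y
dy_dw = x                    # local derivative of y = w*x w.r.t. w
dL_dw = dL_dy * dy_dw        # chain rule: multiply along the path

# Update: nudge the weight a tiny bit opposite the gradient.
w = w - lr * dL_dw
print(w, loss)               # prints 1.3 4.0; the next forward pass yields a lower loss
```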


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Why It Works This Way</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p>The chain rule decomposes a complicated dependency into small, local pieces. Every node only needs:</p>
<ul>
<li>the gradient arriving from “above” (its output’s gradient), and</li>
<li>its <strong>local derivative</strong> (how its output changes with respect to its inputs/parameters).
Multiplying these gives the gradient with respect to its inputs and parameters. Local, simple math adds up to a global, powerful update.</li>
</ul>
  </div>
</details>


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">How It Fits in ML Thinking</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    Backprop lets deep models <strong>learn end-to-end</strong>: errors at the end teach early layers how to shape their representations. It’s the bridge from <em>loss signals</em> to <em>feature learning</em> across dozens (or hundreds) of layers.
  </div>
</details>

📐 Step 3: Mathematical Foundation

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Chain Rule: Single Path</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p>If $y = f(u)$ and $u = g(x)$, then
$ \frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} $.</p>
<p><strong>Intuitive read</strong>: change in $x$ affects $u$, which affects $y$. Multiply the local effects along the path.</p>
<div class="hx-overflow-x-auto hx-mt-6 hx-flex hx-rounded-lg hx-border hx-py-2 ltr:hx-pr-4 rtl:hx-pl-4 contrast-more:hx-border-current contrast-more:dark:hx-border-current hx-border-orange-100 hx-bg-orange-50 hx-text-orange-800 dark:hx-border-orange-400/30 dark:hx-bg-orange-400/20 dark:hx-text-orange-300">
  <div class="ltr:hx-pl-3 ltr:hx-pr-2 rtl:hx-pr-3 rtl:hx-pl-2"></div>
  <div class="hx-w-full hx-min-w-0 hx-leading-7">
    <div class="hx-mt-6 hx-leading-7 first:hx-mt-0">Think “links in a chain”: each link contributes a factor; the total effect is the product.</div>
  </div>
</div>

  </div>
</details>
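A quick numeric sanity check of the single-path rule, using the illustrative composition $y = \sin(x^2)$ and comparing the chain-rule product against a finite-difference estimate:

```python
import math

# y = f(u) = sin(u), u = g(x) = x**2, so dy/dx = cos(x**2) * 2x.
x = 1.3
analytic = math.cos(x ** 2) * 2 * x          # product of the two local factors

# Central finite difference, used here only to verify the analytic result.
h = 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)

print(analytic, numeric)                     # the two values agree closely
```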


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Multi-Variable Chain Rule (Vector Form)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p>For $\mathbf{z}=g(\mathbf{x}) \in \mathbb{R}^m$ and $y=f(\mathbf{z}) \in \mathbb{R}$:
</p>
$$ \nabla_{\mathbf{x}} y \;=\; J_{g}(\mathbf{x})^\top \,\nabla_{\mathbf{z}} y $$<p>
where $J_g$ is the <strong>Jacobian</strong> of $g$.</p>
<p><strong>Reading</strong>: to know how $x$ changes $y$, first see how $z$ changes $y$ (incoming gradient), then project that back through how $x$ changes $z$ (Jacobian).</p>

  </div>
</details>
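A small NumPy sketch of the vector form, assuming the illustrative choices $g(\mathbf{x}) = A\mathbf{x}$ (so $J_g = A$) and $y = \lVert \mathbf{z} \rVert^2$ (so $\nabla_{\mathbf{z}} y = 2\mathbf{z}$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))          # z = A @ x, so the Jacobian of g is simply A
x = rng.normal(size=4)

z = A @ x
grad_z = 2 * z                       # incoming gradient: grad of y w.r.t. z
grad_x = A.T @ grad_z                # J_g(x)^T applied to the incoming gradient

# Check against the direct derivative of y(x) = x^T A^T A x, which is 2 A^T A x.
print(np.allclose(grad_x, 2 * A.T @ A @ x))   # True
```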


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Layerwise Backprop: Linear → Activation</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p>Consider a common block:</p>
<ul>
<li>Linear: $ \mathbf{z} = W\mathbf{x} + \mathbf{b} $</li>
<li>Activation: $ \mathbf{a} = \phi(\mathbf{z}) $</li>
<li>Loss: $ L(\mathbf{a}) $</li>
</ul>
<p>Backward:</p>
<ol>
<li>From loss to activation: $ \frac{\partial L}{\partial \mathbf{a}} $ (given by upstream).</li>
<li>Activation local derivative: $ \frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{a}} \odot \phi'(\mathbf{z}) $</li>
<li>Linear params:
<ul>
<li>$ \frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial \mathbf{z}}\right)\mathbf{x}^\top $</li>
<li>$ \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}} $</li>
<li>$ \frac{\partial L}{\partial \mathbf{x}} = W^\top \left(\frac{\partial L}{\partial \mathbf{z}}\right) $</li>
</ul>
</li>
</ol>
<p>($\odot$ is elementwise multiplication.)</p>
<div class="hx-overflow-x-auto hx-mt-6 hx-flex hx-rounded-lg hx-border hx-py-2 ltr:hx-pr-4 rtl:hx-pl-4 contrast-more:hx-border-current contrast-more:dark:hx-border-current hx-border-orange-100 hx-bg-orange-50 hx-text-orange-800 dark:hx-border-orange-400/30 dark:hx-bg-orange-400/20 dark:hx-text-orange-300">
  <div class="ltr:hx-pl-3 ltr:hx-pr-2 rtl:hx-pr-3 rtl:hx-pl-2"></div>
  <div class="hx-w-full hx-min-w-0 hx-leading-7">
    <div class="hx-mt-6 hx-leading-7 first:hx-mt-0">Each weight in $W$ is updated in proportion to “how much its input $\mathbf{x}$ showed up” times “how much error sensitivity reached this neuron.”</div>
  </div>
</div>

  </div>
</details>
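A minimal NumPy sketch of this block, assuming a ReLU activation and single-example column vectors; the squared-norm loss in the usage snippet exists only to supply an upstream gradient $\partial L/\partial \mathbf{a}$:

```python
import numpy as np

def forward(W, b, x):
    """Linear -> ReLU block; returns the activation plus cached intermediates."""
    z = W @ x + b
    a = np.maximum(0.0, z)                    # phi(z) = ReLU(z)
    return a, (x, z)

def backward(W, dL_da, cache):
    """Apply the three formulas above, given the upstream gradient dL/da."""
    x, z = cache
    dL_dz = dL_da * (z > 0).astype(z.dtype)   # dL/da ⊙ phi'(z)
    dL_dW = np.outer(dL_dz, x)                # (dL/dz) x^T
    dL_db = dL_dz                             # dL/db equals dL/dz
    dL_dx = W.T @ dL_dz                       # W^T (dL/dz)
    return dL_dW, dL_db, dL_dx

# Usage: L = 0.5 * ||a||^2, so the upstream gradient dL/da is simply a.
rng = np.random.default_rng(1)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
a, cache = forward(W, b, x)
dW, db, dx = backward(W, a, cache)
print(dW.shape, db.shape, dx.shape)           # (3, 4) (3,) (4,)
```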


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Non-Smooth Activations &amp; Subgradients (ReLU)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    ReLU: $ \phi(z)=\max(0,z) $ has
$ \phi'(z) = 1 $ if $z>0$, and $0$ if $z<0$; at $z=0$ we pick a <strong>subgradient</strong> in $[0,1]$.
This keeps gradients flowing where units are active and blocks them where they’re off.
  </div>
</details>
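In code, the usual convention is to pick the subgradient $0$ at $z=0$; a one-line sketch:

```python
import numpy as np

def relu_subgradient(z):
    # 1 where z > 0, 0 where z < 0; at z == 0 any value in [0, 1] is a valid
    # subgradient, and the common convention chosen here is 0.
    return (z > 0).astype(float)

print(relu_subgradient(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]
```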

🧠 Step 4: Assumptions or Key Ideas

- Computation graph is a **DAG** (no cycles during a single pass); we can topologically order nodes.
- Each node exposes a **local derivative** that is cheap to compute.
- Intermediate values from the forward pass are cached for efficient backward computation.

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Scales to very deep models via local, composable derivatives.
  • Reverse-mode (backprop) is efficient when there are many parameters and a single scalar loss.
  • Works with any differentiable building block you can compose.
  • Vanishing/exploding gradients in deep chains can stall or destabilize training.
  • Requires storing intermediates (memory-heavy for large nets).
  • Nondifferentiable points need subgradients/approximations.
  • Stability vs. Expressiveness: smoother activations (e.g., GELU) improve gradient flow; sharper ones (e.g., hard thresholds) can hurt it but may yield sparsity.
  • Memory vs. Speed: checkpointing trades recomputation for memory; mixed precision trades numeric range for throughput.

🚧 Step 6: Common Misunderstandings (Optional)

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">🚨 Common Misunderstandings (Click to Expand)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <ul>
<li>“Backprop is separate from the chain rule.”<br>
→ It <strong>is</strong> the chain rule, organized along a graph.</li>
<li>“Order doesn’t matter in the backward pass.”<br>
→ It does. We must follow <strong>reverse topological order</strong> so every node receives correct upstream gradients.</li>
<li>“Automatic differentiation = symbolic differentiation.”<br>
→ AD runs on <strong>values</strong> with recorded ops; it’s neither symbolic algebra nor simple finite differences.</li>
</ul>

  </div>
</details>

📐 Automatic Differentiation (AD) — The Practical Engine

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Forward- vs Reverse-Mode AD</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <ul>
<li><strong>Forward-mode</strong>: push derivatives from inputs → outputs (good when inputs ≪ outputs).</li>
<li><strong>Reverse-mode</strong> (backprop): pull derivatives from scalar loss ← parameters (good when outputs ≪ parameters), which is typical in deep learning.</li>
</ul>

  </div>
</details>
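A minimal dual-number sketch of forward-mode AD (illustrative only, not any framework's API): each pass propagates the sensitivity of a single input, which is why reverse mode wins when one scalar loss depends on millions of parameters.

```python
class Dual:
    """Forward-mode AD value: carries f(x) and df/dx through every op."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# Seed d/dx = 1 for the input we care about and push it forward.
x = Dual(3.0, 1.0)          # track sensitivity to x
y = Dual(4.0, 0.0)          # y is treated as a constant in this pass
f = x * x + x * y           # f = x^2 + x*y, so df/dx = 2x + y = 10
print(f.val, f.dot)         # 21.0 10.0
```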


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Why Graph Ordering Matters</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    Frameworks record a <strong>computation graph</strong> during the forward pass.<br>
Backward uses <strong>reverse topological order</strong> so each node’s gradient is computed only after all its dependents have provided upstream gradients—minimizing redundant work and memory thrash.
  </div>
</details>
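A minimal tape sketch (not any particular framework's API): the forward pass records ops in topological order, so walking the tape backwards yields a reverse topological order and every node's gradient is complete before it is consumed.

```python
values, tape = {}, []              # node name -> value, plus the recorded ops

def record(out, inputs, local_grads):
    """local_grads: given the upstream gradient g, return one gradient per input."""
    tape.append((out, inputs, local_grads))

# Forward pass for L = (x * w + b)**2, recording each op as it runs.
values.update(x=2.0, w=0.5, b=-0.5)
values["z"] = values["x"] * values["w"]
record("z", ("x", "w"), lambda g: (g * values["w"], g * values["x"]))
values["y"] = values["z"] + values["b"]
record("y", ("z", "b"), lambda g: (g, g))
values["L"] = values["y"] ** 2
record("L", ("y",), lambda g: (g * 2 * values["y"],))

# Backward pass: reversing the recording order is a reverse topological order.
grads = {name: 0.0 for name in values}
grads["L"] = 1.0
for out, inputs, local_grads in reversed(tape):
    for name, g in zip(inputs, local_grads(grads[out])):
        grads[name] += g           # accumulate: a node may feed several consumers

print(grads["w"], grads["x"], grads["b"])   # 2.0 0.5 1.0
```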

🧩 Step 7: Mini Summary

🧠 What You Learned: Backpropagation is reverse-mode automatic differentiation powered by the chain rule; it assigns responsibility for the loss to every parameter.

⚙️ How It Works: Cache forward intermediates → apply local derivatives → multiply by incoming gradients → propagate backward in reverse topological order.

🎯 Why It Matters: This mechanism enables efficient training of modern deep networks by turning a global objective into local, tractable updates.
