2. Derive the Gradient Descent Update Rule


🪄 Step 1: Intuition & Motivation

  • Core Idea (in 1 short paragraph):
    To make a model better, we need a rule that says how to nudge the parameters so the cost goes down. The gradient tells us the slope of the cost with respect to each parameter — like a compass pointing uphill. Stepping in the opposite direction by a small amount $\alpha$ moves us downhill. Repeat this gently and you reach the bottom.

  • Simple Analogy (only if needed):
    Picture adjusting a shower knob for the perfect temperature. You twist a little (a small update), feel if it got hotter or colder (the gradient sign), then twist the other way if needed. The learning rate $\alpha$ is how big each twist is.


🌱 Step 2: Core Concept

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">What’s Happening Under the Hood?</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p>We compute how sensitive the cost is to each parameter (these sensitivities are the partial derivatives).<br>
Then we reduce each parameter by a small fraction of that sensitivity:
</p>
$$ \theta_j := \theta_j - \alpha \,\frac{\partial J(\theta)}{\partial \theta_j}. $$<p>
Do this for all parameters, recompute the cost, and repeat — that’s the learning loop.</p>

  </div>
</details>
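To make that loop concrete, here is a minimal Python sketch (the names `gradient_descent` and `cost_grad` are mine, not from the text; `cost_grad` stands for any function that returns $\partial J/\partial \theta$ at the current parameters):

```python
import numpy as np

def gradient_descent(theta, cost_grad, alpha=0.01, n_iters=1000):
    """Repeatedly apply theta_j := theta_j - alpha * dJ/dtheta_j for every j at once."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        grad = cost_grad(theta)        # sensitivities of the cost to each parameter
        theta = theta - alpha * grad   # small step downhill
    return theta
```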


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Why It Works This Way</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    The gradient points in the direction of <em>steepest increase</em> of the cost.<br>
Stepping in the opposite direction therefore decreases the cost fastest for a sufficiently small step size, a standard first-order result from calculus (a quick sketch of the argument follows this box).
  </div>
</details>
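For readers who want the one-line argument behind that claim (a standard first-order Taylor expansion, not spelled out in the original text): for a small step against the gradient,

$$ J\big(\theta - \alpha\,\nabla J(\theta)\big) \;\approx\; J(\theta) - \alpha\,\big\lVert \nabla J(\theta)\big\rVert^2 \;\le\; J(\theta), $$

so as long as the gradient is nonzero and $\alpha$ is small enough, the cost decreases.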


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" open>
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">How It Fits in ML Thinking</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    This update rule is the backbone of training: define a cost → compute gradient → update parameters → repeat.<br>
Linear and logistic regression are friendly arenas to learn this before moving on to larger models.
  </div>
</details>

📐 Step 3: Mathematical Foundation

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">General Gradient Descent Update</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    $$ \theta_j := \theta_j - \alpha \,\frac{\partial J(\theta)}{\partial \theta_j} $$<ul>
<li>$\theta_j$: the $j$-th parameter (including bias if present).</li>
<li>$\alpha$: learning rate (step size).</li>
<li>$\frac{\partial J}{\partial \theta_j}$: how much the cost changes if we nudge $\theta_j$ a tiny bit.</li>
</ul>
<div class="hx-overflow-x-auto hx-mt-6 hx-flex hx-rounded-lg hx-border hx-py-2 ltr:hx-pr-4 rtl:hx-pl-4 contrast-more:hx-border-current contrast-more:dark:hx-border-current hx-border-orange-100 hx-bg-orange-50 hx-text-orange-800 dark:hx-border-orange-400/30 dark:hx-bg-orange-400/20 dark:hx-text-orange-300">
  <div class="ltr:hx-pl-3 ltr:hx-pr-2 rtl:hx-pr-3 rtl:hx-pl-2"></div>
  <div class="hx-w-full hx-min-w-0 hx-leading-7">
    <div class="hx-mt-6 hx-leading-7 first:hx-mt-0">Think “cost thermometer”: a big positive derivative means “you’re too high — step down more”; a small derivative means “you’re nearly there — step lightly.”</div>
  </div>
</div>

  </div>
</details>
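To make the box above concrete with made-up numbers: suppose $\theta_j = 2$, $\alpha = 0.1$, and $\frac{\partial J}{\partial \theta_j} = 4$. One update gives

$$ \theta_j := 2 - 0.1 \times 4 = 1.6, $$

whereas a derivative of $0.1$ would have moved $\theta_j$ by only $0.01$: a steep slope produces a bigger nudge.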


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Linear Regression Gradient (with MSE)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p><strong>Cost:</strong><br>
</p>
$$ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x_i)-y_i\big)^2,\quad h_\theta(x_i)=\theta^\top x_i. $$<p><strong>Partial derivative:</strong><br>
</p>
$$ \frac{\partial J}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x_i)-y_i\big)\,x_{ij}. $$<ul>
<li>Each term $\big(h_\theta(x_i)-y_i\big)$ is the prediction error.</li>
<li>Multiply by the feature $x_{ij}$ to attribute that error to parameter $\theta_j$.</li>
<li>Average over $m$ to keep updates stable across dataset size.</li>
</ul>
<div class="hx-overflow-x-auto hx-mt-6 hx-flex hx-rounded-lg hx-border hx-py-2 ltr:hx-pr-4 rtl:hx-pl-4 contrast-more:hx-border-current contrast-more:dark:hx-border-current hx-border-orange-100 hx-bg-orange-50 hx-text-orange-800 dark:hx-border-orange-400/30 dark:hx-bg-orange-400/20 dark:hx-text-orange-300">
  <div class="ltr:hx-pl-3 ltr:hx-pr-2 rtl:hx-pr-3 rtl:hx-pl-2"></div>
  <div class="hx-w-full hx-min-w-0 hx-leading-7">
    <div class="hx-mt-6 hx-leading-7 first:hx-mt-0">If a feature $x_{ij}$ is often positive when the model overpredicts, the gradient for its weight will be positive — telling us to <em>decrease</em> that weight.</div>
  </div>
</div>

  </div>
</details>
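A minimal NumPy sketch of this gradient (variable names are my own; `X` is assumed to be an $m \times n$ matrix that already includes a column of ones for the bias):

```python
import numpy as np

def linreg_gradient(theta, X, y):
    """Gradient of J = (1/2m) * sum((X @ theta - y)**2), i.e. (1/m) * X^T (X theta - y)."""
    m = X.shape[0]
    residuals = X @ theta - y        # h_theta(x_i) - y_i: the prediction errors
    return (X.T @ residuals) / m     # weight each error by its feature, then average
```

Plugging this in as the `cost_grad` of the generic loop sketched earlier gives batch gradient descent for linear regression.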


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Logistic Regression Gradient (with Cross-Entropy)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <p><strong>Model &amp; cost:</strong><br>
</p>
$$ h_\theta(x_i)=\sigma(\theta^\top x_i),\quad \sigma(z)=\frac{1}{1+e^{-z}}, $$<p>
</p>
$$ J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y_i\log h_\theta(x_i)+(1-y_i)\log\big(1-h_\theta(x_i)\big)\Big]. $$<p><strong>Partial derivative (via chain rule):</strong><br>
</p>
$$ \frac{\partial J}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x_i)-y_i\big)\,x_{ij}. $$<ul>
<li>Same <em>form</em> as linear regression, but $h_\theta(x_i)$ is a <strong>probability</strong> from the sigmoid.</li>
<li>The chain rule collapses neatly so the residual becomes $(\text{predicted prob} - \text{label})$.</li>
</ul>
<div class="hx-overflow-x-auto hx-mt-6 hx-flex hx-rounded-lg hx-border hx-py-2 ltr:hx-pr-4 rtl:hx-pl-4 contrast-more:hx-border-current contrast-more:dark:hx-border-current hx-border-orange-100 hx-bg-orange-50 hx-text-orange-800 dark:hx-border-orange-400/30 dark:hx-bg-orange-400/20 dark:hx-text-orange-300">
  <div class="ltr:hx-pl-3 ltr:hx-pr-2 rtl:hx-pr-3 rtl:hx-pl-2"></div>
  <div class="hx-w-full hx-min-w-0 hx-leading-7">
    <div class="hx-mt-6 hx-leading-7 first:hx-mt-0">If the model says “$0.9$” but the truth is “$0$”, the residual $0.9-0$ is large — the gradient asks for a strong corrective nudge.</div>
  </div>
</div>

  </div>
</details>
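And the logistic counterpart, under the same assumptions; the only change from the linear case is that predictions first pass through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient(theta, X, y):
    """Cross-entropy gradient; the chain rule collapses it to (1/m) * X^T (sigmoid(X theta) - y)."""
    m = X.shape[0]
    probs = sigmoid(X @ theta)       # h_theta(x_i): predicted probabilities in (0, 1)
    return (X.T @ (probs - y)) / m   # residual = predicted probability minus the 0/1 label
```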


<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">Why the $\tfrac{1}{m}$ Scaling?</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    $$ \frac{1}{m}\sum_{i=1}^{m}(\cdots) $$<ul>
<li>Normalizes the gradient so its scale doesn’t explode with dataset size.</li>
<li>Makes learning rate $\alpha$ meaningfully comparable across datasets.</li>
<li>Produces smoother, more stable updates.</li>
</ul>

  </div>
</details>
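A quick toy check of this point (random data and illustrative names of my own): the summed gradient grows with $m$, while the averaged one stays on a comparable scale, so the same $\alpha$ keeps working.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)

for m in (100, 100_000):
    X = rng.normal(size=(m, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    residuals = X @ theta - y
    summed = X.T @ residuals     # no 1/m: magnitude grows with dataset size
    averaged = summed / m        # with 1/m: stays on a stable scale
    print(m, np.linalg.norm(summed), np.linalg.norm(averaged))
```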

🧠 Step 4: Assumptions or Key Ideas (if applicable)

- The cost is differentiable w.r.t. each parameter.
- For these linear models, the cost surface is convex → one global minimum (no “getting stuck” issue).
- Step size $\alpha$ must be small enough to avoid overshooting but large enough to make progress.

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Strengths:
      • Simple, universal recipe for improving parameters.
      • The same template works for many models (just change $h_\theta$ and $J$).
      • Averaging over $m$ stabilizes updates.
  • Limitations:
      • A poorly chosen $\alpha$ can stall or diverge.
      • Unscaled features can slow convergence dramatically.
      • Sensitive to numerical issues (e.g., sigmoid extremes).

Choosing $\alpha$ is like choosing a walking pace on a slope: too fast → you trip (oscillate/diverge); too slow → you crawl (slow learning). Feature scaling often “flattens” the terrain so a wider range of $\alpha$ works; a minimal scaling sketch follows below.
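One practical mitigation mentioned above is feature scaling; here is a minimal standardization sketch (the helper name and the zero-mean/unit-variance choice are my own assumptions, not prescribed by the text):

```python
import numpy as np

def standardize(X):
    """Rescale each feature to zero mean and unit variance so the cost surface is
    closer to round; reuse the returned mu and sigma on new data at prediction time."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant columns
    return (X - mu) / sigma, mu, sigma
```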

🚧 Step 6: Common Misunderstandings (Optional)

<details class="last-of-type:hx-mb-0 hx-rounded-lg hx-bg-neutral-50 dark:hx-bg-neutral-800 hx-p-2 hx-mt-4 hx-group" >
  <summary class="hx-flex hx-items-center hx-cursor-pointer hx-select-none hx-list-none hx-p-1 hx-rounded hx-transition-colors hover:hx-bg-gray-100 dark:hover:hx-bg-neutral-800 before:hx-mr-1 before:hx-inline-block before:hx-transition-transform before:hx-content-[''] dark:before:hx-invert rtl:before:hx-rotate-180 group-open:before:hx-rotate-90">
    <strong class="hx-text-lg">🚨 Common Misunderstandings (Click to Expand)</strong>
  </summary>
  <div class="hx-p-2 hx-overflow-hidden">
    <ul>
<li>
<p><strong>“Oscillation means I need more iterations.”</strong><br>
If your loss bounces up and down, the fix is usually <strong>smaller $\alpha$</strong>, not more loops.</p>
</li>
<li>
<p><strong>“Linear and logistic gradients must look different.”</strong><br>
They share the same residual form $h_\theta(x)-y$ because, when you apply the chain rule, the sigmoid’s derivative cancels against the log terms of cross-entropy, leaving the same $(\text{prediction}-\text{label})\,x_{ij}$ pattern that MSE produces for the linear model.</p>
</li>
<li>
<p><strong>“The $1/m$ is cosmetic.”</strong><br>
It affects stability and the effective step size — removing it makes $\alpha$ dataset-size dependent.</p>
</li>
</ul>

  </div>
</details>

🧩 Step 7: Mini Summary

🧠 What You Learned:
The update rule moves each parameter in the direction that reduces the cost, scaled by a careful step size $\alpha$.

⚙️ How It Works:
Compute partial derivatives of the cost, average over data, then apply $\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial J}{\partial \theta_j}$.

🎯 Why It Matters:
This rule is the engine of learning — get the gradient and step size right, and your model improves predictably.
