4. Implement Gradient Descent from Scratch
🪄 Step 1: Intuition & Motivation
Core Idea: Understanding Gradient Descent conceptually is great — but implementing it solidifies your intuition. You’ll see how math becomes movement: how weights actually shift, how the loss decreases, and what happens when your learning rate misbehaves.
Simple Analogy: Imagine you’re teaching a robot to find the lowest point in a valley. You give it eyes (the cost function), legs (the update rule), and speed control (the learning rate). Implementing Gradient Descent is like writing that robot’s “move downhill” program — so it can learn by itself.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Each iteration of Gradient Descent does four key things:
- Predict: Compute model outputs $h_\theta(x)$ using the current parameters.
- Compare: Measure how far off these predictions are from the actual labels.
- Compute Gradient: Calculate how each parameter affects the total error.
- Update: Move parameters in the direction that reduces the cost.
Then, repeat this loop until the cost stops decreasing (convergence).
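Here is a minimal sketch of that loop in NumPy. The `predict`, `cost`, and `gradient` callables are placeholders of my own naming, to be filled in per model (Step 3 restates them for Linear and Logistic Regression), and the tolerance-based stopping rule is one common way to detect that the cost has stopped decreasing:

```python
import numpy as np

def gradient_descent(X, y, predict, cost, gradient, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Generic loop: predict, compare, compute gradient, update, until convergence."""
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        h = predict(X, theta)        # 1. Predict with the current parameters
        J = cost(h, y)               # 2. Compare predictions to the labels
        grad = gradient(X, h, y)     # 3. Compute how each parameter affects the error
        theta -= alpha * grad        # 4. Step opposite the gradient
        if prev_cost - J < tol:      # convergence: the cost has stopped decreasing
            break
        prev_cost = J
    return theta
```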
Why It Works This Way
The gradient $\nabla_\theta J(\theta)$ points in the direction of steepest increase of the cost, so stepping against it is the locally fastest way to reduce the loss. For convex losses like MSE, taking small enough steps repeatedly drives the cost toward its global minimum.
How It Fits in ML Thinking
This predict-compare-update loop is the template for nearly all model training: with different hypotheses and loss functions, the same loop trains everything from linear models to deep neural networks.
📐 Step 3: Mathematical Foundation
Let’s restate the key equations in action form.
Linear Regression (MSE Loss)
Predictions:
$$ h_\theta(X) = X\theta $$
Loss Function:
$$ J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y) $$
Gradient:
$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T(X\theta - y) $$
Update Rule:
$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$
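A sketch of how these four equations translate directly into NumPy, assuming a design matrix `X` with a bias column already appended and a target vector `y` (the function name and the returned `history` are illustrative choices):

```python
import numpy as np

def linear_gd(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for Linear Regression with MSE loss."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        h = X @ theta                        # predictions: h_theta(X) = X theta
        J = (h - y) @ (h - y) / (2 * m)      # loss: (1/2m)(X theta - y)^T (X theta - y)
        grad = X.T @ (h - y) / m             # gradient: (1/m) X^T (X theta - y)
        theta -= alpha * grad                # update: theta := theta - alpha * grad
        history.append(J)
    return theta, history
```

Plotting `history` is the quickest sanity check: on standardized features with a small $\alpha$, it should decrease steadily.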
Logistic Regression (Cross-Entropy Loss)
Predictions:
$$ h_\theta(X) = \sigma(X\theta) = \frac{1}{1 + e^{-X\theta}} $$
Loss Function:
$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log h_\theta(x_i) + (1 - y_i)\log(1 - h_\theta(x_i))\right] $$
Gradient:
$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T(h_\theta(X) - y) $$
Update Rule:
$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$
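A matching sketch for the logistic case under the same assumptions, with a small `eps` inside the logs as a numerical guard against `log(0)` (a safeguard I've added; the equations themselves don't require it):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, n_iters=1000, eps=1e-12):
    """Batch gradient descent for Logistic Regression with cross-entropy loss."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        h = sigmoid(X @ theta)               # predictions: sigma(X theta)
        J = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
        grad = X.T @ (h - y) / m             # same form as the linear gradient
        theta -= alpha * grad
        history.append(J)
    return theta, history
```

Note that the gradient line is identical to the linear version; only the prediction and the loss change.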
🧠 Step 4: Assumptions or Key Ideas
- Feature scaling is crucial: large-magnitude features dominate the gradient and slow convergence.
- Learning rate $\alpha$ must be tuned experimentally — start small, visualize the loss curve.
- Initialization of $\theta$ can impact speed but not the final solution (for convex losses).
- Use vectorized math — operations on entire arrays rather than Python loops — for massive speed-ups. Replacing `for i in range(m): grad += (h[i]-y[i])*x[i]` with `grad = X.T @ (h - y)` is a game-changer.
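To make that claim concrete, here is an illustrative micro-benchmark on random data (the sizes and repeat counts are arbitrary):

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
m, n = 5000, 20
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)
h = rng.random(m)

def loop_grad():
    grad = np.zeros(n)
    for i in range(m):                 # one Python-level iteration per sample
        grad += (h[i] - y[i]) * X[i]
    return grad / m

def vec_grad():
    return X.T @ (h - y) / m           # one BLAS-backed matrix product

assert np.allclose(loop_grad(), vec_grad())    # identical results...
print(timeit.timeit(loop_grad, number=10))     # ...but the loop is typically
print(timeit.timeit(vec_grad, number=10))      # orders of magnitude slower
```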
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Easy to implement — solidifies theory.
- Works for both Linear and Logistic Regression.
- Vectorization scales well for mid-size datasets.
Limitations:
- Slow on massive datasets (millions of samples).
- Choosing $\alpha$ is tricky — bad choices can stall or explode.
- Sensitive to poorly scaled features or extreme outliers.
🚧 Step 6: Common Misunderstandings
- “More iterations always improve accuracy.” Not true — after convergence, further updates may just bounce around the minimum due to noise or learning rate issues.
- “Gradient Descent is slow because of the math.” In practice it’s usually slow because of the implementation — unvectorized loops or unscaled features.
- “The loss curve must always go down smoothly.” Not necessarily! Small oscillations are normal — they show learning in progress, especially in stochastic settings.
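To see the learning-rate point concretely, here is a small experiment on synthetic data (the data and $\alpha$ values are made up for illustration): a moderate step size produces a steadily shrinking loss, while an oversized one makes the loss grow instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]    # bias column + one feature
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

def loss_curve(alpha, n_iters=50):
    m = len(y)
    theta = np.zeros(2)
    losses = []
    for _ in range(n_iters):
        h = X @ theta
        losses.append((h - y) @ (h - y) / (2 * m))
        theta -= alpha * X.T @ (h - y) / m
    return losses

print(loss_curve(alpha=0.1)[-1])    # small: converging
print(loss_curve(alpha=2.5)[-1])    # huge: the step overshoots and diverges
```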
🧩 Step 7: Mini Summary
🧠 What You Learned: You’ve now turned Gradient Descent from a formula into a working loop — predicting, comparing, correcting, and repeating until the model stabilizes.
⚙️ How It Works: Each iteration computes the gradient ($X^T(X\theta - y)$ or $X^T(h_\theta(X) - y)$) and updates $\theta$ by stepping opposite the slope, scaled by $\alpha$.
🎯 Why It Matters: Implementing it once builds intuition that no framework can replace — understanding how a model learns makes you a far stronger ML engineer.