4. Implement Gradient Descent from Scratch


🪄 Step 1: Intuition & Motivation

  • Core Idea: Understanding Gradient Descent conceptually is great — but implementing it solidifies your intuition. You’ll see how math becomes movement: how weights actually shift, how the loss decreases, and what happens when your learning rate misbehaves.

  • Simple Analogy: Imagine you’re teaching a robot to find the lowest point in a valley. You give it eyes (the cost function), legs (the update rule), and speed control (the learning rate). Implementing Gradient Descent is like writing that robot’s “move downhill” program — so it can learn by itself.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Each iteration of Gradient Descent does four key things:

  1. Predict: Compute model outputs $h_\theta(x)$ using the current parameters.
  2. Compare: Measure how far off these predictions are from the actual labels.
  3. Compute Gradient: Calculate how each parameter affects the total error.
  4. Update: Move parameters in the direction that reduces the cost.

Then, repeat this loop until the cost stops decreasing (convergence).
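The four-step loop above can be sketched as a minimal NumPy implementation for the linear-regression case (an illustrative sketch; the function name and defaults are my own choices, not a canonical API):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimal batch gradient descent for linear regression (MSE loss)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = X @ theta                 # 1. Predict
        error = h - y                 # 2. Compare
        grad = (X.T @ error) / m      # 3. Compute gradient
        theta -= alpha * grad         # 4. Update
    return theta
```

In practice you would also track the loss each iteration and stop early once it plateaus, rather than always running a fixed number of iterations.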

Why It Works This Way

Every update step pushes the parameters toward a configuration that makes predictions slightly less wrong. Even though each individual step is tiny, repeating this process hundreds or thousands of times transforms the model from “clueless” to “accurate.”

How It Fits in ML Thinking

Gradient Descent implementation bridges theory and engineering. It’s not just knowing what the equation says — it’s about making the machine actually learn. That’s why implementing this once from scratch builds an intuition that frameworks like PyTorch or TensorFlow later make automatic.

📐 Step 3: Mathematical Foundation

Let’s restate the key equations in action form.

Linear Regression (MSE Loss)

Predictions:

$$ h_\theta(X) = X\theta $$

Loss Function:

$$ J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y) $$

Gradient:

$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T(X\theta - y) $$

Update Rule:

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$
We’re just adjusting each weight in proportion to how much it contributes to the prediction error. Vectorization ($X^T(X\theta - y)$) lets us compute all gradients in one shot — no loops needed!
Logistic Regression (Cross-Entropy Loss)

Predictions:

$$ h_\theta(X) = \sigma(X\theta) = \frac{1}{1 + e^{-X\theta}} $$

Loss Function:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i\log h_\theta(x_i) + (1 - y_i)\log(1 - h_\theta(x_i))] $$

Gradient:

$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T(h_\theta(X) - y) $$

Update Rule:

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$
For logistic regression, the error term keeps the same $(h_\theta(X) - y)$ form, but $h_\theta$ is now a probability, so each update pushes the weights based on both the model’s confidence and its correctness.
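These equations translate almost directly into code — only the prediction step changes from the linear case (a sketch under the same assumptions as before; `sigmoid` and the function name are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression (cross-entropy loss)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)         # predictions are probabilities in (0, 1)
        grad = (X.T @ (h - y)) / m     # same vectorized form as the linear case
        theta -= alpha * grad
    return theta
```

Notice that the gradient line is identical to the linear-regression version; only the definition of $h$ changed. That shared structure is why both models fit the same training loop.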

🧠 Step 4: Assumptions or Key Ideas

  • Feature scaling is crucial: large-magnitude features dominate the gradient and slow convergence.
  • Learning rate $\alpha$ must be tuned experimentally — start small, visualize the loss curve.
  • Initialization of $\theta$ can impact speed but not the final solution (for convex losses).
  • Use vectorized math — operations on entire arrays rather than Python loops — for massive speed-ups.
ℹ️
Python loops operate element by element, but NumPy executes whole-array operations in optimized C — up to 100x faster. That’s why replacing `for i in range(m): grad += (h[i]-y[i])*x[i]` with `grad = X.T @ (h - y)` is a game-changer.
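You can verify for yourself that the vectorized expression computes exactly the same gradient as the element-wise loop (an illustrative comparison on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = rng.standard_normal(1000)
theta = rng.standard_normal(3)
h = X @ theta

# Element-wise Python loop: one sample at a time
grad_loop = np.zeros(3)
for i in range(len(y)):
    grad_loop += (h[i] - y[i]) * X[i]
grad_loop /= len(y)

# Vectorized NumPy equivalent: one matrix product
grad_vec = X.T @ (h - y) / len(y)

print(np.allclose(grad_loop, grad_vec))  # True
```

Timing the two versions with `timeit` on larger arrays makes the speed gap concrete; the results are numerically identical either way.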

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Easy to implement — solidifies theory.
  • Works for both Linear and Logistic Regression.
  • Vectorization scales well for mid-size datasets.
  • Slow on massive datasets (millions of samples).
  • Choosing $\alpha$ is tricky — bad choices can stall or explode.
  • Sensitive to poorly scaled features or extreme outliers.
Analytical solutions (like the Normal Equation $ \theta = (X^TX)^{-1}X^Ty $) are exact but computationally expensive when the number of features $n$ is large, since the matrix inversion scales as $O(n^3)$. Gradient Descent trades exactness for efficiency — it iterates fast, even if approximate.
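For small problems you can sanity-check the iterative solution against the closed-form one (a sketch; in practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.c_[np.ones(50), rng.standard_normal(50)]
y = X @ np.array([1.0, -2.0]) + 0.01 * rng.standard_normal(50)

# Normal Equation: exact, but solving the n x n system costs O(n^3) in features
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient Descent: iterative approximation of the same minimizer
theta = np.zeros(2)
for _ in range(5000):
    theta -= 0.1 * (X.T @ (X @ theta - y)) / len(y)

print(np.allclose(theta, theta_exact, atol=1e-4))  # True
```

Because the MSE loss is convex, both routes land on the same minimizer; they differ only in how they get there and what they cost.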

🚧 Step 6: Common Misunderstandings

  • “More iterations always improve accuracy.” Not true — after convergence, updates may start bouncing due to noise or learning rate issues.

  • “Gradient Descent is slow because of math.” It’s slow due to poor implementation — unvectorized loops or unscaled features.

  • “The loss curve must always go down smoothly.” Not necessarily! Small oscillations are normal — they show learning in progress, especially in stochastic settings.
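The stochastic case is easy to see for yourself: updating on one sample at a time uses a noisy gradient estimate, so the loss jitters while still trending down (a sketch; the data, step size, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.standard_normal(200)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(200)

theta, alpha = np.zeros(2), 0.05
losses = []
for epoch in range(50):
    err = X @ theta - y
    losses.append((err @ err) / (2 * len(y)))   # loss before this epoch's updates
    for i in rng.permutation(len(y)):           # stochastic: one sample at a time
        xi, yi = X[i], y[i]
        theta -= alpha * (xi @ theta - yi) * xi
```

Plotting `losses` shows a curve that wobbles epoch to epoch but drops sharply overall — exactly the “oscillation while learning” described above.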


🧩 Step 7: Mini Summary

🧠 What You Learned: You’ve now turned Gradient Descent from a formula into a working loop — predicting, comparing, correcting, and repeating until the model stabilizes.

⚙️ How It Works: Each iteration computes the gradient ($X^T(X\theta - y)$ or $X^T(h_\theta(X) - y)$) and updates $\theta$ by stepping opposite the slope, scaled by $\alpha$.

🎯 Why It Matters: Implementing it once builds intuition that no framework can replace — understanding how a model learns makes you a far stronger ML engineer.
