1.3 Train Using Gradient Descent (From Scratch)


🪄 Step 1: Intuition & Motivation

Core Idea: You’ve now met Logistic Regression’s “brain” — the log-likelihood function. Now it’s time to learn how the model actually learns.

That’s where Gradient Descent enters the stage — the ultimate self-improvement algorithm.

In simple terms, Gradient Descent is like a blindfolded mountain climber who wants to find the lowest valley (minimum loss). He doesn’t see the full mountain, but he can feel the slope — the gradient — and takes small steps downhill until he reaches the bottom.


Simple Analogy: Imagine you’re on a hill at night with a flashlight. You can’t see the entire terrain, but you can sense which direction slopes downward. You take a small step in that direction… and repeat. Eventually, you find the lowest point — that’s how your Logistic Regression finds the best parameters ($\beta$).


🌱 Step 2: Core Concept

Let’s walk step-by-step through how Logistic Regression learns using Gradient Descent.


What’s Happening Under the Hood?

Every training iteration does three key things:

  1. Predict: Compute the predicted probability for each data point using the sigmoid function: $\hat{y} = \frac{1}{1 + e^{-X\beta}}$

  2. Measure Error: Use the cost function (negative log-likelihood) to measure how far predictions are from truth: $J(\beta) = -\frac{1}{m} \sum [y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})]$

  3. Update Parameters: Adjust each weight $\beta_j$ in the direction that reduces the cost: $\beta_j := \beta_j - \alpha \cdot \frac{\partial J}{\partial \beta_j}$

That’s one iteration of batch gradient descent — and since it uses all the data, one full pass over the dataset (an epoch). Repeat this process until the cost stops decreasing or changes only slightly.
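The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and toy shapes are my own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y, beta, alpha):
    """One training iteration: predict, measure error, update parameters."""
    m = X.shape[0]
    y_hat = sigmoid(X @ beta)                                          # 1. Predict
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # 2. Measure error
    gradient = X.T @ (y_hat - y) / m                                   # 3. Gradient of the cost
    beta = beta - alpha * gradient                                     #    Step downhill
    return beta, cost
```

Calling this function in a loop and watching `cost` shrink is, in essence, the whole training procedure.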


Why It Works This Way

Gradient Descent works because it follows the steepest downward direction in the loss landscape. The gradient $\frac{\partial J}{\partial \beta_j}$ tells us which way is uphill — so we subtract it to go downhill.

The learning rate ($\alpha$) controls how big each step is:

  • Too large → you might overshoot the minimum and bounce around endlessly.
  • Too small → you’ll crawl painfully slowly toward convergence.

Good practice: Start with a small $\alpha$ (like 0.01), monitor convergence, and adjust if needed.

How It Fits in ML Thinking

Gradient Descent is the heartbeat of modern machine learning — from simple Logistic Regression to massive neural networks with billions of parameters.

In Logistic Regression, it’s especially elegant because:

  • The cost function is convex (a single valley),
  • So Gradient Descent is guaranteed to reach the best solution.

This makes it a perfect “sandbox” to learn the fundamentals before tackling deep learning.


📐 Step 3: Mathematical Foundation

Let’s revisit the math gently — focusing on meaning, not memorization.


Gradient Descent Update Rule
$$ \beta := \beta - \alpha \cdot \frac{1}{m} X^T (\hat{y} - y) $$
  • $\beta$ → parameter vector (weights)
  • $\alpha$ → learning rate
  • $X^T$ → transpose of feature matrix (to align dimensions)
  • $(\hat{y} - y)$ → error vector (difference between predicted and actual labels)

This is the vectorized update — one elegant equation that updates all parameters simultaneously.

Think of it like “nudging” each weight in proportion to how much it contributed to the error — the bigger the mistake, the stronger the adjustment.
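Putting the vectorized update inside a loop gives a complete from-scratch trainer. A minimal sketch (the function name, tolerance, and epoch cap are my own choices, not canonical values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, max_epochs=10_000, tol=1e-7):
    """Batch gradient descent using the vectorized update rule."""
    m, n = X.shape
    beta = np.zeros(n)
    prev_cost = np.inf
    for _ in range(max_epochs):
        y_hat = sigmoid(X @ beta)
        cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        beta = beta - alpha * X.T @ (y_hat - y) / m   # beta := beta - alpha * (1/m) X^T (y_hat - y)
        if prev_cost - cost < tol:                    # stop when the cost barely changes
            break
        prev_cost = cost
    return beta
```

Note how the single matrix expression `X.T @ (y_hat - y) / m` updates every parameter at once — no per-weight loop needed.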

Batch vs Stochastic vs Mini-Batch

There are three ways to compute these updates:

| Type | Data Used per Step | Pros | Cons |
|---|---|---|---|
| Batch GD | All training samples | Stable convergence | Slow on large data |
| Stochastic GD (SGD) | One sample at a time | Fast updates, can escape local traps | Noisy updates |
| Mini-Batch GD | Small random subset (e.g., 32 samples) | Best balance between speed & stability | Needs tuning of batch size |

Batch = slow but smooth, SGD = fast but chaotic, Mini-batch = the happy middle ground.
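The mini-batch variant is only a small change to the batch version: shuffle the data each epoch, then update once per batch. A hedged sketch (the helper name and defaults are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_epoch(X, y, beta, alpha, batch_size=32, rng=None):
    """One epoch of mini-batch gradient descent: shuffle, then update per batch."""
    rng = rng or np.random.default_rng(0)
    m = X.shape[0]
    idx = rng.permutation(m)                 # shuffle to avoid biased updates
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        y_hat = sigmoid(Xb @ beta)
        beta = beta - alpha * Xb.T @ (y_hat - yb) / len(batch)
    return beta
```

Setting `batch_size=m` recovers Batch GD, and `batch_size=1` recovers SGD — all three flavors are the same update applied to different slices of data.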

🧠 Step 4: Assumptions or Key Ideas

  • The cost function must be differentiable (which it is).
  • The learning rate stays small enough to ensure stable convergence.
  • Features are typically scaled (standardized), so no single feature dominates updates.
  • Data should be randomized to avoid biased updates (especially in mini-batch or SGD).
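Feature scaling, the third point above, is a one-liner with NumPy — standardize each column to zero mean and unit variance, and remember to reuse the training statistics at prediction time (the function name here is my own):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance so no feature dominates the gradient."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# At prediction time, apply the SAME mu and sigma to new data:
# X_new_scaled = (X_new - mu) / sigma
```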

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Simple yet powerful optimization approach.
  • Works efficiently on convex problems like Logistic Regression.
  • Forms the foundation of deep learning optimizers (e.g., Adam, RMSProp).
  • Sensitive to learning rate choice — too high or too low both cause trouble.
  • May converge slowly if data is not scaled or features vary widely.
  • Batch GD is computationally expensive for large datasets.
Gradient Descent is a trade-off between speed and stability. It’s like walking downhill — taking huge leaps (fast but risky) or tiny careful steps (safe but slow). Tuning $\alpha$ and batch size decides how gracefully you descend the loss valley.

🚧 Step 6: Common Misunderstandings

  • “Gradient Descent always finds the best minimum.” → Only true for convex functions like in Logistic Regression. Not guaranteed in neural networks.
  • “Bigger learning rate means faster training.” → Often causes oscillation or divergence instead.
  • “Feature scaling is optional.” → Without it, some features dominate updates, making convergence painfully slow.

🧩 Step 7: Mini Summary

🧠 What You Learned: Gradient Descent is the learning engine of Logistic Regression — it iteratively updates parameters to minimize the cost.

⚙️ How It Works: It uses the gradient (slope) of the cost surface to take small steps toward the minimum.

🎯 Why It Matters: Mastering Gradient Descent builds the foundation for understanding optimization in deep learning, SVMs, and beyond.
