Scratch Implementation in Python: Linear Regression
🎯 Core Idea
Implementing linear regression from scratch using NumPy means writing the `fit` (training) and `predict` (inference) methods yourself, without relying on scikit-learn. The essence is:
- Fit: learn parameters $\mathbf{w}, b$ by minimizing the loss function using gradient descent.
- Predict: use learned parameters to estimate outputs.
This exercise ensures you truly understand the mechanics—how math translates into code.
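As a rough sketch of that interface (the class and method names below are placeholders of my own, not taken from any library), the skeleton could look like this:

```python
import numpy as np

class LinearRegressionScratch:
    """Minimal from-scratch linear regression (illustrative skeleton)."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr            # learning rate (eta)
        self.n_iters = n_iters  # number of gradient descent steps
        self.w = None           # weight vector, learned in fit()
        self.b = 0.0            # bias term, learned in fit()

    def fit(self, X, y):
        """Learn w and b by gradient descent (the loop is spelled out below)."""
        ...

    def predict(self, X):
        """Estimate outputs with the learned parameters: y_hat = X w + b."""
        return X @ self.w + self.b
```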
🌱 Intuition & Real-World Analogy
Think of this as teaching a robot to throw darts at a target:
- Each dart = prediction.
- Target = true value.
- Miss distance = error (loss).
- Robot adjusts hand position slightly after each throw (gradient update) until darts consistently land near the bullseye.
Another analogy: cooking by tasting. You try a recipe, taste it (loss), adjust seasoning (parameters), repeat. Gradient descent is just systematic tasting and adjusting.
📐 Mathematical Foundation
Model:
$$ \hat{y} = X \mathbf{w} + b $$
- $X \in \mathbb{R}^{n \times d}$: feature matrix ($n$ samples, $d$ features)
- $\mathbf{w} \in \mathbb{R}^{d}$: weight vector
- $b \in \mathbb{R}$: bias
- $\hat{y} \in \mathbb{R}^{n}$: predictions
Loss Function (MSE):
$$ J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^n \left( \hat{y}_i - y_i \right)^2 $$
The factor $\frac{1}{2n}$ simplifies the derivative math.
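In NumPy, both the model and the loss are one-liners. A small sketch (the helper names are mine; shapes follow the definitions above):

```python
import numpy as np

def predict(X, w, b):
    # y_hat = X w + b, vectorized over all n samples
    return X @ w + b

def mse_loss(X, y, w, b):
    # J(w, b) = (1 / 2n) * sum((y_hat_i - y_i)^2)
    n = X.shape[0]
    residuals = predict(X, w, b) - y
    return (residuals ** 2).sum() / (2 * n)
```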
Gradients:
- Gradient w.r.t. weights: $\nabla_{\mathbf{w}} J = \frac{1}{n} X^\top (\hat{y} - y)$
- Gradient w.r.t. bias: $\nabla_{b} J = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)$
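These two formulas translate directly to NumPy. A hedged sketch:

```python
import numpy as np

def gradients(X, y, w, b):
    # Direct translation of the two gradient formulas above.
    n = X.shape[0]
    residuals = (X @ w + b) - y       # y_hat - y, shape (n,)
    grad_w = X.T @ residuals / n      # gradient w.r.t. weights, shape (d,)
    grad_b = residuals.sum() / n      # gradient w.r.t. bias, scalar
    return grad_w, grad_b
```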
Update Rule: For learning rate $\eta$:
$$ \mathbf{w} := \mathbf{w} - \eta \cdot \nabla_{\mathbf{w}} J $$
$$ b := b - \eta \cdot \nabla_{b} J $$
This is the heart of gradient descent. Each line in your code should map directly to one of these equations.
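Put together, the training loop is only a few lines. A minimal sketch of `fit`, with hyperparameter defaults that are illustrative rather than recommended:

```python
import numpy as np

def fit(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on the MSE loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []                            # track J per iteration to watch convergence
    for _ in range(n_iters):
        residuals = (X @ w + b) - y        # y_hat - y
        losses.append((residuals ** 2).sum() / (2 * n))
        w -= lr * (X.T @ residuals / n)    # w := w - eta * grad_w
        b -= lr * (residuals.sum() / n)    # b := b - eta * grad_b
    return w, b, losses
```

On standardized features, the recorded losses should decrease steadily once the learning rate is small enough.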
📉 Step-by-Step Gradient Derivation
- Expand the loss: $J = \frac{1}{2n}\|X\mathbf{w} + b - y\|^2$
- Differentiate w.r.t. $\mathbf{w}$: apply the chain rule → $\nabla_{\mathbf{w}} J = \frac{1}{n}X^\top(X\mathbf{w}+b-y)$
- Differentiate w.r.t. $b$: → $\nabla_{b} J = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)$
- Plug both gradients into the update rule.
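One way to sanity-check the derivation is to compare the analytic gradient against a central finite-difference approximation on random data (a sketch; the step size and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w, b, eps = rng.normal(size=3), 0.5, 1e-6

def loss(w, b):
    r = (X @ w + b) - y
    return (r ** 2).sum() / (2 * len(y))

# Analytic gradient w.r.t. w from the derivation above
grad_w = X.T @ ((X @ w + b) - y) / len(y)

# Central finite differences, one coordinate at a time
num_grad = np.array([
    (loss(w + eps * np.eye(3)[j], b) - loss(w - eps * np.eye(3)[j], b)) / (2 * eps)
    for j in range(3)
])

assert np.allclose(grad_w, num_grad, atol=1e-5)   # the two should agree closely
```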
⚖️ Strengths, Limitations & Trade-offs
Strengths:
- Forces you to understand the mechanics, not just use libraries.
- NumPy vectorization makes training efficient.
- Debugging loss progression teaches convergence behavior.
Limitations:
- Not production-ready (scikit-learn or PyTorch is far more optimized).
- Numerical stability issues if features are not normalized.
- Sensitive to learning rate $\eta$: too high → divergence, too low → slow.
Trade-offs:
- Batch vs. Stochastic Gradient Descent:
  - Batch GD: stable, but expensive for large datasets.
  - Stochastic GD: faster updates, noisier convergence.
  - Mini-batch GD: a compromise, common in practice (see the sketch below).
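To make the trade-off concrete, here is a hedged sketch of a mini-batch variant of the same update (the batch size and shuffling scheme are illustrative choices):

```python
import numpy as np

def fit_minibatch(X, y, lr=0.01, n_epochs=100, batch_size=32, seed=0):
    # Same update rule as batch GD, but each step uses a random chunk of the data.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = (X[idx] @ w + b) - y[idx]
            w -= lr * (X[idx].T @ residuals / len(idx))
            b -= lr * (residuals.sum() / len(idx))
    return w, b
```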
🔍 Variants & Extensions
- Stochastic Gradient Descent (SGD): Updates after each sample.
- Mini-Batch Gradient Descent: Updates on chunks of data.
- Momentum / Adam Optimizers: Speed up convergence with adaptive updates.
- Closed-form Solution (Normal Equation):
  $$ \mathbf{w} = (X^\top X)^{-1}X^\top y $$
  Efficient for small $d$, but impractical for high-dimensional data due to the matrix inversion cost.
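For small $d$ the closed form is easy to compute with NumPy. A sketch that folds the bias into the weights by appending a column of ones, and uses `np.linalg.lstsq` instead of an explicit inverse for numerical stability:

```python
import numpy as np

def fit_normal_equation(X, y):
    # Augment X with a column of ones so the bias b becomes the last weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solves the least-squares problem without explicitly forming (X^T X)^{-1}.
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta[:-1], theta[-1]          # (w, b)
```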
🚧 Common Challenges & Pitfalls
- Naive Python loops: Inefficient; always prefer vectorized NumPy operations.
- Unscaled features: Large-magnitude features dominate the updates; normalize/standardize first (see the sketch after this list).
- Debugging: If the loss isn’t decreasing, check:
  - Learning rate too high/low.
  - Gradient formula signs.
  - Data preprocessing (bias term handling).
- Overfitting: If weights grow large, consider adding regularization (an L2 penalty).
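A minimal preprocessing sketch, assuming training-set statistics are reused on new data (the trailing comment only indicates where an L2 penalty would enter the weight gradient; `lam` is a hypothetical knob):

```python
import numpy as np

def standardize(X_train, X_test):
    # Use training-set mean/std for both splits to avoid leakage.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0               # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# With an L2 penalty of strength lam, only the weight gradient changes:
#   grad_w = X.T @ residuals / n + lam * w
```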
📚 Reference Pointers
- Andrew Ng’s Notes on Gradient Descent
- Koller & Friedman’s PGM Text (for deeper optimization background)
- Wikipedia: Gradient Descent
- Deep Learning Book (Goodfellow et al.) — Chapter 8: Optimization