Scratch Implementation in Python: Linear Regression


🎯 Core Idea

Implementing linear regression from scratch using NumPy means writing the fit (training) and predict (inference) methods yourself—without relying on scikit-learn. The essence is:

  • Fit: learn parameters $\mathbf{w}, b$ by minimizing the loss function using gradient descent.
  • Predict: use learned parameters to estimate outputs.

This exercise ensures you truly understand the mechanics—how math translates into code.
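As a concrete starting point, here is a minimal sketch of the interface this exercise builds toward. The class name `LinearRegressionScratch` and its hyperparameters are illustrative choices, not a fixed API:

```python
import numpy as np

class LinearRegressionScratch:
    """Linear regression trained from scratch with gradient descent (illustrative sketch)."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr            # learning rate (eta)
        self.n_iters = n_iters  # number of gradient-descent steps
        self.w = None           # weight vector, shape (d,)
        self.b = 0.0            # bias, a scalar

    def fit(self, X, y):
        # learn self.w and self.b by minimizing the MSE loss
        # (the gradient-descent loop is spelled out in the sections below)
        ...

    def predict(self, X):
        # y_hat = X w + b, vectorized over all samples
        return X @ self.w + self.b
```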


🌱 Intuition & Real-World Analogy

Think of this as teaching a robot to throw darts at a target:

  • Each dart = prediction.
  • Target = true value.
  • Miss distance = error (loss).
  • Robot adjusts hand position slightly after each throw (gradient update) until darts consistently land near the bullseye.

Another analogy: cooking by tasting. You try a recipe, taste it (loss), adjust seasoning (parameters), repeat. Gradient descent is just systematic tasting and adjusting.


📐 Mathematical Foundation

Model:

$$ \hat{y} = X \mathbf{w} + b $$
  • $X \in \mathbb{R}^{n \times d}$: feature matrix (n samples, d features)
  • $\mathbf{w} \in \mathbb{R}^{d}$: weight vector
  • $b \in \mathbb{R}$: bias
  • $\hat{y} \in \mathbb{R}^{n}$: predictions

Loss Function (MSE):

$$ J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^n \left( \hat{y}_i - y_i \right)^2 $$

The factor $\frac{1}{2n}$ averages the loss over the $n$ samples, and the extra $\frac{1}{2}$ cancels the 2 that appears when differentiating the square.
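A minimal NumPy sketch of the model and this loss (the function and variable names are illustrative):

```python
import numpy as np

def predict(X, w, b):
    # y_hat = X w + b, computed for all n samples at once
    return X @ w + b

def mse_loss(X, y, w, b):
    # J(w, b) = (1 / 2n) * sum_i (y_hat_i - y_i)^2
    n = X.shape[0]
    residual = predict(X, w, b) - y
    return (residual @ residual) / (2 * n)
```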

Gradients:

  • Gradient w.r.t. weights:
$$ \nabla_{\mathbf{w}} J = \frac{1}{n} X^\top (X \mathbf{w} + b - y) $$
  • Gradient w.r.t. bias:
$$ \nabla_{b} J = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i) $$
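In NumPy, both gradients reduce to a couple of vectorized lines; a sketch with illustrative names:

```python
import numpy as np

def gradients(X, y, w, b):
    # grad_w = (1/n) X^T (Xw + b - y),  grad_b = (1/n) sum_i (y_hat_i - y_i)
    n = X.shape[0]
    residual = X @ w + b - y        # shape (n,)
    grad_w = X.T @ residual / n     # shape (d,)
    grad_b = residual.mean()        # scalar
    return grad_w, grad_b
```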

Update Rule: For learning rate $\eta$:

$$ \mathbf{w} := \mathbf{w} - \eta \cdot \nabla_{\mathbf{w}} J $$

$$ b := b - \eta \cdot \nabla_{b} J $$

This is the heart of gradient descent. Each line in your code should map directly to one of these equations.
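A hedged sketch of how these updates look as a full training loop; the function name, learning rate, and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

def fit(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for linear regression (illustrative sketch)."""
    n, d = X.shape
    w = np.zeros(d)   # initialize weights
    b = 0.0           # initialize bias

    for _ in range(n_iters):
        residual = X @ w + b - y        # y_hat - y, shape (n,)
        grad_w = X.T @ residual / n     # gradient w.r.t. w
        grad_b = residual.mean()        # gradient w.r.t. b
        w -= lr * grad_w                # w := w - eta * grad_w
        b -= lr * grad_b                # b := b - eta * grad_b
    return w, b

# Usage on synthetic data: y = 3*x1 - 2*x2 + 1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.normal(size=200)
w, b = fit(X, y, lr=0.1, n_iters=2000)
print(w, b)   # should approach [3, -2] and 1
```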

📉 Step-by-Step Gradient Derivation
  1. Expand loss: $J = \frac{1}{2n}\|X\mathbf{w} + b - y\|^2$
  2. Differentiate w.r.t. $\mathbf{w}$: apply the chain rule → $\frac{1}{n}X^\top(X\mathbf{w}+b-y)$
  3. Differentiate w.r.t. $b$: → $\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)$
  4. Plug into update rule.
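If you want to verify the derivation numerically, one option is a finite-difference check of the weight gradient. A minimal sketch, assuming a small random problem and an illustrative step size and tolerance:

```python
import numpy as np

def mse_loss(X, y, w, b):
    residual = X @ w + b - y
    return (residual @ residual) / (2 * len(y))

def numerical_grad_w(X, y, w, b, eps=1e-6):
    # central differences, one coordinate of w at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad[j] = (mse_loss(X, y, w + e, b) - mse_loss(X, y, w - e, b)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w, b = rng.normal(size=3), 0.5

analytic = X.T @ (X @ w + b - y) / len(y)
numeric = numerical_grad_w(X, y, w, b)
print(np.allclose(analytic, numeric, atol=1e-5))  # expect True
```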

⚖️ Strengths, Limitations & Trade-offs

Strengths:

  • Forces you to understand the mechanics, not just use libraries.
  • NumPy vectorization makes training efficient.
  • Debugging loss progression teaches convergence behavior.

Limitations:

  • Not production-ready (scikit-learn or PyTorch is far more optimized).
  • Numerical stability issues if features are not normalized.
  • Sensitive to learning rate $\eta$: too high → divergence, too low → slow.

Trade-offs:

  • Batch vs. Stochastic Gradient Descent:

    • Batch GD: stable, but expensive for large datasets.
    • Stochastic GD: faster updates, noisier convergence.
    • Mini-batch GD: a compromise between the two, common in practice (see the sketch below).
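A brief mini-batch sketch, assuming one shuffle per epoch; the batch size and learning rate here are illustrative:

```python
import numpy as np

def fit_minibatch(X, y, lr=0.05, n_epochs=50, batch_size=32, seed=0):
    # mini-batch gradient descent: shuffle each epoch, update on small chunks
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                    # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w + b - y[idx]
            w -= lr * (X[idx].T @ residual) / len(idx)  # gradient of this batch only
            b -= lr * residual.mean()
    return w, b
```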

🔍 Variants & Extensions

  • Stochastic Gradient Descent (SGD): Updates after each sample.

  • Mini-Batch Gradient Descent: Updates on chunks of data.

  • Momentum / Adam Optimizers: Speed up convergence with adaptive updates.

  • Closed-form Solution (Normal Equation):

    $$ \mathbf{w} = (X^\top X)^{-1}X^\top y $$

    Efficient for small $d$, but impractical for high-dimensional data due to the cost of inverting $X^\top X$. The bias is typically absorbed by appending a column of ones to $X$, as sketched below.
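A minimal sketch of the closed form in NumPy, assuming the ones-column convention for the bias; `np.linalg.solve` is used instead of an explicit inverse for better numerical behavior:

```python
import numpy as np

def fit_normal_equation(X, y):
    # append a column of ones so the bias becomes the last entry of theta
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # solve (X^T X) theta = X^T y rather than forming an explicit inverse
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]   # w, b

# For rank-deficient X, np.linalg.lstsq(Xb, y, rcond=None) is more robust.
```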


🚧 Common Challenges & Pitfalls

  • Naive Python loops: Inefficient—always prefer vectorized NumPy operations.

  • Unscaled features: Large-magnitude features dominate the updates → normalize/standardize (see the sketch after this list).

  • Debugging: If loss isn’t decreasing, check:

    • Learning rate too high/low.
    • Gradient formula signs.
    • Data preprocessing (bias term handling).
  • Overfitting: If weights grow large, consider adding regularization (L2 penalty).
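Two of the fixes above in sketch form: per-feature standardization and an L2-regularized gradient step. The penalty strength `lam` and its $\frac{1}{n}$ scaling are illustrative choices:

```python
import numpy as np

def standardize(X):
    # zero mean, unit variance per feature; reuse mu and sigma on any test data
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def ridge_step(X, y, w, b, lr=0.01, lam=0.1):
    # one gradient step on MSE plus an L2 penalty (lam / 2n) * ||w||^2
    n = X.shape[0]
    residual = X @ w + b - y
    w = w - lr * (X.T @ residual / n + lam * w / n)   # L2 term shrinks the weights
    b = b - lr * residual.mean()                      # bias is usually left unregularized
    return w, b
```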

