Scratch Implementation in Python: Linear Regression
🎯 Core Idea
Implementing linear regression from scratch using NumPy means writing the `fit` (training) and `predict` (inference) methods yourself, without relying on scikit-learn. The essence is:
- Fit: learn parameters $\mathbf{w}, b$ by minimizing the loss function using gradient descent.
- Predict: use learned parameters to estimate outputs.
This exercise ensures you truly understand the mechanics—how math translates into code.
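As a rough sketch of that interface (the class and method names below are placeholders of my own, not taken from any library), the skeleton could look like this:

```python
import numpy as np

class LinearRegressionScratch:
    """Minimal from-scratch linear regression (illustrative skeleton)."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr            # learning rate (eta)
        self.n_iters = n_iters  # number of gradient descent steps
        self.w = None           # weight vector, learned in fit()
        self.b = 0.0            # bias term, learned in fit()

    def fit(self, X, y):
        """Learn w and b by gradient descent (the loop is spelled out below)."""
        ...

    def predict(self, X):
        """Estimate outputs with the learned parameters: y_hat = X w + b."""
        return X @ self.w + self.b
```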
🌱 Intuition & Real-World Analogy
Think of this as teaching a robot to throw darts at a target:
- Each dart = prediction.
- Target = true value.
- Miss distance = error (loss).
- Robot adjusts hand position slightly after each throw (gradient update) until darts consistently land near the bullseye.
Another analogy: cooking by tasting. You try a recipe, taste it (loss), adjust seasoning (parameters), repeat. Gradient descent is just systematic tasting and adjusting.
📐 Mathematical Foundation
Model:
$$ \hat{y} = X \mathbf{w} + b $$
- $X \in \mathbb{R}^{n \times d}$: feature matrix ($n$ samples, $d$ features)
- $\mathbf{w} \in \mathbb{R}^{d}$: weight vector
- $b \in \mathbb{R}$: bias
- $\hat{y} \in \mathbb{R}^{n}$: predictions
Loss Function (MSE):
$$ J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^n \left( \hat{y}_i - y_i \right)^2 $$
The factor $\frac{1}{2n}$ simplifies the derivative math.
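In NumPy, both the model and the loss are one-liners. A small sketch (the helper names are mine; shapes follow the definitions above):

```python
import numpy as np

def predict(X, w, b):
    # y_hat = X w + b, vectorized over all n samples
    return X @ w + b

def mse_loss(X, y, w, b):
    # J(w, b) = (1 / 2n) * sum((y_hat_i - y_i)^2)
    n = X.shape[0]
    residuals = predict(X, w, b) - y
    return (residuals ** 2).sum() / (2 * n)
```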
Gradients:
- Gradient w.r.t. weights: $\nabla_{\mathbf{w}} J = \frac{1}{n} X^\top (\hat{y} - y)$
- Gradient w.r.t. bias: $\nabla_{b} J = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)$
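These two formulas translate directly to NumPy. A hedged sketch:

```python
import numpy as np

def gradients(X, y, w, b):
    # Direct translation of the two gradient formulas above.
    n = X.shape[0]
    residuals = (X @ w + b) - y       # y_hat - y, shape (n,)
    grad_w = X.T @ residuals / n      # gradient w.r.t. weights, shape (d,)
    grad_b = residuals.sum() / n      # gradient w.r.t. bias, scalar
    return grad_w, grad_b
```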
Update Rule: For learning rate $\eta$:
$$ \mathbf{w} := \mathbf{w} - \eta \cdot \nabla_{\mathbf{w}} J $$
$$ b := b - \eta \cdot \nabla_{b} J $$
This is the heart of gradient descent. Each line in your code should map directly to one of these equations.
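Put together, the training loop is only a few lines. A minimal sketch of `fit`, with hyperparameter defaults that are illustrative rather than recommended:

```python
import numpy as np

def fit(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on the MSE loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []                            # track J per iteration to watch convergence
    for _ in range(n_iters):
        residuals = (X @ w + b) - y        # y_hat - y
        losses.append((residuals ** 2).sum() / (2 * n))
        w -= lr * (X.T @ residuals / n)    # w := w - eta * grad_w
        b -= lr * (residuals.sum() / n)    # b := b - eta * grad_b
    return w, b, losses
```

On standardized features, the recorded losses should decrease steadily once the learning rate is small enough.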
📉 Step-by-Step Gradient Derivation
- Expand the loss: $J = \frac{1}{2n}\|X\mathbf{w} + b - y\|^2$
- Differentiate w.r.t. $\mathbf{w}$: apply the chain rule → $\nabla_{\mathbf{w}} J = \frac{1}{n}X^\top(X\mathbf{w}+b-y)$
- Differentiate w.r.t. $b$: → $\nabla_{b} J = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)$
- Plug both gradients into the update rule.
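One way to sanity-check the derivation is to compare the analytic gradient against a central finite-difference approximation on random data (a sketch; the step size and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w, b, eps = rng.normal(size=3), 0.5, 1e-6

def loss(w, b):
    r = (X @ w + b) - y
    return (r ** 2).sum() / (2 * len(y))

# Analytic gradient w.r.t. w from the derivation above
grad_w = X.T @ ((X @ w + b) - y) / len(y)

# Central finite differences, one coordinate at a time
num_grad = np.array([
    (loss(w + eps * np.eye(3)[j], b) - loss(w - eps * np.eye(3)[j], b)) / (2 * eps)
    for j in range(3)
])

assert np.allclose(grad_w, num_grad, atol=1e-5)   # the two should agree closely
```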
⚖️ Strengths, Limitations & Trade-offs
Strengths:
- Forces you to understand the mechanics, not just use libraries.
- NumPy vectorization makes training efficient.
- Debugging loss progression teaches convergence behavior.
Limitations:
- Not production-ready (scikit-learn or PyTorch is far more optimized).
- Numerical stability issues if features are not normalized.
- Sensitive to learning rate $\eta$: too high → divergence, too low → slow.
Trade-offs:
- Batch vs. Stochastic Gradient Descent:
  - Batch GD: stable, but expensive for large datasets.
  - Stochastic GD: faster updates, noisier convergence.
  - Mini-batch GD: a compromise, common in practice (see the sketch below).
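To make the trade-off concrete, here is a hedged sketch of a mini-batch variant of the same update (the batch size and shuffling scheme are illustrative choices):

```python
import numpy as np

def fit_minibatch(X, y, lr=0.01, n_epochs=100, batch_size=32, seed=0):
    # Same update rule as batch GD, but each step uses a random chunk of the data.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = (X[idx] @ w + b) - y[idx]
            w -= lr * (X[idx].T @ residuals / len(idx))
            b -= lr * (residuals.sum() / len(idx))
    return w, b
```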
🔍 Variants & Extensions
- Stochastic Gradient Descent (SGD): Updates after each sample.
- Mini-Batch Gradient Descent: Updates on chunks of data.
- Momentum / Adam Optimizers: Speed up convergence with adaptive updates.
- Closed-form Solution (Normal Equation):
  $$ \mathbf{w} = (X^\top X)^{-1}X^\top y $$
  Efficient for small $d$, but impractical for high-dimensional data due to the matrix inversion cost.
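For small $d$ the closed form is easy to compute with NumPy. A sketch that folds the bias into the weights by appending a column of ones, and uses `np.linalg.lstsq` instead of an explicit inverse for numerical stability:

```python
import numpy as np

def fit_normal_equation(X, y):
    # Augment X with a column of ones so the bias b becomes the last weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solves the least-squares problem without explicitly forming (X^T X)^{-1}.
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta[:-1], theta[-1]          # (w, b)
```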
🚧 Common Challenges & Pitfalls
- Naive Python loops: Inefficient; always prefer vectorized NumPy operations.
- Unscaled features: Large-magnitude features dominate the updates; normalize/standardize first (see the sketch after this list).
- Debugging: If the loss isn’t decreasing, check:
  - Learning rate too high/low.
  - Gradient formula signs.
  - Data preprocessing (bias term handling).
- Overfitting: If weights grow large, consider adding regularization (an L2 penalty).
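A minimal preprocessing sketch, assuming training-set statistics are reused on new data (the trailing comment only indicates where an L2 penalty would enter the weight gradient; `lam` is a hypothetical knob):

```python
import numpy as np

def standardize(X_train, X_test):
    # Use training-set mean/std for both splits to avoid leakage.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0               # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# With an L2 penalty of strength lam, only the weight gradient changes:
#   grad_w = X.T @ residuals / n + lam * w
```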
📚 Reference Pointers
- Andrew Ng’s Notes on Gradient Descent
- Koller & Friedman’s PGM Text (for deeper optimization background)
- Wikipedia: Gradient Descent
- Deep Learning Book (Goodfellow et al.) — Chapter 8: Optimization