Scratch Implementation in Python: Linear Regression
🪄 Step 1: Intuition & Motivation
Core Idea: So far, we’ve talked theory — what Linear Regression is and how it learns. Now, let’s translate that knowledge into an actual implementation using NumPy. You’ll see how math becomes code: from prediction equations to gradient updates.
Simple Analogy: Think of this as teaching a robot to draw your best-fit line by hand. You show it the formula for error, explain how to adjust its line every step, and then just let it practice — improving each time it redraws the line a little closer to perfect.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
We want to train a Linear Regression model from scratch.
That means — no scikit-learn magic, no helper functions — just raw NumPy operations that mimic how the math works.
The process follows three main steps:
- Initialize parameters ($\beta$): Start with random or zero weights.
- Iteratively update weights: Using Gradient Descent — adjust $\beta$ to reduce the cost (MSE).
- Predict: Once trained, use $\hat{y} = X\beta$ to make predictions.
Each code line you’ll write is a direct translation of a mathematical expression you’ve already learned.
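Before diving into each formula, here is a minimal sketch of the loop we will build, assuming a NumPy feature matrix `X` (already including a bias column) and a target vector `y`; the function name and hyperparameter defaults are illustrative choices, not a fixed API.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Sketch of the training loop: predict -> loss -> gradient -> update."""
    n, d = X.shape
    beta = np.zeros(d)                          # 1) initialize parameters
    for _ in range(n_iters):                    # 2) iteratively update weights
        y_pred = X.dot(beta)                    #    prediction: y_hat = X @ beta
        grad = -(2 / n) * X.T.dot(y - y_pred)   #    gradient of MSE w.r.t. beta
        beta -= lr * grad                       #    step against the gradient
    return beta                                 # 3) predict later with X_new @ beta
```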
Why It Works This Way
The whole point of implementing from scratch is to understand how learning happens.
Libraries like sklearn.LinearRegression hide the details — they either use the closed-form OLS solution or iterative gradient-based updates.
When you code it manually, you’re watching learning unfold step by step:
- You compute loss.
- You compute its slope (gradient).
- You take a step in the opposite direction.
- The loss shrinks — your line fits better.
That’s ML’s heartbeat in its purest form.
How It Fits in ML Thinking
Every backpropagation algorithm in deep learning is built on the exact same principles — compute gradients, update parameters, and repeat until convergence.
📐 Step 3: Mathematical Foundation
We’ll map each math formula to its corresponding code operation.
Prediction Step
$$ \hat{y} = X\beta $$

In code:

```python
y_pred = X.dot(beta)
```

- $X$: matrix of input features.
- $\beta$: vector of coefficients (weights).
- $y_{pred}$: predicted outputs.
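One practical detail this formula assumes: for the model to learn an intercept, $X$ needs a leading column of ones so that the first entry of $\beta$ acts as the bias. A tiny sketch with made-up numbers:

```python
import numpy as np

X_raw = np.array([[1.0], [2.0], [3.0]])                 # one feature, three samples
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])    # prepend a bias column of ones
beta = np.array([0.5, 2.0])                             # [intercept, slope]
y_pred = X.dot(beta)                                    # -> [2.5, 4.5, 6.5]
```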
Loss Function (MSE)
$$ J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

In code:

```python
loss = np.mean((y - y_pred) ** 2)
```

- Measures the average squared error between actual and predicted values.

This gives the “pain score” of your model. Smaller loss = happier model.
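A quick worked check with made-up numbers makes the “pain score” concrete:

```python
import numpy as np

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])
loss = np.mean((y - y_pred) ** 2)   # ((-0.5)^2 + 0^2 + 0.5^2) / 3 ≈ 0.167
```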
Gradient Calculation
The gradient of MSE w.r.t. $\beta$ is:
$$ \frac{\partial J}{\partial \beta} = -\frac{2}{n} X^T (y - \hat{y}) $$

In code:

```python
grad = -(2/n) * X.T.dot(y - y_pred)
```

- It tells us how much to adjust each $\beta$ to reduce the loss.

Think of the gradient as a GPS pointing uphill; we move in the opposite direction (downhill) to minimize loss.
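If you want to convince yourself the analytic gradient is right, a finite-difference check is a handy (optional) sanity test; this sketch uses random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
beta = rng.normal(size=d)

# Analytic gradient of the MSE at this beta
y_pred = X.dot(beta)
grad = -(2 / n) * X.T.dot(y - y_pred)

# Central finite-difference approximation, one coordinate at a time
def mse(b):
    return np.mean((y - X.dot(b)) ** 2)

eps = 1e-6
grad_fd = np.array([
    (mse(beta + eps * np.eye(d)[j]) - mse(beta - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])

print(np.allclose(grad, grad_fd, atol=1e-5))  # should print True
```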
Weight Update Rule
$$ \beta := \beta - \eta \frac{\partial J}{\partial \beta} $$

In code:

```python
beta -= lr * grad
```

- $\eta$ (`lr`): the learning rate, which controls how big a step we take each iteration.

Each iteration, you tweak the slope and intercept slightly, like micro-adjusting your ruler until it fits perfectly through the cloud of points.
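To see a single update in isolation, here is one hand-traceable step on a made-up one-feature problem (no bias term, just a slope):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])                    # true slope is 2
beta = np.array([0.0])
n, lr = len(y), 0.1

y_pred = X.dot(beta)                             # [0, 0, 0]
loss_before = np.mean((y - y_pred) ** 2)         # 56/3 ≈ 18.67
grad = -(2 / n) * X.T.dot(y - y_pred)            # ≈ [-18.67]
beta -= lr * grad                                # beta ≈ [1.87]
loss_after = np.mean((y - X.dot(beta)) ** 2)     # ≈ 0.083, already much smaller
```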
🧠 Step 4: Implementation Logic (Step-by-Step Pseudocode)
Here’s the entire flow explained in plain English, before we even touch code.
1️⃣ Start with initial weights:
- Set all $\beta$ to zero (or small random numbers).
2️⃣ Loop through many iterations:
- Compute predictions using $\hat{y} = X\beta$.
- Calculate the loss (MSE).
- Compute gradient: $\frac{\partial J}{\partial \beta} = -\frac{2}{n}X^T(y - \hat{y})$.
- Update weights: $\beta := \beta - \eta \cdot \text{gradient}$.
- Optionally print the loss to track improvement.
3️⃣ Stop at convergence:
- The loss changes very little between iterations (or you hit a fixed iteration budget).
4️⃣ Use final $\beta$ for prediction.
This is exactly what every ML model does — even the fancy ones.
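Putting the pseudocode together, one possible from-scratch version looks like this; the class name, the synthetic data, and the hyperparameter values are illustrative choices rather than a fixed recipe.

```python
import numpy as np

class LinearRegressionScratch:
    def __init__(self, lr=0.05, n_iters=2000):
        self.lr = lr
        self.n_iters = n_iters
        self.beta = None
        self.losses = []

    @staticmethod
    def _add_bias(X):
        # Prepend a column of ones so the intercept is learned inside beta.
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_bias(X)
        n, d = Xb.shape
        self.beta = np.zeros(d)                             # 1) initialize
        for _ in range(self.n_iters):                       # 2) iterate
            y_pred = Xb.dot(self.beta)                      #    predict
            self.losses.append(np.mean((y - y_pred) ** 2))  #    track MSE
            grad = -(2 / n) * Xb.T.dot(y - y_pred)          #    gradient
            self.beta -= self.lr * grad                     #    update
        return self

    def predict(self, X):
        return self._add_bias(X).dot(self.beta)             # 3) y_hat = X @ beta


# Usage on synthetic data: y = 3 + 2*x + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = LinearRegressionScratch().fit(X, y)
print(model.beta)                           # expect roughly [3.0, 2.0]
print(model.losses[0], model.losses[-1])    # loss should drop sharply
```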
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Helps you understand what libraries do internally.
- Great for debugging your intuition about learning and convergence.
- Easy to visualize improvement over iterations.
Limitations:
- Naive Python loops are slow; prefer vectorized NumPy operations.
- Learning rate tuning is still necessary.
- Training may diverge if you forget feature scaling or use a bad initialization.
The trade-off: coding from scratch (full visibility into every step of learning) vs. full automation (libraries abstract everything away).
Mastery means knowing when to code from scratch and when to use a toolbox.
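One concrete way to weigh "from scratch vs. toolbox" is to check the gradient-descent answer against the closed-form least-squares solution that libraries compute directly; a self-contained sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # bias column + feature

# Gradient-descent answer (from scratch)
n, beta, lr = len(y), np.zeros(2), 0.05
for _ in range(2000):
    beta -= lr * (-(2 / n) * Xb.T.dot(y - Xb.dot(beta)))

# Closed-form least-squares answer (what a "toolbox" computes directly)
beta_closed, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(beta, beta_closed)                          # both roughly [3, 2]
print(np.allclose(beta, beta_closed, atol=1e-2))  # should print True
```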
🚧 Step 6: Common Misunderstandings
“It’s okay to skip feature scaling.”
Not for Gradient Descent! Different scales make convergence painfully slow.

“More iterations = better model.”
Not always. If your learning rate is too high, extra iterations just make it diverge faster.

“The loss should always decrease smoothly.”
Tiny oscillations are normal, especially with larger learning rates or noisy data.
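To make the learning-rate point concrete, here is a small illustrative experiment: the same data and loop run with two learning rates, where the larger one makes the loss explode instead of shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=100)
Xb = np.hstack([np.ones((100, 1)), X])
n = len(y)

def final_loss(lr, n_iters=200):
    beta = np.zeros(2)
    for _ in range(n_iters):
        y_pred = Xb.dot(beta)
        beta -= lr * (-(2 / n) * Xb.T.dot(y - y_pred))
    return np.mean((y - Xb.dot(beta)) ** 2)

print(final_loss(0.05))   # converges: small loss, close to the noise level
print(final_loss(0.5))    # diverges: loss blows up (may overflow to inf/nan)
```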
🧩 Step 7: Mini Summary
🧠 What You Learned: You now know how to code Linear Regression from scratch — transforming math into NumPy logic line by line.
⚙️ How It Works: Predict → measure loss → compute gradient → update weights → repeat until happy.
🎯 Why It Matters: This hands-on insight demystifies “learning” — the same principle that drives neural networks, gradient boosting, and every optimization-based ML model.