Scratch Implementation in Python: Linear Regression
🪄 Step 1: Intuition & Motivation
Core Idea: So far, we’ve talked theory — what Linear Regression is and how it learns. Now, let’s translate that knowledge into an actual implementation using NumPy. You’ll see how math becomes code: from prediction equations to gradient updates.
Simple Analogy: Think of this as teaching a robot to draw your best-fit line by hand. You show it the formula for error, explain how to adjust its line every step, and then just let it practice — improving each time it redraws the line a little closer to perfect.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
We want to train a Linear Regression model from scratch.
That means — no scikit-learn magic, no helper functions — just raw NumPy operations that mimic how the math works.
The process follows three main steps:
- Initialize parameters ($\beta$): Start with random or zero weights.
- Iteratively update weights: Using Gradient Descent — adjust $\beta$ to reduce the cost (MSE).
- Predict: Once trained, use $\hat{y} = X\beta$ to make predictions.
Each code line you’ll write is a direct translation of a mathematical expression you’ve already learned.
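Before diving into each formula, here is a minimal sketch of the loop we will build, assuming a NumPy feature matrix `X` (already including a bias column) and a target vector `y`; the function name and hyperparameter defaults are illustrative choices, not a fixed API.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Sketch of the training loop: predict -> loss -> gradient -> update."""
    n, d = X.shape
    beta = np.zeros(d)                          # 1) initialize parameters
    for _ in range(n_iters):                    # 2) iteratively update weights
        y_pred = X.dot(beta)                    #    prediction: y_hat = X @ beta
        grad = -(2 / n) * X.T.dot(y - y_pred)   #    gradient of MSE w.r.t. beta
        beta -= lr * grad                       #    step against the gradient
    return beta                                 # 3) predict later with X_new @ beta
```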
Why It Works This Way
The whole point of implementing from scratch is to understand how learning happens.
Libraries like sklearn.LinearRegression hide the details — they either use the closed-form OLS solution or iterative gradient-based updates.
When you code it manually, you’re watching learning unfold step by step:
- You compute loss.
- You compute its slope (gradient).
- You take a step in the opposite direction.
- The loss shrinks — your line fits better.
That’s ML’s heartbeat in its purest form.
How It Fits in ML Thinking
Every backpropagation algorithm in deep learning is built on the exact same principles — compute gradients, update parameters, and repeat until convergence.
📐 Step 3: Mathematical Foundation
We’ll map each math formula to its corresponding code operation.
Prediction Step
$$ \hat{y} = X\beta $$

In code:

```python
y_pred = X.dot(beta)
```

- $X$: matrix of input features.
- $\beta$: vector of coefficients (weights).
- $y_{pred}$: predicted outputs.
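One practical detail this formula assumes: for the model to learn an intercept, $X$ needs a leading column of ones so that the first entry of $\beta$ acts as the bias. A tiny sketch with made-up numbers:

```python
import numpy as np

X_raw = np.array([[1.0], [2.0], [3.0]])                 # one feature, three samples
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])    # prepend a bias column of ones
beta = np.array([0.5, 2.0])                             # [intercept, slope]
y_pred = X.dot(beta)                                    # -> [2.5, 4.5, 6.5]
```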
Loss Function (MSE)
$$ J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

In code:

```python
loss = np.mean((y - y_pred) ** 2)
```

- Measures the average squared error between actual and predicted values.

This gives the “pain score” of your model. Smaller loss = happier model.
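A quick worked check with made-up numbers makes the “pain score” concrete:

```python
import numpy as np

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])
loss = np.mean((y - y_pred) ** 2)   # ((-0.5)^2 + 0^2 + 0.5^2) / 3 ≈ 0.167
```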
Gradient Calculation
The gradient of MSE w.r.t. $\beta$ is:
$$ \frac{\partial J}{\partial \beta} = -\frac{2}{n} X^T (y - \hat{y}) $$

In code:

```python
grad = -(2/n) * X.T.dot(y - y_pred)
```

- It tells us how much to adjust each $\beta$ to reduce the loss.

Think of the gradient as a GPS pointing uphill; we move in the opposite direction (downhill) to minimize loss.
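If you want to convince yourself the analytic gradient is right, a finite-difference check is a handy (optional) sanity test; this sketch uses random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
beta = rng.normal(size=d)

# Analytic gradient of the MSE at this beta
y_pred = X.dot(beta)
grad = -(2 / n) * X.T.dot(y - y_pred)

# Central finite-difference approximation, one coordinate at a time
def mse(b):
    return np.mean((y - X.dot(b)) ** 2)

eps = 1e-6
grad_fd = np.array([
    (mse(beta + eps * np.eye(d)[j]) - mse(beta - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])

print(np.allclose(grad, grad_fd, atol=1e-5))  # should print True
```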
Weight Update Rule
$$ \beta := \beta - \eta \frac{\partial J}{\partial \beta} $$

In code:

```python
beta -= lr * grad
```

- $\eta$ (`lr`): the learning rate, which controls how big a step we take each iteration.

Each iteration, you tweak the slope and intercept slightly, like micro-adjusting your ruler until it fits perfectly through the cloud of points.
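To see a single update in isolation, here is one hand-traceable step on a made-up one-feature problem (no bias term, just a slope):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])                    # true slope is 2
beta = np.array([0.0])
n, lr = len(y), 0.1

y_pred = X.dot(beta)                             # [0, 0, 0]
loss_before = np.mean((y - y_pred) ** 2)         # 56/3 ≈ 18.67
grad = -(2 / n) * X.T.dot(y - y_pred)            # ≈ [-18.67]
beta -= lr * grad                                # beta ≈ [1.87]
loss_after = np.mean((y - X.dot(beta)) ** 2)     # ≈ 0.083, already much smaller
```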
🧠 Step 4: Implementation Logic (Step-by-Step Pseudocode)
Here’s the entire flow explained in plain English, before we even touch code.
1️⃣ Start with initial weights:
- Set all $\beta$ to zero (or small random numbers).
2️⃣ Loop through many iterations:
- Compute predictions using $\hat{y} = X\beta$.
- Calculate the loss (MSE).
- Compute gradient: $\frac{\partial J}{\partial \beta} = -\frac{2}{n}X^T(y - \hat{y})$.
- Update weights: $\beta := \beta - \eta \cdot \text{gradient}$.
- Optionally print the loss to track improvement.
3️⃣ Stop at convergence:
- The loss changes very little between iterations (or you hit a fixed iteration budget).
4️⃣ Use final $\beta$ for prediction.
This is exactly what every ML model does — even the fancy ones.
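Putting the pseudocode together, one possible from-scratch version looks like this; the class name, the synthetic data, and the hyperparameter values are illustrative choices rather than a fixed recipe.

```python
import numpy as np

class LinearRegressionScratch:
    def __init__(self, lr=0.05, n_iters=2000):
        self.lr = lr
        self.n_iters = n_iters
        self.beta = None
        self.losses = []

    @staticmethod
    def _add_bias(X):
        # Prepend a column of ones so the intercept is learned inside beta.
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_bias(X)
        n, d = Xb.shape
        self.beta = np.zeros(d)                             # 1) initialize
        for _ in range(self.n_iters):                       # 2) iterate
            y_pred = Xb.dot(self.beta)                      #    predict
            self.losses.append(np.mean((y - y_pred) ** 2))  #    track MSE
            grad = -(2 / n) * Xb.T.dot(y - y_pred)          #    gradient
            self.beta -= self.lr * grad                     #    update
        return self

    def predict(self, X):
        return self._add_bias(X).dot(self.beta)             # 3) y_hat = X @ beta


# Usage on synthetic data: y = 3 + 2*x + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = LinearRegressionScratch().fit(X, y)
print(model.beta)                           # expect roughly [3.0, 2.0]
print(model.losses[0], model.losses[-1])    # loss should drop sharply
```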
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Helps you understand what libraries do internally.
- Great for debugging your intuition about learning and convergence.
- Easy to visualize improvement over iterations.
Limitations:
- Naive Python loops are slow; prefer vectorized NumPy operations.
- Learning rate tuning is still necessary.
- Training may diverge if you forget feature scaling or use a bad initialization.
The trade-off: coding from scratch (full visibility into every step of learning) vs. full automation (libraries abstract everything away).
Mastery means knowing when to code from scratch and when to use a toolbox.
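One concrete way to weigh "from scratch vs. toolbox" is to check the gradient-descent answer against the closed-form least-squares solution that libraries compute directly; a self-contained sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # bias column + feature

# Gradient-descent answer (from scratch)
n, beta, lr = len(y), np.zeros(2), 0.05
for _ in range(2000):
    beta -= lr * (-(2 / n) * Xb.T.dot(y - Xb.dot(beta)))

# Closed-form least-squares answer (what a "toolbox" computes directly)
beta_closed, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(beta, beta_closed)                          # both roughly [3, 2]
print(np.allclose(beta, beta_closed, atol=1e-2))  # should print True
```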
🚧 Step 6: Common Misunderstandings
“It’s okay to skip feature scaling.”
Not for Gradient Descent! Different scales make convergence painfully slow.

“More iterations = better model.”
Not always. If your learning rate is too high, extra iterations just make it diverge faster.

“The loss should always decrease smoothly.”
Tiny oscillations are normal, especially with larger learning rates or noisy data.
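To make the learning-rate point concrete, here is a small illustrative experiment: the same data and loop run with two learning rates, where the larger one makes the loss explode instead of shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=100)
Xb = np.hstack([np.ones((100, 1)), X])
n = len(y)

def final_loss(lr, n_iters=200):
    beta = np.zeros(2)
    for _ in range(n_iters):
        y_pred = Xb.dot(beta)
        beta -= lr * (-(2 / n) * Xb.T.dot(y - y_pred))
    return np.mean((y - Xb.dot(beta)) ** 2)

print(final_loss(0.05))   # converges: small loss, close to the noise level
print(final_loss(0.5))    # diverges: loss blows up (may overflow to inf/nan)
```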
🧩 Step 7: Mini Summary
🧠 What You Learned: You now know how to code Linear Regression from scratch — transforming math into NumPy logic line by line.
⚙️ How It Works: Predict → measure loss → compute gradient → update weights → repeat until happy.
🎯 Why It Matters: This hands-on insight demystifies “learning” — the same principle that drives neural networks, gradient boosting, and every optimization-based ML model.