Linear Regression Coding Examples (Python & NumPy)

11 min read 2219 words

🧮 Linear Regression: From Scratch (Normal Equation & Gradient Descent) 🐍

ℹ️
“Implement a linear regression model with both (a) a closed-form Normal Equation solver and (b) a Gradient Descent optimizer. Provide fit, predict, and an MSE evaluation. Emphasize numerical stability and basic preprocessing (bias term, feature scaling for GD).”
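For reference, both solvers below minimize the same mean-squared-error objective. Writing X_b for X with a leading column of ones, the standard update rules are:

\theta = (X_b^\top X_b)^{-1} X_b^\top y \qquad \text{(Normal Equation)}

\theta \leftarrow \theta - \frac{\eta}{m}\, X_b^\top (X_b \theta - y) \qquad \text{(batch Gradient Descent with learning rate } \eta\text{)}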
# Linear Regression from scratch (NumPy)
# - Supports closed-form (Normal Equation) and batch Gradient Descent
# - Includes simple feature scaling for GD and an MSE evaluator

import numpy as np

class LinearRegressionScratch:
    def __init__(self, method="normal", learning_rate=1e-3, max_iter=1000, tol=1e-6, fit_intercept=True):
        """
        method: "normal" or "gd"
        learning_rate: step size for gradient descent
        max_iter: max iterations for GD
        tol: convergence tolerance for GD (based on parameter change)
        fit_intercept: whether to learn an intercept term
        """
        self.method = method
        self.lr = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.fit_intercept = fit_intercept
        self.coef_ = None  # shape (n_features,)
        self.intercept_ = 0.0
        # For GD scaling
        self.mu_ = None
        self.sigma_ = None

    def _add_bias(self, X):
        # Add a column of ones for intercept if required
        if self.fit_intercept:
            return np.hstack([np.ones((X.shape[0], 1)), X])
        return X

    def _prepare_for_gd(self, X):
        # Standardize features (important for GD convergence)
        self.mu_ = X.mean(axis=0)
        self.sigma_ = X.std(axis=0)
        # avoid divide-by-zero
        self.sigma_[self.sigma_ == 0] = 1.0
        return (X - self.mu_) / self.sigma_

    def fit(self, X, y):
        """
        Fit model to data.
        X: (m, n)
        y: (m,) or (m,1)
        """
        X = np.asarray(X)
        y = np.asarray(y).reshape(-1, 1)
        m, n = X.shape

        if self.method == "normal":
            # Closed-form solution: theta = (X^T X)^{-1} X^T y.
            # Regularization (adding lambda * I) is deferred to the Ridge section below.
            Xb = self._add_bias(X)  # (m, n+1) if fit_intercept else (m, n)
            XtX = Xb.T @ Xb  # (n+1, n+1)
            # Invert directly; fall back to the pseudo-inverse if XtX is singular.
            # (np.linalg.inv only raises for exactly singular matrices; for badly
            # conditioned problems, np.linalg.lstsq or pinv on Xb itself is safer.)
            try:
                theta = np.linalg.inv(XtX) @ Xb.T @ y
            except np.linalg.LinAlgError:
                # fallback to pseudo-inverse when XtX is singular / ill-conditioned
                theta = np.linalg.pinv(XtX) @ Xb.T @ y
            if self.fit_intercept:
                self.intercept_ = float(theta[0])
                self.coef_ = theta[1:].reshape(-1)
            else:
                self.intercept_ = 0.0
                self.coef_ = theta.reshape(-1)
            return self

        elif self.method == "gd":
            # Gradient Descent (batch)
            Xg = self._prepare_for_gd(X)  # standardized
            Xb = self._add_bias(Xg)  # bias + scaled features
            # initialize theta at zero (fine for this convex problem)
            theta = np.zeros((Xb.shape[1], 1))
            prev_theta = theta.copy()
            for it in range(self.max_iter):
                # prediction error
                preds = Xb @ theta  # (m,1)
                error = preds - y
                # gradient: (1/m) * Xb^T * error
                grad = (Xb.T @ error) / m
                theta -= self.lr * grad
                # convergence check (L2 change)
                if np.linalg.norm(theta - prev_theta) < self.tol:
                    break
                prev_theta = theta.copy()
            # unpack
            if self.fit_intercept:
                self.intercept_ = float(theta[0])
                # remember we trained on scaled features => convert back to original space:
                scaled_coef = theta[1:].reshape(-1)
                # original coef = scaled_coef / sigma
                self.coef_ = (scaled_coef / self.sigma_).reshape(-1)
                # adjust intercept to account for scaling:
                # intercept_orig = intercept_scaled - sum(mu / sigma * scaled_coef)
                self.intercept_ = float(self.intercept_ - (self.mu_ @ (scaled_coef / self.sigma_)))
            else:
                self.intercept_ = 0.0
                # Features were standardized for GD, so rescale coefficients back to
                # the original feature scale. Without an intercept, the offset from
                # mean-centering cannot be absorbed into the model.
                self.coef_ = (theta.reshape(-1) / self.sigma_)
            return self
        else:
            raise ValueError("Unknown method. Choose 'normal' or 'gd'.")

    def predict(self, X):
        X = np.asarray(X)
        return (X @ self.coef_.reshape(-1,1) + self.intercept_).reshape(-1)

    @staticmethod
    def mean_squared_error(y_true, y_pred):
        y_true = np.asarray(y_true).reshape(-1)
        y_pred = np.asarray(y_pred).reshape(-1)
        return float(((y_true - y_pred) ** 2).mean())

# Example usage:
if __name__ == "__main__":
    # small synthetic example
    np.random.seed(0)
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X[:, 0] + np.random.randn(100) * 0.5

    model_norm = LinearRegressionScratch(method="normal")
    model_norm.fit(X, y)
    preds_norm = model_norm.predict(X)
    print("Normal Eq coef:", model_norm.coef_, "intercept:", model_norm.intercept_)
    print("MSE (normal):", LinearRegressionScratch.mean_squared_error(y, preds_norm))

    model_gd = LinearRegressionScratch(method="gd", learning_rate=0.1, max_iter=2000)
    model_gd.fit(X, y)
    preds_gd = model_gd.predict(X)
    print("GD coef:", model_gd.coef_, "intercept:", model_gd.intercept_)
    print("MSE (gd):", LinearRegressionScratch.mean_squared_error(y, preds_gd))
# Scikit-learn comparison
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100) * 0.5

lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)
preds = lr.predict(X)
print("sklearn coef:", lr.coef_, "intercept:", lr.intercept_)
print("MSE (sklearn):", mean_squared_error(y, preds))
๐Ÿ” Interviewer’s Analysis
  • Time Complexity:

    • Normal Equation: O(m n^2 + n^3) where n = number of features; forming X^T X costs O(m n^2) and inverting the (n+1)x(n+1) matrix costs O(n^3).
    • Gradient Descent (batch): O(k * m * n) where k = iterations; each iteration computes gradients over the whole dataset.
  • Space Complexity: O(m n) to store data, O(n^2) temporary for X^T X if using normal equation, O(n) for parameters.

  • Trade-offs & Discussion:

    • Use Normal Equation for small n (features) and when you want an exact closed-form solution. Not suitable when n is large because of O(n^3) inversion cost and instability when X^T X is ill-conditioned.
    • GD scales to large n and m and can be used with mini-batches or stochastic variants (a minimal mini-batch sketch follows this list). It requires careful tuning of the learning rate and feature scaling.
  • Common Pitfalls:

    • Not scaling features for GD (very slow convergence or divergence).
    • Ignoring the numerical instability of matrix inversion; use a pseudo-inverse or regularization.
    • Mishandling shapes (column vs row vectors) and forgetting to add bias/intercept consistently.
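
A minimal sketch of the mini-batch variant mentioned in the trade-offs above. It reuses the same gradient as LinearRegressionScratch; the batch size, epoch count, and shuffling scheme are illustrative choices, not part of the original code.
# Mini-batch gradient descent (sketch)
# Assumes Xb already contains standardized features plus a bias column,
# exactly as prepared inside LinearRegressionScratch.fit(method="gd").
import numpy as np

def minibatch_gd(Xb, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Xb: (m, p) design matrix with bias column; y: (m, 1). Returns theta of shape (p, 1)."""
    rng = np.random.default_rng(seed)
    m, p = Xb.shape
    theta = np.zeros((p, 1))
    for _ in range(epochs):
        idx = rng.permutation(m)                      # reshuffle rows each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            error = Xb[batch] @ theta - y[batch]      # residuals on the mini-batch
            grad = Xb[batch].T @ error / len(batch)   # same gradient form as batch GD
            theta -= lr * grad
    return theta

if __name__ == "__main__":
    np.random.seed(0)
    X = 2 * np.random.rand(100, 1)
    y = (4 + 3 * X[:, 0] + np.random.randn(100) * 0.5).reshape(-1, 1)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xb = np.hstack([np.ones((X.shape[0], 1)), (X - mu) / sigma])
    print("mini-batch theta (scaled space):", minibatch_gd(Xb, y).ravel())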

🧾 Ridge Regression (L2 Regularization): Implement & Compare 🔒

ℹ️
“Implement Ridge regression closed-form solution (X^T X + λI)^{-1} X^T y. Show how regularization helps with multicollinearity and ill-conditioned X^T X. Compare to scikit-learn’s Ridge.”
# Ridge Regression (closed form) from scratch
import numpy as np

class RidgeRegressionScratch:
    def __init__(self, alpha=1.0, fit_intercept=True):
        """
        alpha: regularization strength (lambda)
        fit_intercept: whether to learn intercept
        """
        self.alpha = alpha
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = 0.0

    def _add_bias(self, X):
        if self.fit_intercept:
            return np.hstack([np.ones((X.shape[0], 1)), X])
        return X

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y).reshape(-1, 1)
        Xb = self._add_bias(X)  # (m, n+1) if intercept
        n_features = Xb.shape[1]

        # Regularization matrix: do not regularize intercept term if fit_intercept
        L = self.alpha * np.eye(n_features)
        if self.fit_intercept:
            L[0, 0] = 0.0

        XtX = Xb.T @ Xb
        A = XtX + L
        try:
            theta = np.linalg.inv(A) @ Xb.T @ y
        except np.linalg.LinAlgError:
            theta = np.linalg.pinv(A) @ Xb.T @ y

        if self.fit_intercept:
            self.intercept_ = float(theta[0])
            self.coef_ = theta[1:].reshape(-1)
        else:
            self.intercept_ = 0.0
            self.coef_ = theta.reshape(-1)
        return self

    def predict(self, X):
        return (X @ self.coef_.reshape(-1,1) + self.intercept_).reshape(-1)

# Quick demo on multicollinear data
if __name__ == "__main__":
    np.random.seed(0)
    m = 100
    x_base = np.random.randn(m, 1)
    # create highly correlated features
    X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])
    y = 2 * x_base[:,0] + np.random.randn(m) * 0.1

    ridge0 = RidgeRegressionScratch(alpha=0.0).fit(X, y)  # equivalent to OLS
    ridge1 = RidgeRegressionScratch(alpha=10.0).fit(X, y)  # regularized

    print("OLS coef:", ridge0.coef_, "intercept:", ridge0.intercept_)
    print("Ridge coef:", ridge1.coef_, "intercept:", ridge1.intercept_)
# sklearn Ridge comparison
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(0)
m = 100
x_base = np.random.randn(m, 1)
X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])
y = 2 * x_base[:,0] + np.random.randn(m) * 0.1

r = Ridge(alpha=10.0, fit_intercept=True)
r.fit(X, y)
print("sklearn Ridge coef:", r.coef_, "intercept:", r.intercept_)
๐Ÿ” Interviewer’s Analysis
  • Time Complexity: Building X^T X is O(m n^2), and solving the linear system with (X^T X + αI) is O(n^3) for direct inversion. A dedicated solver (e.g., Cholesky) is still O(n^3) but with better constants and stability.

  • Space Complexity: O(m n) dataset storage plus O(n^2) for X^T X.

  • Trade-offs & Discussion:

    • Regularization reduces variance and stabilizes inversion when features are collinear or when n is close to m.
    • alpha must be tuned, e.g., by cross-validation (see the sketch after this list). Too large -> high bias; too small -> little effect.
  • Common Pitfalls:

    • Regularizing the bias term inadvertently (a common bug); exclude the intercept from the penalty.
    • Not standardizing features when comparing penalties (though closed-form ridge can handle unscaled data, interpretation of alpha depends on feature scale).
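
A minimal sketch of tuning alpha by cross-validation, as referenced in the trade-offs above. RidgeCV inside a StandardScaler pipeline is one reasonable setup; the alpha grid here is illustrative.
# Cross-validated alpha selection for Ridge on the collinear demo data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

np.random.seed(0)
m = 100
x_base = np.random.randn(m, 1)
X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])  # highly correlated features
y = 2 * x_base[:, 0] + np.random.randn(m) * 0.1

model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5))
model.fit(X, y)
print("chosen alpha:", model.named_steps["ridgecv"].alpha_)
print("coef (in scaled-feature space):", model.named_steps["ridgecv"].coef_)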

✨ Polynomial Features & Bias-Variance Exploration: Engineering for Non-Linearity 🔍

ℹ️
“Transform features into polynomial basis (e.g., degrees 1..d), fit linear regression (use regularization optionally), and demonstrate how degree affects bias and variance. Provide code to build polynomial features and evaluate using train/validation MSE.”
# Polynomial features + linear regression (from scratch)
import numpy as np
from sklearn.model_selection import train_test_split

def polynomial_features(X, degree):
    """
    X: (m, n)
    returns: (m, n * degree) including powers x, x^2, ..., x^degree for each feature
    """
    X = np.asarray(X)
    m, n = X.shape
    # stack feature^1 .. feature^degree
    feats = [X ** d for d in range(1, degree + 1)]
    return np.hstack(feats)

# Reuse the previous LinearRegressionScratch (normal method) class definition here
# For brevity, assume we have it imported or defined in the same file as LinearRegressionScratch.

if __name__ == "__main__":
    np.random.seed(0)
    m = 200
    X = np.linspace(-3, 3, m).reshape(-1, 1)
    # true underlying non-linear function
    y = 0.5 * X[:,0]**3 - X[:,0]**2 + 2 * X[:,0] + np.random.randn(m) * 3.0

    # try degrees 1, 3, 9
    for deg in [1, 3, 9]:
        X_poly = polynomial_features(X, degree=deg)
        X_train, X_val, y_train, y_val = train_test_split(X_poly, y, test_size=0.2, random_state=42)
        model = LinearRegressionScratch(method="normal")  # closed-form on polynomial basis
        model.fit(X_train, y_train)
        train_mse = LinearRegressionScratch.mean_squared_error(y_train, model.predict(X_train))
        val_mse = LinearRegressionScratch.mean_squared_error(y_val, model.predict(X_val))
        print(f"Degree={deg}: train_mse={train_mse:.3f}, val_mse={val_mse:.3f}")
# sklearn Pipeline for polynomial + ridge & CV
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

np.random.seed(0)
m = 200
X = np.linspace(-3, 3, m).reshape(-1, 1)
y = 0.5 * X[:,0]**3 - X[:,0]**2 + 2 * X[:,0] + np.random.randn(m) * 3.0

for deg in [1, 3, 9]:
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False),
                         StandardScaler(),
                         Ridge(alpha=1.0))
    pipe.fit(X_train, y_train)
    train_mse = ((y_train - pipe.predict(X_train))**2).mean()
    val_mse = ((y_val - pipe.predict(X_val))**2).mean()
    print(f"sklearn Degree={deg}: train_mse={train_mse:.3f}, val_mse={val_mse:.3f}")
๐Ÿ” Interviewer’s Analysis
  • Time Complexity: Polynomial expansion increases the feature count: n' = n * degree for the simple per-feature powers used here (sklearn's PolynomialFeatures also adds interaction terms, so n' grows combinatorially with degree). The closed-form solve then costs O(n'^3), so the feature count, and therefore the cost, grows quickly with degree.

  • Space Complexity: O(m n') to store expanded features.

  • Trade-offs & Discussion:

    • A polynomial basis allows linear models to capture non-linear relationships but increases variance. Regularization (Ridge/Lasso) and cross-validation are needed to control overfitting (a degree-selection sketch follows this list).
    • Use domain knowledge to limit degree or consider kernel methods (e.g., kernel ridge) or tree-based models for complex non-linearities.
  • Common Pitfalls:

    • Not centering/scaling polynomial features; high-degree terms can dominate numerically.
    • Ignoring interaction terms or multicollinearity introduced by polynomial powers.
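
One way to choose the polynomial degree (and ridge strength) by cross-validation, as suggested in the trade-offs above; the parameter grid is an illustrative choice.
# Cross-validated selection of polynomial degree and ridge alpha
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
m = 200
X = np.linspace(-3, 3, m).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + 2 * X[:, 0] + np.random.randn(m) * 3.0

pipe = make_pipeline(PolynomialFeatures(include_bias=False), StandardScaler(), Ridge())
grid = GridSearchCV(pipe,
                    param_grid={"polynomialfeatures__degree": [1, 2, 3, 5, 9],
                                "ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV MSE:", -grid.best_score_)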

🔧 Robust Regression (Outlier Resistance): Huber Loss vs RANSAC 🛡️

ℹ️
“Implement Huber regression optimization (using gradient descent) to show robustness to outliers. Provide a short comparison with RANSAC (sklearn) and discuss when each is preferable.”
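For reference, the per-residual Huber loss and the derivative that the gradient code below implements (standard definitions, with r a residual and \delta the threshold):

L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \tfrac{1}{2} \delta \right) & \text{otherwise} \end{cases}
\qquad
\frac{dL_\delta}{dr} = \begin{cases} r & \text{if } |r| \le \delta \\ \delta \cdot \operatorname{sign}(r) & \text{otherwise} \end{cases}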
# Huber regression via gradient descent (convex, piecewise loss)
import numpy as np

class HuberRegressionGD:
    def __init__(self, delta=1.0, lr=1e-3, max_iter=2000, tol=1e-6, fit_intercept=True):
        self.delta = delta  # threshold between quadratic and linear
        self.lr = lr
        self.max_iter = max_iter
        self.tol = tol
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = 0.0

    def _add_bias(self, X):
        if self.fit_intercept:
            return np.hstack([np.ones((X.shape[0], 1)), X])
        return X

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y).reshape(-1,1)
        Xb = self._add_bias(X)
        m, p = Xb.shape
        theta = np.zeros((p,1))

        for it in range(self.max_iter):
            preds = Xb @ theta
            resid = preds - y  # (m,1)
            abs_r = np.abs(resid)
            # Huber gradient: if |r| <= delta -> r ; else -> delta * sign(r)
            mask = abs_r <= self.delta
            # derivative w.r.t residual
            dL_dr = np.where(mask, resid, self.delta * np.sign(resid))
            # gradient w.r.t theta: Xb^T * dL_dr / m
            grad = (Xb.T @ dL_dr) / m
            theta_new = theta - self.lr * grad
            if np.linalg.norm(theta_new - theta) < self.tol:
                theta = theta_new
                break
            theta = theta_new
        if self.fit_intercept:
            self.intercept_ = float(theta[0])
            self.coef_ = theta[1:].reshape(-1)
        else:
            self.intercept_ = 0.0
            self.coef_ = theta.reshape(-1)
        return self

    def predict(self, X):
        return (X @ self.coef_.reshape(-1,1) + self.intercept_).reshape(-1)

# Demo with outliers
if __name__ == "__main__":
    np.random.seed(0)
    m = 100
    X = 2 * np.random.rand(m,1) - 1
    y = 3 * X[:,0] + 0.1 * np.random.randn(m)
    # add large outliers
    y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0])

    # OLS baseline (uses the LinearRegressionScratch class defined earlier on this page)
    ols = LinearRegressionScratch(method="normal").fit(X, y)
    huber = HuberRegressionGD(delta=1.0, lr=0.1, max_iter=5000).fit(X, y)

    print("OLS coef:", ols.coef_, "intercept:", ols.intercept_)
    print("Huber coef:", huber.coef_, "intercept:", huber.intercept_)
# RANSAC using sklearn for robust linear regression
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)
m = 100
X = 2 * np.random.rand(m,1) - 1
y = 3 * X[:,0] + 0.1 * np.random.randn(m)
y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0])  # outliers

base_estimator = LinearRegression()
# NOTE: the keyword is "estimator" in scikit-learn >= 1.1 (it was "base_estimator" in older versions)
ransac = RANSACRegressor(estimator=base_estimator, min_samples=0.5, residual_threshold=2.0, random_state=0)
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
print("RANSAC inliers:", int(inlier_mask.sum()), "of", m)
print("RANSAC coef:", ransac.estimator_.coef_, "intercept:", ransac.estimator_.intercept_)

# Evaluate
ols = LinearRegression().fit(X, y)
print("OLS MSE:", mean_squared_error(y, ols.predict(X)))
print("RANSAC MSE:", mean_squared_error(y, ransac.predict(X)))
๐Ÿ” Interviewer’s Analysis
  • Time Complexity:

    • Huber GD: O(k * m * n) for k iterations, similar to GD.
    • RANSAC: roughly O(t * (s n^2 + n^3 + m n)) where t = number of random trials and s = sample size per trial; each trial fits a model on s points and then scores all m points. Worst case can be expensive, but it is often fast for small models.
  • Space Complexity: O(m n) for data; model parameters O(n).

  • Trade-offs & Discussion:

    • Huber loss provides a smooth transition between MSE (quadratic) for small residuals and MAE (linear) for large residuals; it is robust yet convex, so it can be optimized with GD (a HuberRegressor cross-check follows this list). It works well when outliers exist but are not too numerous.
    • RANSAC is non-deterministic and finds a model that fits a consensus set (inliers). It can tolerate a large fraction of outliers but may fail if inliers are not well-represented or if the ratio of outliers is high.
  • Common Pitfalls:

    • For Huber: choosing delta poorly; too small behaves like MAE (less sensitive to small errors), too large reverts to MSE. Learning-rate tuning is also required.
    • For RANSAC: poor choices of residual_threshold or min_samples lead to poor inlier selection; randomness means repeatability requires setting random_state.
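
As a cross-check on the from-scratch Huber fit referenced in the trade-offs above, scikit-learn's HuberRegressor can be run on the same outlier data. Its epsilon plays a role similar to delta but is applied to scale-normalized residuals (the estimator also fits a scale parameter), so coefficients will only roughly match the hand-rolled version.
# sklearn HuberRegressor on the same outlier data
import numpy as np
from sklearn.linear_model import HuberRegressor

np.random.seed(0)
m = 100
X = 2 * np.random.rand(m, 1) - 1
y = 3 * X[:, 0] + 0.1 * np.random.randn(m)
y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0])  # same outliers as above

huber_sk = HuberRegressor(epsilon=1.35, alpha=0.0, max_iter=1000)  # alpha is an L2 penalty, disabled here
huber_sk.fit(X, y)
print("sklearn Huber coef:", huber_sk.coef_, "intercept:", huber_sk.intercept_)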