Linear Regression Coding Examples (Python & NumPy)
🧮 Linear Regression – From Scratch (Normal Equation & Gradient Descent)
Prompt: “Implement linear regression from scratch with fit, predict, and an MSE evaluation. Emphasize numerical stability and basic preprocessing (bias term, feature scaling for GD).”
# Linear Regression from scratch (NumPy)
# - Supports closed-form (Normal Equation) and batch Gradient Descent
# - Includes simple feature scaling for GD and an MSE evaluator
import numpy as np
class LinearRegressionScratch:
def __init__(self, method="normal", learning_rate=1e-3, max_iter=1000, tol=1e-6, fit_intercept=True):
"""
method: "normal" or "gd"
learning_rate: step size for gradient descent
max_iter: max iterations for GD
tol: convergence tolerance for GD (based on parameter change)
fit_intercept: whether to learn an intercept term
"""
self.method = method
self.lr = learning_rate
self.max_iter = max_iter
self.tol = tol
self.fit_intercept = fit_intercept
self.coef_ = None # shape (n_features,)
self.intercept_ = 0.0
# For GD scaling
self.mu_ = None
self.sigma_ = None
def _add_bias(self, X):
# Add a column of ones for intercept if required
if self.fit_intercept:
return np.hstack([np.ones((X.shape[0], 1)), X])
return X
def _prepare_for_gd(self, X):
# Standardize features (important for GD convergence)
        # Center only when an intercept is fitted; without an intercept
        # there is no term to absorb the mean shift.
        self.mu_ = X.mean(axis=0) if self.fit_intercept else np.zeros(X.shape[1])
        self.sigma_ = X.std(axis=0)
        # avoid divide-by-zero for constant features
        self.sigma_[self.sigma_ == 0] = 1.0
return (X - self.mu_) / self.sigma_
def fit(self, X, y):
"""
Fit model to data.
X: (m, n)
y: (m,) or (m,1)
"""
X = np.asarray(X)
y = np.asarray(y).reshape(-1, 1)
m, n = X.shape
if self.method == "normal":
            # Closed-form solution via the normal equation: solve (X^T X) theta = X^T y.
            # No regularization here (lambda = 0); see the Ridge section below for (X^T X + lambda*I).
            Xb = self._add_bias(X)   # (m, n+1)
            XtX = Xb.T @ Xb          # (n+1, n+1)
            try:
                theta = np.linalg.solve(XtX, Xb.T @ y)
            except np.linalg.LinAlgError:
                # fall back to the pseudo-inverse when XtX is singular / ill-conditioned
                theta = np.linalg.pinv(XtX) @ Xb.T @ y
if self.fit_intercept:
self.intercept_ = float(theta[0])
self.coef_ = theta[1:].reshape(-1)
else:
self.intercept_ = 0.0
self.coef_ = theta.reshape(-1)
return self
elif self.method == "gd":
# Gradient Descent (batch)
Xg = self._prepare_for_gd(X) # standardized
Xb = self._add_bias(Xg) # bias + scaled features
# initialize theta small
theta = np.zeros((Xb.shape[1], 1))
prev_theta = theta.copy()
for it in range(self.max_iter):
# prediction error
preds = Xb @ theta # (m,1)
error = preds - y
# gradient: (1/m) * Xb^T * error
grad = (Xb.T @ error) / m
theta -= self.lr * grad
# convergence check (L2 change)
if np.linalg.norm(theta - prev_theta) < self.tol:
break
prev_theta = theta.copy()
# unpack
if self.fit_intercept:
self.intercept_ = float(theta[0])
# remember we trained on scaled features => convert back to original space:
scaled_coef = theta[1:].reshape(-1)
# original coef = scaled_coef / sigma
self.coef_ = (scaled_coef / self.sigma_).reshape(-1)
# adjust intercept to account for scaling:
# intercept_orig = intercept_scaled - sum(mu / sigma * scaled_coef)
self.intercept_ = float(self.intercept_ - (self.mu_ @ (scaled_coef / self.sigma_)))
            else:
                self.intercept_ = 0.0
                # theta was learned on scaled features; map back to the original scale
                self.coef_ = (theta.reshape(-1) / self.sigma_)
return self
else:
raise ValueError("Unknown method. Choose 'normal' or 'gd'.")
def predict(self, X):
X = np.asarray(X)
return (X @ self.coef_.reshape(-1,1) + self.intercept_).reshape(-1)
@staticmethod
def mean_squared_error(y_true, y_pred):
y_true = np.asarray(y_true).reshape(-1)
y_pred = np.asarray(y_pred).reshape(-1)
return float(((y_true - y_pred) ** 2).mean())
# Example usage:
if __name__ == "__main__":
# small synthetic example
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100) * 0.5
model_norm = LinearRegressionScratch(method="normal")
model_norm.fit(X, y)
preds_norm = model_norm.predict(X)
print("Normal Eq coef:", model_norm.coef_, "intercept:", model_norm.intercept_)
print("MSE (normal):", LinearRegressionScratch.mean_squared_error(y, preds_norm))
model_gd = LinearRegressionScratch(method="gd", learning_rate=0.1, max_iter=2000)
model_gd.fit(X, y)
preds_gd = model_gd.predict(X)
print("GD coef:", model_gd.coef_, "intercept:", model_gd.intercept_)
print("MSE (gd):", LinearRegressionScratch.mean_squared_error(y, preds_gd))
# Scikit-learn comparison
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100) * 0.5
lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)
preds = lr.predict(X)
print("sklearn coef:", lr.coef_, "intercept:", lr.intercept_)
print("MSE (sklearn):", mean_squared_error(y, preds))
📊 Interviewer’s Analysis
- Time Complexity:
  - Normal Equation: O(n^3 + m n^2), dominated by inverting an (n+1)x(n+1) matrix, where n = number of features. More precisely, forming X^T X costs O(m n^2) and the inversion costs O(n^3).
  - Gradient Descent (batch): O(k * m * n), where k = number of iterations; each iteration computes gradients over the whole dataset.
- Space Complexity: O(m n) to store the data, O(n^2) of temporary storage for X^T X when using the normal equation, and O(n) for the parameters.
- Trade-offs & Discussion:
  - Use the Normal Equation for small n (few features) and when you want an exact closed-form solution. It is not suitable when n is large, because of the O(n^3) inversion cost and instability when X^T X is ill-conditioned.
  - GD scales to large n and m and can be used with mini-batch or stochastic variants (a minimal mini-batch sketch follows this list). It requires careful tuning of the learning rate and feature scaling.
- Common Pitfalls:
  - Not scaling features for GD (very slow convergence or divergence).
  - Ignoring the numerical instability of matrix inversion: use the pseudo-inverse or regularization.
  - Mishandling shapes (column vs. row vectors) and forgetting to add the bias/intercept term consistently.
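A minimal mini-batch gradient-descent sketch, as referenced in the trade-off bullet above. The helper name minibatch_gd_linear, the batch size, and the learning rate are illustrative choices, not part of the class implemented earlier; the sketch assumes a bias column has already been appended and the features are scaled.
# Illustrative mini-batch gradient descent for linear regression (MSE cost).
import numpy as np

def minibatch_gd_linear(Xb, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Return theta fitted by mini-batch GD; Xb must already include a bias column."""
    rng = np.random.default_rng(seed)
    m, p = Xb.shape
    theta = np.zeros((p, 1))
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        idx = rng.permutation(m)                    # reshuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            error = Xb[batch] @ theta - y[batch]    # (b, 1) residuals
            grad = Xb[batch].T @ error / len(batch)
            theta -= lr * grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=200)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column
    theta = minibatch_gd_linear(Xb, y)
    print("mini-batch GD theta:", theta.ravel())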
🧾 Ridge Regression (L2 Regularization) – Implement & Compare
Prompt: “Implement ridge regression using the closed-form solution θ = (X^T X + λI)^{-1} X^T y. Show how regularization helps with multicollinearity and ill-conditioned X^T X. Compare to scikit-learn’s Ridge.”
# Ridge Regression (closed form) from scratch
import numpy as np
class RidgeRegressionScratch:
def __init__(self, alpha=1.0, fit_intercept=True):
"""
alpha: regularization strength (lambda)
fit_intercept: whether to learn intercept
"""
self.alpha = alpha
self.fit_intercept = fit_intercept
self.coef_ = None
self.intercept_ = 0.0
def _add_bias(self, X):
if self.fit_intercept:
return np.hstack([np.ones((X.shape[0], 1)), X])
return X
def fit(self, X, y):
X = np.asarray(X)
y = np.asarray(y).reshape(-1, 1)
Xb = self._add_bias(X) # (m, n+1) if intercept
n_features = Xb.shape[1]
# Regularization matrix: do not regularize intercept term if fit_intercept
L = self.alpha * np.eye(n_features)
if self.fit_intercept:
L[0, 0] = 0.0
XtX = Xb.T @ Xb
A = XtX + L
try:
theta = np.linalg.inv(A) @ Xb.T @ y
except np.linalg.LinAlgError:
theta = np.linalg.pinv(A) @ Xb.T @ y
if self.fit_intercept:
self.intercept_ = float(theta[0])
self.coef_ = theta[1:].reshape(-1)
else:
self.intercept_ = 0.0
self.coef_ = theta.reshape(-1)
return self
    def predict(self, X):
        X = np.asarray(X)
        return (X @ self.coef_.reshape(-1, 1) + self.intercept_).reshape(-1)
# Quick demo on multicollinear data
if __name__ == "__main__":
np.random.seed(0)
m = 100
x_base = np.random.randn(m, 1)
# create highly correlated features
X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])
y = 2 * x_base[:,0] + np.random.randn(m) * 0.1
ridge0 = RidgeRegressionScratch(alpha=0.0).fit(X, y) # equivalent to OLS
ridge1 = RidgeRegressionScratch(alpha=10.0).fit(X, y) # regularized
print("OLS coef:", ridge0.coef_, "intercept:", ridge0.intercept_)
print("Ridge coef:", ridge1.coef_, "intercept:", ridge1.intercept_)
# sklearn Ridge comparison
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(0)
m = 100
x_base = np.random.randn(m, 1)
X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])
y = 2 * x_base[:,0] + np.random.randn(m) * 0.1
r = Ridge(alpha=10.0, fit_intercept=True)
r.fit(X, y)
print("sklearn Ridge coef:", r.coef_, "intercept:", r.intercept_)
📊 Interviewer’s Analysis
- Time Complexity: Building X^T X is O(m n^2), and solving the linear system (X^T X + αI) θ = X^T y is O(n^3) with direct inversion. Dedicated linear solvers (e.g., Cholesky) are still O(n^3) but with better constants.
- Space Complexity: O(m n) for dataset storage plus O(n^2) for X^T X.
- Trade-offs & Discussion:
  - Regularization reduces variance and stabilizes the inversion when features are collinear or when n is close to m.
  - alpha must be tuned, typically by cross-validation (see the RidgeCV sketch after this list): too large -> high bias; too small -> little effect.
- Common Pitfalls:
  - Regularizing the bias term inadvertently (a common bug): exclude the intercept from the penalty.
  - Not standardizing features when comparing penalties (closed-form ridge can handle unscaled data, but the interpretation of alpha depends on feature scale).
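A minimal sketch of the cross-validated alpha tuning mentioned above, using scikit-learn’s RidgeCV; the alpha grid and the 5-fold CV are arbitrary illustrative choices.
# Sketch: choosing alpha by cross-validation with RidgeCV on the multicollinear demo data.
import numpy as np
from sklearn.linear_model import RidgeCV

np.random.seed(0)
m = 100
x_base = np.random.randn(m, 1)
X = np.hstack([x_base, x_base + 1e-3 * np.random.randn(m, 1)])
y = 2 * x_base[:, 0] + np.random.randn(m) * 0.1

ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)
ridge_cv.fit(X, y)
print("best alpha:", ridge_cv.alpha_)
print("coef:", ridge_cv.coef_, "intercept:", ridge_cv.intercept_)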
✨ Polynomial Features & Bias-Variance Exploration – Engineering for Non-Linearity
# Polynomial features + linear regression (from scratch)
import numpy as np
from sklearn.model_selection import train_test_split
def polynomial_features(X, degree):
"""
X: (m, n)
returns: (m, n * degree) including powers x, x^2, ..., x^degree for each feature
"""
X = np.asarray(X)
m, n = X.shape
# stack feature^1 .. feature^degree
feats = [X ** d for d in range(1, degree + 1)]
return np.hstack(feats)
# Reuse the previous LinearRegressionScratch (normal method) class definition here
# For brevity, assume we have it imported or defined in the same file as LinearRegressionScratch.
if __name__ == "__main__":
np.random.seed(0)
m = 200
X = np.linspace(-3, 3, m).reshape(-1, 1)
# true underlying non-linear function
y = 0.5 * X[:,0]**3 - X[:,0]**2 + 2 * X[:,0] + np.random.randn(m) * 3.0
# try degrees 1, 3, 9
for deg in [1, 3, 9]:
X_poly = polynomial_features(X, degree=deg)
X_train, X_val, y_train, y_val = train_test_split(X_poly, y, test_size=0.2, random_state=42)
model = LinearRegressionScratch(method="normal") # closed-form on polynomial basis
model.fit(X_train, y_train)
train_mse = LinearRegressionScratch.mean_squared_error(y_train, model.predict(X_train))
val_mse = LinearRegressionScratch.mean_squared_error(y_val, model.predict(X_val))
print(f"Degree={deg}: train_mse={train_mse:.3f}, val_mse={val_mse:.3f}")
# sklearn Pipeline for polynomial + ridge & CV
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
np.random.seed(0)
m = 200
X = np.linspace(-3, 3, m).reshape(-1, 1)
y = 0.5 * X[:,0]**3 - X[:,0]**2 + 2 * X[:,0] + np.random.randn(m) * 3.0
for deg in [1, 3, 9]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False),
StandardScaler(),
Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
train_mse = ((y_train - pipe.predict(X_train))**2).mean()
val_mse = ((y_val - pipe.predict(X_val))**2).mean()
print(f"sklearn Degree={deg}: train_mse={train_mse:.3f}, val_mse={val_mse:.3f}")
📊 Interviewer’s Analysis
- Time Complexity: Polynomial expansion increases the feature count to n' = n * degree (for simple per-feature powers). The closed-form solution then costs O(n'^3). Be mindful: the feature count grows quickly with the degree and can cause a dimensional explosion.
- Space Complexity: O(m n') to store the expanded features.
- Trade-offs & Discussion:
  - A polynomial basis lets linear models capture non-linear relationships but increases variance. Regularization (Ridge/Lasso) and cross-validation are necessary to control overfitting (a GridSearchCV sketch follows this list).
  - Use domain knowledge to limit the degree, or consider kernel methods (e.g., kernel ridge) or tree-based models for complex non-linearities.
- Common Pitfalls:
  - Not centering/scaling polynomial features: high-degree terms can dominate numerically.
  - Ignoring interaction terms or the multicollinearity introduced by polynomial powers.
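The GridSearchCV sketch referenced in the list above: it selects the polynomial degree and the ridge alpha jointly by cross-validation. The parameter grid and CV settings are illustrative assumptions, not recommendations.
# Sketch: joint selection of polynomial degree and ridge alpha via GridSearchCV.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
m = 200
X = np.linspace(-3, 3, m).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + 2 * X[:, 0] + np.random.randn(m) * 3.0

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {"poly__degree": [1, 2, 3, 5, 9], "ridge__alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV MSE:", -search.best_score_)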
🧠 Robust Regression (Outlier Resistance) – Huber Loss vs RANSAC 🛡️
# Huber regression via gradient descent (convex, piecewise loss)
import numpy as np
class HuberRegressionGD:
def __init__(self, delta=1.0, lr=1e-3, max_iter=2000, tol=1e-6, fit_intercept=True):
self.delta = delta # threshold between quadratic and linear
self.lr = lr
self.max_iter = max_iter
self.tol = tol
self.fit_intercept = fit_intercept
self.coef_ = None
self.intercept_ = 0.0
def _add_bias(self, X):
if self.fit_intercept:
return np.hstack([np.ones((X.shape[0], 1)), X])
return X
def fit(self, X, y):
X = np.asarray(X)
y = np.asarray(y).reshape(-1,1)
Xb = self._add_bias(X)
m, p = Xb.shape
theta = np.zeros((p,1))
for it in range(self.max_iter):
preds = Xb @ theta
resid = preds - y # (m,1)
abs_r = np.abs(resid)
# Huber gradient: if |r| <= delta -> r ; else -> delta * sign(r)
mask = abs_r <= self.delta
# derivative w.r.t residual
dL_dr = np.where(mask, resid, self.delta * np.sign(resid))
# gradient w.r.t theta: Xb^T * dL_dr / m
grad = (Xb.T @ dL_dr) / m
theta_new = theta - self.lr * grad
if np.linalg.norm(theta_new - theta) < self.tol:
theta = theta_new
break
theta = theta_new
if self.fit_intercept:
self.intercept_ = float(theta[0])
self.coef_ = theta[1:].reshape(-1)
else:
self.intercept_ = 0.0
self.coef_ = theta.reshape(-1)
return self
    def predict(self, X):
        X = np.asarray(X)
        return (X @ self.coef_.reshape(-1, 1) + self.intercept_).reshape(-1)
# Demo with outliers
if __name__ == "__main__":
np.random.seed(0)
m = 100
X = 2 * np.random.rand(m,1) - 1
y = 3 * X[:,0] + 0.1 * np.random.randn(m)
# add large outliers
y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0])
    # OLS baseline (reuses LinearRegressionScratch defined earlier in this file)
    ols = LinearRegressionScratch(method="normal").fit(X, y)
huber = HuberRegressionGD(delta=1.0, lr=0.1, max_iter=5000).fit(X, y)
print("OLS coef:", ols.coef_, "intercept:", ols.intercept_)
print("Huber coef:", huber.coef_, "intercept:", huber.intercept_)
# RANSAC using sklearn for robust linear regression
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression
from sklearn.metrics import mean_squared_error
np.random.seed(0)
m = 100
X = 2 * np.random.rand(m,1) - 1
y = 3 * X[:,0] + 0.1 * np.random.randn(m)
y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0]) # outliers
base_estimator = LinearRegression()
# NOTE: recent scikit-learn uses the `estimator` keyword (older releases called it `base_estimator`)
ransac = RANSACRegressor(estimator=base_estimator, min_samples=0.5, residual_threshold=2.0, random_state=0)
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
print("RANSAC coef:", ransac.estimator_.coef_, "intercept:", ransac.estimator_.intercept_)
# Evaluate
ols = LinearRegression().fit(X, y)
print("OLS MSE:", mean_squared_error(y, ols.predict(X)))
print("RANSAC MSE:", mean_squared_error(y, ransac.predict(X)))
📊 Interviewer’s Analysis
- Time Complexity:
  - Huber GD: O(k * m * n) for k iterations, similar to batch gradient descent.
  - RANSAC: roughly t trials, each fitting the base model on s samples and scoring all m points, where t = number of random trials and s = sample size per trial; the worst case can be expensive, but it is often fast for small models.
- Space Complexity: O(m n) for the data; O(n) for the model parameters.
- Trade-offs & Discussion:
  - The Huber loss transitions smoothly between MSE (quadratic) for small residuals and MAE (linear) for large residuals: it is robust yet convex, and therefore optimizable with GD. It works well when outliers exist but are not too numerous.
  - RANSAC is non-deterministic and finds a model that fits a consensus set (the inliers). It can tolerate a large fraction of outliers, but may fail if the inliers are not well represented or the outlier ratio is too high.
- Common Pitfalls:
  - For Huber: choosing delta poorly (too small behaves like MAE and becomes less sensitive to small errors; too large reverts to MSE). Learning-rate tuning is still required. (A scikit-learn HuberRegressor sketch follows this list.)
  - For RANSAC: poor choices of residual_threshold or min_samples lead to poor inlier selection; because of the randomness, reproducibility requires setting random_state.
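As a follow-up to the Huber delta pitfall above, scikit-learn’s HuberRegressor is a convenient baseline; note that its epsilon parameter plays a role similar to delta but is applied to scale-adjusted residuals, so the two values are not directly interchangeable. A minimal sketch on the same outlier data:
# Sketch: robust fit with scikit-learn's HuberRegressor vs plain OLS on outlier-contaminated data.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

np.random.seed(0)
m = 100
X = 2 * np.random.rand(m, 1) - 1
y = 3 * X[:, 0] + 0.1 * np.random.randn(m)
y[[5, 10, 20]] += np.array([20.0, -15.0, 30.0])  # inject large outliers

huber = HuberRegressor(epsilon=1.35).fit(X, y)
ols = LinearRegression().fit(X, y)
print("sklearn Huber coef:", huber.coef_, "intercept:", huber.intercept_)
print("OLS coef:          ", ols.coef_, "intercept:", ols.intercept_)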