Reproducible Experiment: Linear Regression from Scratch
Overview (what you’ll learn)
- How to implement ordinary linear regression from first principles (closed-form + gradient descent) using NumPy.
- How to generate synthetic data with controlled noise and reproducible randomness.
- How to save experiment configuration & seed, and log train/val MSE.
- Best practices and subtle pitfalls for reproducible experiments (randomness sources, numeric stability, BLAS nondeterminism).
- Key math derivations and short proofs (collapsible).
Quick checklist (for interviews / hands-on)
- Fix seeds for all random sources (numpy, Python's random module, and optionally torch if used); see the seeding sketch after this checklist.
- Save experiment config (YAML/JSON) containing seed, dataset params, optimizer settings, eval split.
- Generate synthetic data deterministically from seed; save dataset or its generation parameters.
- Implement both normal equation (closed-form) and gradient descent as alternatives; validate they match (within numeric error).
- Log MSE (train & val) every epoch/step; save final model parameters and metrics as JSON/CSV.
- Record environment info (Python/Numpy versions, BLAS threads / MKL/OpenBLAS details).
- Ensure deterministic compute (set OMP_NUM_THREADS, MKL_NUM_THREADS, or mention non-determinism caveat).
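As a quick companion to the first checklist item, a minimal seeding sketch might look like the following (the seed_everything name is illustrative; the torch branch applies only if PyTorch is part of the experiment):

# Minimal sketch: seed every random source the checklist mentions.
import random
import numpy as np

def seed_everything(seed: int) -> None:
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's legacy global RNG (used by np.random.* calls)
    try:
        import torch         # optional: only if PyTorch is used in the experiment
        torch.manual_seed(seed)
    except ImportError:
        pass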
🎯 Core Idea
Implement and run a reproducible linear regression experiment (NumPy from-scratch). Reproducibility means: given the saved config + seed + code + environment, another person (or CI) can re-run and obtain the same numerical results (modulo tiny floating-point differences).
🌱 Intuition & Real-World Analogy
- Why synthetic data with controlled noise? It isolates algorithmic correctness: you control the data generation (true weights, noise level), so you can verify whether your estimator recovers the underlying parameters and how noise affects MSE.
- Analogies
  - Building a recipe and writing down the exact oven temperature and ingredient brands, so anyone following it gets the same cake: seed + config are the recipe, logging is the tasting notes.
  - Calibrating a scale with known weights: the synthetic data are the known weights and the algorithm is the scale; if it reads correctly, it is calibrated.
📐 Mathematical Foundation
Problem statement (linear regression)
Given dataset $(X, y)$ with $X \in \mathbb{R}^{n\times d}$ (design matrix) and $y\in\mathbb{R}^n$, model:
$$ \hat{y} = Xw + b \quad \text{(or augment } X \text{ with a column of ones and use } \tilde{w}). $$
We'll use the augmented form $\tilde X = [\mathbf{1}, X]$, with $\theta \in \mathbb{R}^{d+1}$:
$$ \hat{y} = \tilde X \theta. $$
Loss: Mean Squared Error (MSE)
$$ \mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - \tilde X_i \theta)^2 = \frac{1}{n} \lVert y - \tilde X \theta \rVert_2^2. $$
- Break down:
- $n$: number of samples.
- $\tilde X_i$: i-th row (1 × (d+1)).
- $\theta$: parameter vector (d+1 × 1).
Closed-form solution (Normal equation)
Minimize $\mathcal{L}$ by setting the gradient to zero:
$$ \nabla_\theta \mathcal{L} = -\frac{2}{n}\tilde X^\top (y - \tilde X \theta) = 0 \;\Rightarrow\; \tilde X^\top \tilde X \theta = \tilde X^\top y. $$
Solution:
$$ \theta^* = (\tilde X^\top \tilde X)^{-1}\tilde X^\top y. $$
- Use the pseudoinverse when $\tilde X^\top \tilde X$ is singular: $\theta^* = \tilde X^{+} y$, the minimum-norm least-squares solution, computed via SVD (e.g. np.linalg.pinv).
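To make the two forms concrete, here is a small NumPy sketch on toy data (X_aug is the bias-augmented design matrix; np.linalg.lstsq gives an equivalent SVD-based least-squares solve):

# Sketch: normal equation vs. pseudoinverse on a toy augmented design matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X_aug = np.hstack([np.ones((100, 1)), X])                 # prepend the column of ones
theta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X_aug @ theta_true + 0.1 * rng.standard_normal(100)

theta_ne = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)  # normal equation (well-conditioned case)
theta_pinv = np.linalg.pinv(X_aug) @ y                    # SVD-based, safe even if X^T X is singular
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # equivalent least-squares solve

assert np.allclose(theta_ne, theta_pinv) and np.allclose(theta_pinv, theta_lstsq)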
Gradient descent update
Gradient:
$$ \nabla_\theta \mathcal{L} = -\frac{2}{n}\tilde X^\top (y - \tilde X \theta). $$
Update (learning rate $\eta$):
$$ \theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} = \theta + \frac{2\eta}{n}\tilde X^\top (y - \tilde X \theta). $$
Batch/mini-batch variants apply similarly (replace the full gradient with a minibatch estimate).
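A bare-bones full-batch version of this update, under the same augmented-matrix convention (function and variable names are illustrative), could be:

# Sketch: full-batch gradient descent implementing the update rule above.
import numpy as np

def gd_full_batch(X_aug, y, lr=0.1, n_steps=2000):
    n, d = X_aug.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = (-2.0 / n) * X_aug.T @ (y - X_aug @ theta)  # gradient of the MSE
        theta = theta - lr * grad
    return theta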
Regularized (Ridge) regression
$$ \mathcal{L}_\lambda(\theta) = \frac{1}{n} \lVert y - \tilde X\theta \rVert^2_2 + \lambda \lVert \theta_{1:} \rVert_2^2. $$
Normal equation:
$$ \theta^* = (\tilde X^\top \tilde X + n\lambda I)^{-1} \tilde X^\top y. $$
(Usually the bias term is not regularized; in that case set the diagonal entry of $I$ at the bias index to zero.)
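A small sketch of this ridge solution that leaves the bias entry unregularized, as noted above (function name and the lam argument are illustrative):

# Sketch: ridge normal equation with an unregularized bias (first column of X_aug).
import numpy as np

def ridge_theta(X_aug, y, lam):
    n, d = X_aug.shape
    I = np.eye(d)
    I[0, 0] = 0.0  # do not penalize the bias term
    return np.linalg.solve(X_aug.T @ X_aug + n * lam * I, X_aug.T @ y)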
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Closed form: exact (up to numerical precision), no hyperparameter tuning.
- Transparent math; easy to debug and reason about.
- Quick on small-to-medium $d$ (dimensions) and $n$ (samples).
Limitations
- Scales poorly with large $d$ (cost of inversion $O(d^3)$).
- Sensitive to multicollinearity: a near-singular $\tilde X^\top\tilde X$ causes numerical instability (illustrated in the sketch below).
- Assumes a linear relationship and i.i.d. noise; Gaussian noise additionally makes OLS the maximum-likelihood estimator and justifies the usual inference.
- Deterministic reproducibility can be broken by system BLAS behavior and floating point nondeterminism.
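To illustrate the multicollinearity point, here is a toy sketch with a nearly duplicated feature column (data and names are made up for illustration):

# Sketch: near-collinear columns make X^T X ill-conditioned; pinv stays stable,
# while a plain inverse can amplify noise dramatically.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X_aug = np.column_stack([np.ones(200), x1, x1 + 1e-8 * rng.standard_normal(200)])
y = 2.0 * x1 + 0.1 * rng.standard_normal(200)

print("condition number:", np.linalg.cond(X_aug.T @ X_aug))  # astronomically large
theta_pinv = np.linalg.pinv(X_aug) @ y                        # well-behaved, minimum-norm
theta_inv = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y      # may be wildly off, or raise if detected as singular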
Trade-offs
- Closed-form vs GD: closed form is exact but expensive; GD scales to large $d$ and streaming but requires learning-rate tuning.
- Regularization trades bias for variance (reduces overfitting, helps numeric stability).
🔍 Variants & Extensions
- Ridge regression (L2): adds $\lambda$ to diagonal; improves numeric stability.
- Lasso (L1): sparsity but no closed-form; requires optimization.
- Weighted least squares: per-sample weights (heteroscedastic noise).
- Stochastic/mini-batch gradient descent: for large datasets.
- Bayesian linear regression: place priors on $\theta$ → posterior distribution.
- Polynomial / basis expansion: nonlinear features with a linear model (see the sketch after this list).
- Robust regression: Huber loss or RANSAC to handle outliers.
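For example, the polynomial / basis-expansion variant reuses the same least-squares machinery on expanded features. A sketch with a 1-D input (the quadratic ground truth is made up for illustration):

# Sketch: fit a quadratic with a *linear* model by expanding x into [1, x, x^2].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(200)

X_poly = np.vander(x, N=3, increasing=True)          # columns: 1, x, x^2
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)   # ordinary least squares on expanded basis
print(theta)                                         # approximately [1.0, -2.0, 0.5]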
🚧 Common Challenges & Pitfalls (and how to avoid them)
- Not controlling all randomness:
  - Fix seeds for numpy.random, random, and any other library used, and save the seed in the config.
  - Also make the shuffle order for the train/val split deterministic.
- BLAS / multithreading nondeterminism:
  - Different BLAS libraries or thread counts can slightly change results (order of summation). Record MKL_NUM_THREADS and OMP_NUM_THREADS, and consider setting them to 1 for strict reproducibility.
  - On CI, pin the same NumPy / BLAS versions.
- Floating-point rounding:
  - Small numeric differences are expected. Strict equality requires a fixed environment; otherwise report tolerances (e.g., rtol=1e-6).
- Singular matrix in the normal equation:
  - Use np.linalg.pinv (SVD-based) or add ridge regularization.
- Incorrect bias handling:
  - Forgetting to add the column of ones leads to a wrong intercept. Be explicit: either handle b separately or augment X.
- Data leakage between train and val:
  - Always split before standardization/normalization, or fit the scaler on the training split only and apply it to the validation split (see the sketch after this list).
- Logging insufficient info:
  - Always log config, seed(s), training curve, final params, environment (versions), and random states if possible.
- Validation metric mismatch:
  - Use the same MSE definition for train and val; log the per-sample count used.
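For the data-leakage item, a minimal sketch of train-fitted standardization (function and variable names are illustrative):

# Sketch: compute standardization statistics on the training split only,
# then apply the same statistics to the validation split.
import numpy as np

def standardize_train_val(X_train, X_val, eps=1e-12):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps guards against zero-variance features
    return (X_train - mu) / sigma, (X_val - mu) / sigma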
Practical Implementation — Minimal, reproducible NumPy script
Below is a self-contained script you can paste into a file (e.g., exp_linear_reg.py) and run. It implements:
- deterministic synthetic data generation,
- closed-form and GD solvers,
- config saved as YAML,
- train/val MSE logged to CSV and printed,
- parameters saved to a NumPy .npz file and metrics saved as JSON.
# exp_linear_reg.py
import os
import json
import random
from datetime import datetime
import argparse
import numpy as np
import yaml
import csv
import platform
import sys


def set_global_seed(seed):
    random.seed(seed)
    np.random.seed(seed)

# Environment threads: optionally set outside Python before running:
# os.environ['OMP_NUM_THREADS'] = '1'
# os.environ['MKL_NUM_THREADS'] = '1'


def gen_synthetic_linear(n_samples, n_features, noise_std, seed):
    set_global_seed(seed)
    X = np.random.randn(n_samples, n_features)
    true_w = np.random.randn(n_features)
    true_b = float(np.random.randn(1))
    noise = np.random.randn(n_samples) * noise_std
    y = X.dot(true_w) + true_b + noise
    return X, y, true_w, true_b


def augment_bias(X):
    n = X.shape[0]
    ones = np.ones((n, 1))
    return np.hstack([ones, X])


def closed_form_theta(X, y, reg=0.0):
    # X is augmented (n x (d+1))
    d = X.shape[1]
    A = X.T.dot(X)
    if reg > 0:
        A = A + reg * np.eye(d)
    # Use pinv for stability
    theta = np.linalg.pinv(A).dot(X.T).dot(y)
    return theta


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def gradient_descent(X, y, lr=0.01, n_epochs=1000, batch_size=None, verbose=False):
    n, d = X.shape
    theta = np.zeros(d)
    if batch_size is None:
        batch_size = n
    indices = np.arange(n)
    history = []
    for epoch in range(n_epochs):
        # deterministic shuffle per epoch:
        np.random.shuffle(indices)
        for start in range(0, n, batch_size):
            batch_idx = indices[start:start + batch_size]
            Xb = X[batch_idx]
            yb = y[batch_idx]
            pred = Xb.dot(theta)
            grad = (-2.0 / Xb.shape[0]) * Xb.T.dot(yb - pred)
            theta = theta - lr * grad
        # log full-batch MSE after epoch
        pred_full = X.dot(theta)
        history.append(mse(y, pred_full))
        if verbose and (epoch % (n_epochs // 5 + 1) == 0):
            print(f"Epoch {epoch}/{n_epochs} MSE={history[-1]:.6f}")
    return theta, history


def save_config(cfg, path):
    with open(path, 'w') as f:
        yaml.safe_dump(cfg, f)


def save_results(path, results):
    # results: dict -> JSON serializable
    with open(path, 'w') as f:
        json.dump(results, f, indent=2)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--n_samples', type=int, default=200)
    parser.add_argument('--n_features', type=int, default=5)
    parser.add_argument('--noise_std', type=float, default=0.5)
    parser.add_argument('--val_frac', type=float, default=0.2)
    parser.add_argument('--lr', type=float, default=0.05)
    parser.add_argument('--epochs', type=int, default=500)
    parser.add_argument('--reg', type=float, default=0.0)
    args = parser.parse_args()

    cfg = vars(args)
    timestamp = datetime.utcnow().strftime('%Y%m%dT%H%M%SZ')
    out_dir = f"results/linear_{timestamp}_seed{args.seed}"
    os.makedirs(out_dir, exist_ok=True)
    save_config(cfg, os.path.join(out_dir, 'config.yaml'))

    set_global_seed(args.seed)
    X, y, true_w, true_b = gen_synthetic_linear(args.n_samples, args.n_features, args.noise_std, args.seed)

    # split
    n_val = int(args.val_frac * args.n_samples)
    indices = np.arange(args.n_samples)
    np.random.shuffle(indices)
    val_idx = indices[:n_val]
    train_idx = indices[n_val:]
    X_train = X[train_idx]; y_train = y[train_idx]
    X_val = X[val_idx]; y_val = y[val_idx]

    X_train_aug = augment_bias(X_train)
    X_val_aug = augment_bias(X_val)

    # Closed form
    theta_cf = closed_form_theta(X_train_aug, y_train, reg=args.reg)
    pred_train_cf = X_train_aug.dot(theta_cf)
    pred_val_cf = X_val_aug.dot(theta_cf)
    train_mse_cf = mse(y_train, pred_train_cf)
    val_mse_cf = mse(y_val, pred_val_cf)

    # Gradient descent (on augmented X)
    theta_gd, history = gradient_descent(X_train_aug, y_train, lr=args.lr, n_epochs=args.epochs, verbose=False)
    pred_train_gd = X_train_aug.dot(theta_gd)
    pred_val_gd = X_val_aug.dot(theta_gd)
    train_mse_gd = mse(y_train, pred_train_gd)
    val_mse_gd = mse(y_val, pred_val_gd)

    # Save artifacts
    np.savez(os.path.join(out_dir, 'params.npz'),
             theta_cf=theta_cf, theta_gd=theta_gd,
             true_w=true_w, true_b=true_b)

    # Save metrics
    metrics = {
        'train_mse_cf': train_mse_cf,
        'val_mse_cf': val_mse_cf,
        'train_mse_gd': train_mse_gd,
        'val_mse_gd': val_mse_gd,
        'gd_history': history
    }
    save_results(os.path.join(out_dir, 'metrics.json'), metrics)

    # Save a CSV log of epoch vs MSE for GD
    csv_path = os.path.join(out_dir, 'gd_mse_log.csv')
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['epoch', 'train_mse'])
        for i, m in enumerate(history):
            writer.writerow([i, m])

    # Save environment info
    env = {
        'python': sys.version,
        'platform': platform.platform(),
        'numpy': np.__version__,
        'seed': args.seed
    }
    save_results(os.path.join(out_dir, 'env.json'), env)

    print("Results saved to:", out_dir)
    print("CF train/val MSE:", train_mse_cf, val_mse_cf)
    print("GD train/val MSE:", train_mse_gd, val_mse_gd)


if __name__ == '__main__':
    main()
How to run
python exp_linear_reg.py --seed 123 --n_samples 500 --n_features 10 --noise_std 0.2 --lr 0.01 --epochs 1000
Reproducibility details & best practices (practical)
- Seed everything
  - np.random.seed(SEED) and random.seed(SEED); save SEED to the config. If using other libraries (PyTorch, TF), set their seeds as well.
- Deterministic data generation
  - Prefer explicit generator objects (e.g., np.random.RandomState(seed)) and avoid implicit randomness inside functions.
- Save the entire experiment config
  - Include: seed, n_samples, n_features, noise_std, train/val split strategy, batch_size, lr, epochs, reg, timestamp, and git commit hash (if applicable). Save as YAML or JSON.
- Save model parameters & metrics
  - Save the final theta, true_w, true_b (for synthetic data), train_mse, val_mse, and the training curve. Use .npz and .json for structured saving.
- Log everything
  - Keep human-readable logs plus machine-readable logs (CSV/JSON). Record epoch, train_mse, val_mse, and runtime.
- Record environment
  - Python version, NumPy version, BLAS library, OS. Tip: np.__config__.show() prints build info. Save as part of env.json.
- Control BLAS threads (if strict reproducibility is required)
  - Set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1 before running; document this in the config.
- Numeric stability
  - Prefer np.linalg.pinv (SVD) over inv for the normal equation in poorly conditioned cases. Standardize features (zero mean, unit variance) for better conditioning.
- Train/val split determinism
  - Use seeded shuffling (np.random.RandomState(seed).permutation) and save the split indices to disk for exact reproduction.
- Tolerance checks
  - Use absolute/relative tolerances in tests, e.g., assert that |val_mse_run - val_mse_saved| < 1e-8 or compare with rtol=1e-6 (see the sketch after this list).
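For the tolerance-check item, a tiny regression-test sketch against the metrics.json produced by the script above (the key val_mse_cf matches the script; the tolerance values are examples):

# Sketch: assert that a re-run reproduces the saved validation MSE within tolerance.
import json
import numpy as np

def check_reproduced(metrics_path, val_mse_new, rtol=1e-6, atol=1e-8):
    with open(metrics_path) as f:
        saved = json.load(f)
    assert np.isclose(val_mse_new, saved['val_mse_cf'], rtol=rtol, atol=atol), \
        "Validation MSE drifted beyond tolerance; check seed, config, and environment."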
Interview-ready talking points (short bullets)
- Why test on synthetic data? (Control, known ground truth for debugging.)
- When to prefer closed-form vs iterative solvers (runtime & numerics).
- The role of regularization in numerical stability and bias-variance trade-off.
- Sources of non-determinism: random seeds, multithreading/BLAS, floating-point rounding.
- Practical steps to make experiments reproducible and how to document them (config, env, artifacts).
- How to validate implementation correctness: compare closed-form theta with GD result, compare recovered weights to true weights.
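For the last point, a short correctness-check sketch in the spirit of the script above (array names are illustrative; reasonable tolerances depend on the noise level and how long GD was run):

# Sketch: validate the implementation by comparing solvers and recovered weights.
import numpy as np

def validate(theta_cf, theta_gd, true_w, true_b, atol=1e-3):
    # Closed form and (converged) gradient descent should agree closely.
    assert np.allclose(theta_cf, theta_gd, atol=atol)
    # Recovered parameters should be near the ground truth, up to noise.
    theta_true = np.concatenate([[true_b], true_w])
    print("max abs error vs. ground truth:", np.max(np.abs(theta_cf - theta_true)))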
📚 Reference Pointers
- Linear regression overview: Wikipedia — Linear regression
- Matrix pseudoinverse & SVD: Wikipedia — Moore–Penrose inverse
- Numerical methods & regularization: Hastie, Tibshirani & Friedman, The Elements of Statistical Learning. (classic text) — https://web.stanford.edu/~hastie/ElemStatLearn/
- Bishop, C. M. Pattern Recognition and Machine Learning (for probabilistic interpretation). https://scholar.google.com/scholar?q=Bishop+PRML
- Reproducibility in ML (discussion & practices): “Reproducibility in Machine Learning” articles and community posts — e.g., Nature: Challenges in reproducing ML research
- For BLAS/MKL thread issues and NumPy build info: the official NumPy docs and np.__config__.show().