Reproducible Experiment: Linear Regression from Scratch


Overview (what you’ll learn)

  • How to implement ordinary linear regression from first principles (closed-form + gradient descent) using NumPy.
  • How to generate synthetic data with controlled noise and reproducible randomness.
  • How to save experiment configuration & seed, and log train/val MSE.
  • Best practices and subtle pitfalls for reproducible experiments (randomness sources, numeric stability, BLAS nondeterminism).
  • Key math derivations and short proofs (collapsible).

Quick checklist (for interviews / hands-on)

  1. Fix seeds for all random sources (numpy, python random, optionally torch if used).
  2. Save experiment config (YAML/JSON) containing the seed, dataset params, optimizer settings, and eval split (see the example config after this list).
  3. Generate synthetic data deterministically from seed; save dataset or its generation parameters.
  4. Implement both normal equation (closed-form) and gradient descent as alternatives; validate they match (within numeric error).
  5. Log MSE (train & val) every epoch/step; save final model parameters and metrics as JSON/CSV.
  6. Record environment info (Python/Numpy versions, BLAS threads / MKL/OpenBLAS details).
  7. Ensure deterministic compute (set OMP_NUM_THREADS, MKL_NUM_THREADS, or mention non-determinism caveat).
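
As a concrete illustration of items 1-3, here is roughly what the saved config file looks like when the script later in this post is run with its defaults (field names follow that script's argparse options):

# config.yaml (written by save_config; values shown are the script defaults)
epochs: 500
lr: 0.05
n_features: 5
n_samples: 200
noise_std: 0.5
reg: 0.0
seed: 42
val_frac: 0.2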

🎯 Core Idea

Implement and run a reproducible linear regression experiment (NumPy from-scratch). Reproducibility means: given the saved config + seed + code + environment, another person (or CI) can re-run and obtain the same numerical results (modulo tiny floating-point differences).


🌱 Intuition & Real-World Analogy

  • Why synthetic data with controlled noise? It isolates algorithmic correctness: you control the data generation (true weights, noise level) so you can verify whether your estimator recovers the underlying parameters and how noise affects MSE.

  • Analogies

    1. Writing a recipe that records the exact oven temperature and ingredient brands, so anyone following it gets the same cake. Seed + config = the recipe; logging = the tasting notes.
    2. Calibrating a scale with known weights: the synthetic data are the known weights and the algorithm is the scale; if it reads correctly, it is calibrated.

📐 Mathematical Foundation

Problem statement (linear regression)

Given dataset $(X, y)$ with $X \in \mathbb{R}^{n\times d}$ (design matrix) and $y\in\mathbb{R}^n$, model:

$$ \hat{y} = Xw + b \quad \text{(or augment } X \text{ with a column of ones and use } \tilde{w}\text{)}. $$

We’ll use the augmented form: $\tilde X = [\mathbf{1}, X]$, and $\theta \in \mathbb{R}^{d+1}$:

$$ \hat{y} = \tilde X \theta. $$

Loss: Mean Squared Error (MSE)

$$ \mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - \tilde X_i \theta)^2 = \frac{1}{n} \lVert y - \tilde X \theta \rVert_2^2. $$
  • Break down:

    • $n$: number of samples.
    • $\tilde X_i$: i-th row (1 × (d+1)).
    • $\theta$: parameter vector (d+1 × 1).
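
To make the augmented design matrix and the MSE definition concrete, here is a minimal NumPy sketch (shapes and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)                    # seeded generator
X = rng.normal(size=(100, 3))                     # n = 100 samples, d = 3 features
y = rng.normal(size=100)
theta = rng.normal(size=4)                        # d + 1 parameters, bias first

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # tilde X = [1, X]
y_hat = X_aug @ theta                             # predictions
mse = np.mean((y - y_hat) ** 2)                   # (1/n) ||y - tilde X theta||^2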

Closed-form solution (Normal equation)

Minimize $\mathcal{L}$: set gradient to zero:

$$ \nabla_\theta \mathcal{L} = -\frac{2}{n}\tilde X^\top (y - \tilde X \theta) = 0 \Rightarrow \tilde X^\top \tilde X \theta = \tilde X^\top y. $$

Solution:

$$ \theta^* = (\tilde X^\top \tilde X)^{-1}\tilde X^\top y. $$
  • Use pseudoinverse when $\tilde X^\top \tilde X$ is singular:
$$ \theta^* = \tilde X^+ y \quad (\tilde X^+ = \text{Moore–Penrose pseudoinverse}). $$
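
A sketch of the closed-form solve in NumPy; np.linalg.pinv computes the Moore-Penrose pseudoinverse via the SVD, and np.linalg.lstsq gives an equivalent minimum-norm least-squares solution directly from the augmented matrix:

import numpy as np

def fit_normal_equation(X_aug, y):
    # theta* = (X^T X)^+ X^T y; the pseudoinverse is well-defined even when
    # X_aug^T X_aug is singular (collinear or duplicated features).
    return np.linalg.pinv(X_aug.T @ X_aug) @ (X_aug.T @ y)

def fit_lstsq(X_aug, y):
    # Same minimizer, solved directly from X_aug (numerically preferable).
    theta, residuals, rank, svals = np.linalg.lstsq(X_aug, y, rcond=None)
    return theta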

Gradient descent update

Gradient:

$$ \nabla_\theta \mathcal{L} = -\frac{2}{n}\tilde X^\top (y - \tilde X \theta). $$

Update (learning rate $\eta$):

$$ \theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} = \theta + \frac{2\eta}{n}\tilde X^\top (y - \tilde X \theta). $$

Batch/mini-batch variants apply similarly: replace the full-batch gradient with a mini-batch estimate.
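
As a sketch, one full-batch gradient step in NumPy (X_aug, y, theta, and a chosen learning rate lr are assumed to be defined, e.g., as in the earlier sketch; the complete mini-batch loop appears in the script below):

residual = y - X_aug @ theta                   # y - tilde X theta, shape (n,)
grad = (-2.0 / len(y)) * (X_aug.T @ residual)  # gradient of the MSE
theta = theta - lr * grad                      # theta <- theta - eta * grad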

Regularized (Ridge) regression

$$ \mathcal{L}_\lambda(\theta) = \frac{1}{n} \lVert y - \tilde X\theta \rVert^2_2 + \lambda \lVert \theta_{1:} \rVert_2^2 $$

(normal equation)

$$ \theta^* = (\tilde X^\top \tilde X + n\lambda I)^{-1} \tilde X^\top y. $$

(The bias term is usually not regularized; in that case zero out the diagonal entry of $I$ corresponding to the bias, as in the sketch below.)
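
A minimal sketch of the ridge normal equation with the bias left unpenalized (assuming the bias is the first column of the augmented matrix):

import numpy as np

def fit_ridge(X_aug, y, lam):
    n, d1 = X_aug.shape
    penalty = np.eye(d1)
    penalty[0, 0] = 0.0                          # do not regularize the bias term
    A = X_aug.T @ X_aug + n * lam * penalty      # matches the n*lambda*I form above
    return np.linalg.solve(A, X_aug.T @ y)       # for lam > 0 this is usually well-conditioned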


Derivation: normal equation (short)

Start from gradient = 0:

$$ -\frac{2}{n}\tilde X^\top (y - \tilde X \theta)=0 \Rightarrow \tilde X^\top \tilde X \theta = \tilde X^\top y. $$

If $\tilde X^\top \tilde X$ invertible, solve for $\theta$. If not, use SVD/pinv for stable solution.


⚖️ Strengths, Limitations & Trade-offs

Strengths

  • Closed form: exact (up to numerical precision), no hyperparameter tuning.
  • Transparent math; easy to debug and reason about.
  • Quick on small-to-medium $d$ (dimensions) and $n$ (samples).

Limitations

  • Scales poorly with large $d$ (cost of inversion $O(d^3)$).
  • Sensitive to multicollinearity ($\tilde X^\top\tilde X$ near-singular => numerical instability).
  • Assumes a linear relationship and i.i.d. zero-mean noise (Gaussian noise additionally makes OLS the maximum-likelihood estimator).
  • Deterministic reproducibility can be broken by system BLAS behavior and floating point nondeterminism.

Trade-offs

  • Closed-form vs GD: closed form is exact but expensive; GD scales to large $d$ and streaming but requires learning-rate tuning.
  • Regularization trades bias for variance (reduces overfitting, helps numeric stability).

🔍 Variants & Extensions

  • Ridge regression (L2): adds $\lambda$ to diagonal; improves numeric stability.
  • Lasso (L1): sparsity but no closed-form; requires optimization.
  • Weighted least squares: per-sample weights (heteroscedastic noise).
  • Stochastic/mini-batch gradient descent: for large datasets.
  • Bayesian linear regression: place priors on $\theta$ → posterior distribution.
  • Polynomial / basis expansion: nonlinear features with a linear model (see the sketch after this list).
  • Robust regression: Huber loss or RANSAC to handle outliers.
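
To illustrate the basis-expansion idea, a 1-D input can be expanded into polynomial columns and fed to the same linear machinery; poly_features below is a hypothetical helper, not part of the experiment script:

import numpy as np

def poly_features(x, degree):
    # x: shape (n,) -> (n, degree) matrix [x, x^2, ..., x^degree]
    return np.column_stack([x ** k for k in range(1, degree + 1)])

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.1, size=200)

X_aug = np.hstack([np.ones((len(x), 1)), poly_features(x, degree=3)])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # still an ordinary linear fit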

🚧 Common Challenges & Pitfalls (and how to avoid them)

  1. Not controlling all randomness:

    • Fix seeds for numpy.random, random, and any library used. Save seed in config.
    • Also fix shuffling order for train/val split deterministically.
  2. BLAS / multithreading nondeterminism:

    • Different BLAS libraries or thread counts can slightly change results (order of summation). Record MKL_NUM_THREADS, OMP_NUM_THREADS, and consider setting them to 1 for strict reproducibility.
    • On CI, ensure same NumPy / BLAS versions.
  3. Floating point rounding:

    • Small numeric differences are expected. Strict bit-for-bit equality requires a fixed environment; otherwise compare with tolerances (e.g., rtol=1e-6), as in the tolerance-check sketch after this list.
  4. Singular matrix in normal equation:

    • Use np.linalg.pinv (SVD) or add ridge regularization.
  5. Incorrect bias handling:

    • Forgetting to add the column of ones forces the intercept to zero, giving a wrong fit whenever the true intercept is nonzero. Be explicit: either handle b separately or augment X.
  6. Data leakage between train and val:

    • Always split before standardization/normalization, or compute scaler on train and apply to val.
  7. Logging insufficient info:

    • Always log config, seed(s), training curve, final params, environment (versions), and random states if possible.
  8. Validation metric mismatch:

    • Use the same MSE definition for train and val, and log the number of samples each metric is averaged over.
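
To make pitfall 3 concrete, compare solvers and reruns with explicit tolerances rather than exact equality; theta_cf and theta_gd refer to the closed-form and gradient-descent parameters produced by the script below, and the val_mse_* names are illustrative:

import numpy as np

# Closed-form vs gradient descent agree only up to optimization/floating-point error;
# how tight the tolerance can be depends on the learning rate and epoch count.
np.testing.assert_allclose(theta_gd, theta_cf, rtol=1e-4, atol=1e-6)

# A rerun on a different machine/BLAS should reproduce saved metrics within tolerance.
assert np.isclose(val_mse_rerun, val_mse_saved, rtol=1e-6, atol=1e-12)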

Practical Implementation — Minimal, reproducible NumPy script

Below is a self-contained script you can paste into a file (e.g., exp_linear_reg.py) and run. It implements:

  • deterministic synthetic data generation,
  • closed-form and GD solvers,
  • config saved as YAML,
  • logs train/val MSE to CSV and prints results,
  • saves numpy .npz with parameters and metrics.
# exp_linear_reg.py
import os
import json
import random
from datetime import datetime
import argparse
import numpy as np
import yaml
import csv
import platform
import sys

def set_global_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    # Environment threads: optionally set outside Python before running:
    # os.environ['OMP_NUM_THREADS'] = '1'
    # os.environ['MKL_NUM_THREADS'] = '1'

def gen_synthetic_linear(n_samples, n_features, noise_std, seed):
    set_global_seed(seed)
    X = np.random.randn(n_samples, n_features)
    true_w = np.random.randn(n_features)
    true_b = float(np.random.randn())  # scalar draw (avoids converting a size-1 array)
    noise = np.random.randn(n_samples) * noise_std
    y = X.dot(true_w) + true_b + noise
    return X, y, true_w, true_b

def augment_bias(X):
    n = X.shape[0]
    ones = np.ones((n,1))
    return np.hstack([ones, X])

def closed_form_theta(X, y, reg=0.0):
    # X is augmented (n x (d+1)); column 0 is the bias column
    d = X.shape[1]
    A = X.T.dot(X)
    if reg > 0:
        penalty = np.eye(d)
        penalty[0, 0] = 0.0  # do not regularize the bias term
        A = A + reg * penalty
    # Use pinv (SVD-based) for numerical stability
    theta = np.linalg.pinv(A).dot(X.T).dot(y)
    return theta

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred)**2))

def gradient_descent(X, y, lr=0.01, n_epochs=1000, batch_size=None, verbose=False):
    n, d = X.shape
    theta = np.zeros(d)
    if batch_size is None:
        batch_size = n
    indices = np.arange(n)
    history = []
    for epoch in range(n_epochs):
        # shuffle is deterministic given the global seed set by the caller
        np.random.shuffle(indices)
        for start in range(0, n, batch_size):
            batch_idx = indices[start:start+batch_size]
            Xb = X[batch_idx]
            yb = y[batch_idx]
            pred = Xb.dot(theta)
            grad = (-2.0 / Xb.shape[0]) * Xb.T.dot(yb - pred)
            theta = theta - lr * grad
        # log full-batch MSE after epoch
        pred_full = X.dot(theta)
        history.append(mse(y, pred_full))
        if verbose and (epoch % (n_epochs//5 + 1) == 0):
            print(f"Epoch {epoch}/{n_epochs} MSE={history[-1]:.6f}")
    return theta, history

def save_config(cfg, path):
    with open(path, 'w') as f:
        yaml.safe_dump(cfg, f)

def save_results(path, results):
    # results: dict -> JSON serializable
    with open(path, 'w') as f:
        json.dump(results, f, indent=2)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--n_samples', type=int, default=200)
    parser.add_argument('--n_features', type=int, default=5)
    parser.add_argument('--noise_std', type=float, default=0.5)
    parser.add_argument('--val_frac', type=float, default=0.2)
    parser.add_argument('--lr', type=float, default=0.05)
    parser.add_argument('--epochs', type=int, default=500)
    parser.add_argument('--reg', type=float, default=0.0)
    args = parser.parse_args()

    cfg = vars(args)
    timestamp = datetime.utcnow().strftime('%Y%m%dT%H%M%SZ')
    out_dir = f"results/linear_{timestamp}_seed{args.seed}"
    os.makedirs(out_dir, exist_ok=True)
    save_config(cfg, os.path.join(out_dir, 'config.yaml'))

    set_global_seed(args.seed)
    X, y, true_w, true_b = gen_synthetic_linear(args.n_samples, args.n_features, args.noise_std, args.seed)
    # split
    n_val = int(args.val_frac * args.n_samples)
    indices = np.arange(args.n_samples)
    np.random.shuffle(indices)
    val_idx = indices[:n_val]
    train_idx = indices[n_val:]
    X_train = X[train_idx]; y_train = y[train_idx]
    X_val = X[val_idx]; y_val = y[val_idx]

    X_train_aug = augment_bias(X_train)
    X_val_aug = augment_bias(X_val)

    # Closed form
    theta_cf = closed_form_theta(X_train_aug, y_train, reg=args.reg)
    pred_train_cf = X_train_aug.dot(theta_cf)
    pred_val_cf = X_val_aug.dot(theta_cf)
    train_mse_cf = mse(y_train, pred_train_cf)
    val_mse_cf = mse(y_val, pred_val_cf)

    # Gradient descent (on augmented X)
    theta_gd, history = gradient_descent(X_train_aug, y_train, lr=args.lr, n_epochs=args.epochs, verbose=False)
    pred_train_gd = X_train_aug.dot(theta_gd)
    pred_val_gd = X_val_aug.dot(theta_gd)
    train_mse_gd = mse(y_train, pred_train_gd)
    val_mse_gd = mse(y_val, pred_val_gd)

    # Save artifacts
    np.savez(os.path.join(out_dir, 'params.npz'),
             theta_cf=theta_cf, theta_gd=theta_gd,
             true_w=true_w, true_b=true_b)
    # Save metrics
    metrics = {
        'train_mse_cf': train_mse_cf,
        'val_mse_cf': val_mse_cf,
        'train_mse_gd': train_mse_gd,
        'val_mse_gd': val_mse_gd,
        'gd_history': history
    }
    save_results(os.path.join(out_dir, 'metrics.json'), metrics)

    # Save a CSV log of epoch vs MSE for GD
    csv_path = os.path.join(out_dir, 'gd_mse_log.csv')
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['epoch', 'train_mse'])
        for i, m in enumerate(history):
            writer.writerow([i, m])

    # Save environment info
    env = {
        'python': sys.version,
        'platform': platform.platform(),
        'numpy': np.__version__,
        'seed': args.seed
    }
    save_results(os.path.join(out_dir, 'env.json'), env)

    print("Results saved to:", out_dir)
    print("CF train/val MSE:", train_mse_cf, val_mse_cf)
    print("GD train/val MSE:", train_mse_gd, val_mse_gd)

if __name__ == '__main__':
    main()

How to run

python exp_linear_reg.py --seed 123 --n_samples 500 --n_features 10 --noise_std 0.2 --lr 0.01 --epochs 1000
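
To check that a rerun with the same seed reproduces the saved results, one simple option is to compare the metrics.json files from two runs with a tolerance (a sketch; the results/... paths are placeholders for the directories the script created):

import json
import numpy as np

def compare_metrics(path_a, path_b, rtol=1e-6, atol=1e-12):
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)
    for key in ('train_mse_cf', 'val_mse_cf', 'train_mse_gd', 'val_mse_gd'):
        assert np.isclose(a[key], b[key], rtol=rtol, atol=atol), key
    print('metrics match within tolerance')

# compare_metrics('results/linear_<run1>/metrics.json',
#                 'results/linear_<run2>/metrics.json')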

Reproducibility details & best practices (practical)

  1. Seed everything

    • np.random.seed(SEED) and random.seed(SEED). Save SEED to config. If using other libs (PyTorch, TF) set their seeds.
  2. Deterministic data generation

    • Prefer deterministic feature generation (e.g., np.random.RandomState(seed)) and avoid implicit randomness inside functions.
  3. Save the entire experiment config

    • Include: seed, n_samples, n_features, noise_std, train/val split strategy, batch_size, lr, epochs, reg, timestamp, git commit hash (if applicable). Save as YAML or JSON.
  4. Save model parameters & metrics

    • Save final theta, true_w, true_b (for synthetic), train_mse, val_mse, and training curve. Use .npz and .json for structured saving.
  5. Log everything

    • Human-readable logs + machine-readable logs (CSV/JSON). Record epoch, train_mse, val_mse, and runtime.
  6. Record environment

    • Python version, NumPy version, BLAS library, OS. Tip: np.__config__.show() gives build info. Save as part of env.json.
  7. Control BLAS threads (if strict reproducibility required)

    • Set OMP_NUM_THREADS=1, MKL_NUM_THREADS=1 before running; document in config.
  8. Numeric stability

    • Prefer np.linalg.pinv (SVD) over inv for normal equation in poorly conditioned cases. Standardize features (zero mean, unit variance) for better conditioning.
  9. Train/val split determinism

    • Use seeded shuffling (np.random.RandomState(seed).permutation) and save the split indices to disk for exact reproduction (see the sketch after this list).
  10. Tolerance checks

    • Use absolute/relative tolerances in tests: e.g., assert that |val_mse_run - val_mse_saved| < 1e-8 or rtol=1e-6.
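
A minimal sketch of practice 9: a seeded split whose indices are saved next to the other artifacts (out_dir as in the script above):

import os
import numpy as np

def split_and_save(n_samples, val_frac, seed, out_dir):
    rng = np.random.RandomState(seed)            # dedicated, seeded RNG for the split
    perm = rng.permutation(n_samples)
    n_val = int(val_frac * n_samples)
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    np.savez(os.path.join(out_dir, 'split_indices.npz'),
             train_idx=train_idx, val_idx=val_idx)
    return train_idx, val_idx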

Interview-ready talking points (short bullets)

  • Why test on synthetic data? (Control, known ground truth for debugging.)
  • When to prefer closed-form vs iterative solvers (runtime & numerics).
  • The role of regularization in numerical stability and bias-variance trade-off.
  • Sources of non-determinism: random seeds, multithreading/BLAS, floating-point rounding.
  • Practical steps to make experiments reproducible and how to document them (config, env, artifacts).
  • How to validate implementation correctness: compare closed-form theta with GD result, compare recovered weights to true weights.
