Scaling Solutions: Linear Regression
🎯 Core Idea
When the dataset is huge (e.g., millions of rows), avoid the closed-form OLS solution $(X^\top X)^{-1} X^\top y$ because forming and inverting $X^\top X$ becomes computationally and numerically expensive. Instead, use iterative / approximate solvers (stochastic gradient descent and its mini-batch variants, conjugate gradient, L-BFGS, variance-reduced methods, sketching) that (a) operate in streaming or low-memory fashion, (b) exploit data sparsity and parallelism, and (c) give good approximate solutions much faster than exact inversion.
🌱 Intuition & Real-World Analogy
- Think of closed-form OLS as solving a giant jigsaw by first gluing every piece together into one monstrous puzzle (computing $X^\top X$), then carving out the solution by expensive algebraic surgery (matrix inversion). If the puzzle has 100M pieces, you have neither the table nor the glue — you need to assemble the picture incrementally.
- SGD / mini-batch GD are like painting the picture stroke by stroke: each stroke (a small batch) improves the image; you never need to hold all the paints/pieces in memory at once.
- Sketching is like taking a good snapshot (a compressed summary) of the puzzle that preserves the main structure so you can solve a much smaller problem and still get a decent match.
📐 Mathematical Foundation
Problem: ordinary least squares
$$ \min_{w\in\mathbb{R}^p} \; L(w) = \frac{1}{2n}\sum_{i=1}^{n} (x_i^\top w - y_i)^2 $$
Matrix form:
$$ L(w)=\frac{1}{2n}\|Xw-y\|_2^2. $$
Closed-form (normal equations):
$$ w^\star = (X^\top X)^{-1} X^\top y. $$
Breakdown:
- $X$ is $n\times p$.
- Forming $X^\top X$ costs $O(n p^2)$.
- Inverting a $p\times p$ matrix costs $O(p^3)$ (Cholesky ~ $O(p^3/3)$ for dense).
- Memory to hold $X$ is $O(np)$ (often impossible when $n$ large).
Iterative solvers — gradient and stochastic gradient:
Full gradient:
$$ \nabla L(w) = \frac{1}{n} X^\top (Xw - y). $$
Gradient descent (batch):
$$ w_{t+1} = w_t - \eta_t \nabla L(w_t). $$
Stochastic gradient (single-sample):
$$ w_{t+1} = w_t - \eta_t\, (x_{i(t)}^\top w_t - y_{i(t)}) x_{i(t)}. $$
Mini-batch of size $b$:
$$ \nabla_b L(w) = \frac{1}{b}\sum_{i\in\mathcal{B}} (x_i^\top w - y_i)x_i. $$
Convergence (high-level):
- For strongly convex & smooth problems (e.g., ridge regression), gradient descent converges linearly (error $\propto (1 - c)^{t}$, where $c$ depends inversely on the condition number $\kappa$).
- SGD converges sublinearly in expectation; with diminishing step sizes it achieves an $O(1/t)$ rate for strongly convex objectives (the variance of the stochastic gradients slows the final convergence). (Technical proofs are longer — see references.)
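To make the iteration concrete, here is a minimal mini-batch SGD sketch for the least-squares objective in NumPy; the batch size, constant step size, and synthetic data sizes are illustrative assumptions rather than tuned choices.

```python
import numpy as np

def minibatch_sgd_ls(X, y, batch_size=256, lr=0.01, epochs=5, seed=0):
    """Mini-batch SGD for (1/2n)||Xw - y||^2 with a constant step size."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)                   # reshuffle rows every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient
            w -= lr * grad
    return w

# Tiny synthetic usage example (assumed sizes, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)
w_hat = minibatch_sgd_ls(X, y)
```

In practice the step size would decay over epochs (or follow a schedule) to damp the oscillation around the optimum discussed above.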
Complexity comparison (big-O, conceptual):
- Closed-form: form $X^\top X$ — $O(n p^2)$; invert — $O(p^3)$. Total: $O(n p^2 + p^3)$.
- Batch GD: each iteration $O(n p)$; needs many iterations.
- Mini-batch SGD: each iteration $O(b p)$ (with $b\ll n$), can converge faster in wall time due to cheaper iterations and parallelism.
- Conjugate gradient (CG) for least squares (applied to normal equations or directly using Lanczos on $X$) costs $O(k np)$ for $k$ iterations; often $k\ll p$. Preconditioning reduces $k$.
- Sketching: reduce $n$ to $s$ with $s\approx O(p\log p/\varepsilon^2)$, solve sketched system in $O(s p^2)$ or cheaper depending on method.
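As a quick illustration of the sketch-and-solve idea, here is a minimal example with a dense Gaussian sketch; the problem sizes and sketch size are illustrative assumptions, and a dense Gaussian projection is itself $O(snp)$ to apply, so in practice one would use a structured sketch (SRHT, sparse embeddings) or row sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 20_000, 50, 1_000     # assumed sizes; s grows like p log p / eps^2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Gaussian sketch: compress n rows down to s sketched rows.
S = rng.normal(size=(s, n)) / np.sqrt(s)
w_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# Compare against the exact solution on this small instance.
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_sketch - w_exact) / np.linalg.norm(w_exact))
```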
Derivation: where $O(n p^2 + p^3)$ comes from
- Compute $X^\top X$ as a sum of outer products of the rows: cost $O(n p^2)$.
- Cholesky factorization or inversion of the $p\times p$ matrix: $O(p^3)$. Hence the total is $O(n p^2 + p^3)$; if $n\gg p$ the $n p^2$ term dominates, and if $p$ is large the $p^3$ term dominates.
- Numerical stability: forming $X^\top X$ squares the condition number, which worsens numerical stability. Solving via QR or SVD is more stable but costlier (SVD $\approx O(\min(np^2, n^2p))$).
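To put rough numbers on this, consider an illustrative back-of-the-envelope calculation assuming $n = 10^8$ rows, $p = 10^3$ features, and 8-byte doubles:
$$ n p^2 = 10^{8}\cdot 10^{6} = 10^{14}\ \text{flops}, \qquad p^3 = 10^{9}\ \text{flops}, \qquad \text{memory for } X \approx n p \cdot 8\ \text{bytes} = 800\ \text{GB}. $$
Here the bottleneck is forming $X^\top X$, not the $p^3$ solve, and $X$ itself does not fit in the RAM of a single commodity machine; this is exactly the regime where streaming and iterative methods pay off.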
⚖️ Strengths, Limitations & Trade-offs
Closed-form OLS
- Strengths
  - Exact solution (up to numerical error).
  - One pass over the data to compute $X^\top X$ (can be done online if $p$ is small), followed by a cheap $p\times p$ solve.
- Limitations
  - Memory: storing $X$, or even performing a full pass to compute $X^\top X$, is heavy when $n$ is huge.
  - Time: $O(n p^2 + p^3)$ may be prohibitive.
  - Numerical stability: forming the normal equations squares the condition number; unstable for ill-conditioned features.
- When it's OK
  - $p$ is small (say hundreds or a few thousand) and $n$ fits I/O constraints, or you can reliably stream over the data to form $X^\top X$ and $p$ is modest.
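For that small-$p$ regime, here is a minimal sketch of the streaming normal-equations approach: accumulate $X^\top X$ and $X^\top y$ chunk by chunk and then solve a single $p \times p$ system. The chunk generator and the small ridge term are assumptions added for illustration.

```python
import numpy as np

def streaming_normal_equations(chunks, p, ridge=1e-8):
    """Accumulate X^T X and X^T y over row chunks, then solve a p x p system.

    `chunks` is any iterable yielding (X_chunk, y_chunk) pairs, e.g. batches
    read from disk; beyond the current chunk only O(p^2) memory is needed.
    """
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    for X_chunk, y_chunk in chunks:
        XtX += X_chunk.T @ X_chunk
        Xty += X_chunk.T @ y_chunk
    # A tiny ridge term guards against the ill-conditioning of the normal equations.
    return np.linalg.solve(XtX + ridge * np.eye(p), Xty)

# Usage with an in-memory generator standing in for batches read from disk.
rng = np.random.default_rng(0)
p = 10
w_true = rng.normal(size=p)

def fake_chunks(num_chunks=100, rows=1_000):
    for _ in range(num_chunks):
        Xc = rng.normal(size=(rows, p))
        yield Xc, Xc @ w_true + 0.01 * rng.normal(size=rows)

w_hat = streaming_normal_equations(fake_chunks(), p)
```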
Iterative/Approximate methods (SGD, mini-batch, CG, L-BFGS, SVRG, sketching)
- Strengths
  - Low memory — streaming and out-of-core operation.
  - Cheap per-iteration cost (especially mini-batch SGD).
  - Parallelizable across examples (data parallel).
  - Methods like SVRG/SAGA give faster convergence via variance reduction.
  - Sketching gives provable worst-case approximation with massive speedups.
- Limitations
  - Approximate — requires stopping criteria and tolerance decisions.
  - Hyperparameters (learning rate, batch size, preconditioner) matter.
  - SGD suffers from gradient variance; it may oscillate around the optimum unless step sizes decay.
  - Some sophisticated methods require extra storage (e.g., SAGA stores per-example gradients) or coordination in distributed settings.
Trade-off summary: closed-form = accuracy + heavy resources; iterative = approximate + cheap resources + hyperparameter tuning.
🔍 Variants & Extensions
- Mini-batch SGD — trade between per-iteration variance and throughput; commonly used in practice for large datasets.
- Momentum / Nesterov accelerated GD — faster transient behavior, reduces oscillation.
- Variance-reduced methods: SVRG, SAGA, SAG — combine stochastic cheap steps with occasional correction for variance; faster asymptotic convergence.
- Quasi-Newton (L-BFGS) — maintains low-rank curvature estimate; good for high-accuracy solutions when $p$ moderate and you can make multiple passes.
- Conjugate Gradient (CG) — powerful for symmetric positive definite systems (solve the normal equations, or work directly with $X$ via matrix–vector products as in LSQR). In exact arithmetic it converges in $\le p$ iterations, and in practice far fewer when the eigenvalues cluster; see the sketch after this list.
- Randomized sketching (random projections, leverage-score sampling, SRHT) — reduce $n$ drastically, then solve smaller problem. Provable approximation bounds.
- Preconditioning — scale the system so iterative methods converge faster (diagonal/approximate inverse preconditioners).
- Streaming & reservoir sampling — for model updates in one pass with bounded memory.
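As promised above, a minimal sketch of the Krylov route using SciPy's LSQR, which needs only matrix–vector products with $X$ and $X^\top$ (so $X$ can be sparse or a streaming `LinearOperator`); the problem sizes, damping value, and tolerances are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n, p = 100_000, 100                     # assumed sizes for illustration
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# damp adds a small ridge term; atol/btol are the stopping tolerances.
result = lsqr(X, y, damp=1e-3, atol=1e-8, btol=1e-8)
w_hat, n_iters = result[0], result[2]
print(f"LSQR stopped after {n_iters} iterations")
```

The same call works unchanged if `X` is a `scipy.sparse` matrix, which is where Krylov methods shine for large sparse designs.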
🚧 Common Challenges & Pitfalls
- Thinking “more passes = always better”: SGD can overfit or waste time if learning rate not tuned. Use validation and early stopping.
- Using normal equations blindly: numerical instability for ill-conditioned $X$. Prefer QR or SVD (if feasible) or ridge regularization to stabilize.
- Ignoring data I/O and shuffling: in SGD, lack of randomization or poor ordering can slow convergence. Shuffle or use random sampling.
- Wrong batch size assumptions: tiny $b$ reduces computation but increases gradient variance; very large $b$ approaches batch GD with expensive iterations — choose $b$ to fit hardware (e.g., parallel workers).
- Poor conditioning and slow convergence: scale/standardize features; add ridge ($\lambda I$) to improve conditioning (a minimal sketch follows this list).
- Assuming SGD always wins: for small $n$ or when high accuracy is needed, deterministic or quasi-Newton methods often outperform SGD in number of passes.
- Distributed inconsistency: asynchronous updates (Hogwild!) can be fast but may degrade convergence for dense features; ensure algorithm is suited to sparsity pattern.
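A minimal sketch of those two conditioning fixes combined: standardize the features, then run full-batch gradient descent on the ridge objective $\frac{1}{2n}\|Xw-y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$. The step size and $\lambda$ are illustrative assumptions.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1e-2, lr=0.1, iters=500):
    """Full-batch gradient descent on the ridge-regularized least-squares loss."""
    # Standardize columns: zero mean, unit variance improves conditioning.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    Xs = (X - mu) / sigma
    n, p = Xs.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = Xs.T @ (Xs @ w - y) / n + lam * w   # gradient of the ridge loss
        w -= lr * grad                             # lr assumes well-scaled features
    return w, mu, sigma
```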
✅ Practical Recommendations (Short & Actionable)
- If $n$ is huge (e.g., 100M rows) and $p$ is moderate:
  - Do NOT compute closed-form OLS unless you can stream the data to accumulate $X^\top X$, $p$ is small, and you accept the numerical risks; otherwise use iterative solvers.
  - Prefer mini-batch SGD or L-BFGS with data streaming (multiple passes allowed) for good practical performance.
  - Consider randomized sketching to compress the rows to a smaller representative set, then solve the sketched problem.
  - Add ridge regularization to improve conditioning for iterative methods.
  - Use sensible feature scaling and shuffle the data for SGD.
- If $p$ is also very large (high-dimensional): look into sparse solvers, coordinate descent, or dimensionality reduction (feature hashing, random projections); a minimal coordinate-descent sketch follows below.
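For that high-dimensional case, here is a minimal sketch of cyclic coordinate descent for ridge-regularized least squares; the penalty $\lambda$ and the number of sweeps are illustrative assumptions, and with a sparse $X$ the per-column operations become correspondingly cheap.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam=1e-2, sweeps=50):
    """Cyclic coordinate descent for (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    n, p = X.shape
    w = np.zeros(p)
    r = y.astype(float).copy()         # residual y - X @ w (w starts at 0)
    col_sq = (X ** 2).sum(axis=0)      # precomputed ||X_j||^2 per column
    for _ in range(sweeps):
        for j in range(p):
            r += X[:, j] * w[j]        # remove coordinate j's contribution
            w[j] = X[:, j] @ r / (col_sq[j] + n * lam)
            r -= X[:, j] * w[j]        # add the updated contribution back
    return w
```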
Direct answer to probing question:
“If you have 100M rows, would you compute the closed-form OLS solution?” No. Practically you should avoid forming and inverting $X^\top X$ on 100M rows. Use streaming iterative methods (mini-batch SGD, L-BFGS, conjugate gradient with preconditioning) and/or randomized sketching to get accurate solutions with feasible compute and memory.
📚 Reference Pointers (for deeper reading)
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (chapters on linear methods and computational aspects). https://web.stanford.edu/~hastie/ElemStatLearn/
- Nocedal & Wright — Numerical Optimization (quasi-Newton, CG, preconditioning). https://link.springer.com/book/10.1007/978-0-387-40065-5
- Bottou, Curtis, Nocedal — “Optimization Methods for Large-Scale Machine Learning” (survey on SGD, variance reduction). arXiv:1606.04838. https://arxiv.org/abs/1606.04838
- Johnson & Zhang — “Accelerating Stochastic Gradient Descent using Predictive Variance Reduction” (SVRG), NIPS 2013.
- Defazio, Bach, & Lacoste-Julien — “SAGA: A Fast Incremental Gradient Method” (variance reduced method). https://arxiv.org/abs/1407.0202
- Tropp et al. — randomized sketching literature; see surveys on randomized numerical linear algebra (RandNLA). (Search terms: “randomized sketching least squares”, “SRHT”).