Scaling Solutions: Linear Regression
🎯 Core Idea
When the dataset is huge (e.g., millions of rows), avoid the closed-form OLS solution $(X^\top X)^{-1} X^\top y$ because forming and inverting $X^\top X$ becomes computationally and numerically expensive. Instead, use iterative / approximate solvers (stochastic gradient descent and its mini-batch variants, conjugate gradient, L-BFGS, variance-reduced methods, sketching) that (a) operate in streaming or low-memory fashion, (b) exploit data sparsity and parallelism, and (c) give good approximate solutions much faster than exact inversion.
🌱 Intuition & Real-World Analogy
- Think of closed-form OLS as solving a giant jigsaw by first gluing every piece together into one monstrous puzzle (computing $X^\top X$), then carving out the solution by expensive algebraic surgery (matrix inversion). If the puzzle has 100M pieces, you have neither the table nor the glue — you need to assemble the picture incrementally.
- SGD / mini-batch GD are like painting the picture stroke by stroke: each stroke (a small batch) improves the image; you never need to hold all the paints/pieces in memory at once.
- Sketching is like taking a good snapshot (a compressed summary) of the puzzle that preserves the main structure so you can solve a much smaller problem and still get a decent match.
📐 Mathematical Foundation
Problem: ordinary least squares
$$ \min_{w\in\mathbb{R}^p} \; L(w) = \frac{1}{2n}\sum_{i=1}^{n} (x_i^\top w - y_i)^2 $$
Matrix form:
$$ L(w)=\frac{1}{2n}\|Xw-y\|_2^2. $$
Closed-form (normal equations):
$$ w^\star = (X^\top X)^{-1} X^\top y. $$
Breakdown:
- $X$ is $n\times p$.
- Forming $X^\top X$ costs $O(n p^2)$.
- Inverting a $p\times p$ matrix costs $O(p^3)$ (Cholesky ~ $O(p^3/3)$ for dense).
- Memory to hold $X$ is $O(np)$ (often impossible when $n$ large).
Iterative solvers — gradient and stochastic gradient:
Full gradient:
$$ \nabla L(w) = \frac{1}{n} X^\top (Xw - y). $$
Gradient descent (batch):
$$ w_{t+1} = w_t - \eta_t \nabla L(w_t). $$
Stochastic gradient (single-sample):
$$ w_{t+1} = w_t - \eta_t\, (x_{i(t)}^\top w_t - y_{i(t)}) x_{i(t)}. $$
Mini-batch of size $b$:
$$ \nabla_b L(w) = \frac{1}{b}\sum_{i\in\mathcal{B}} (x_i^\top w - y_i)x_i. $$
Convergence (high-level):
- For strongly convex & smooth problems (e.g., ridge regression), gradient descent converges linearly (error $\propto (1 - c)^{t}$, where $c$ depends inversely on the condition number $\kappa$).
- SGD converges sublinearly in expectation; with diminishing step sizes it achieves an $O(1/t)$ rate for strongly convex objectives (the variance of the stochastic gradients slows the final convergence). (Technical proofs are longer — see references.)
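To make the iteration concrete, here is a minimal mini-batch SGD sketch for the least-squares objective in NumPy; the batch size, constant step size, and synthetic data sizes are illustrative assumptions rather than tuned choices.

```python
import numpy as np

def minibatch_sgd_ls(X, y, batch_size=256, lr=0.01, epochs=5, seed=0):
    """Mini-batch SGD for (1/2n)||Xw - y||^2 with a constant step size."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)                   # reshuffle rows every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient
            w -= lr * grad
    return w

# Tiny synthetic usage example (assumed sizes, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)
w_hat = minibatch_sgd_ls(X, y)
```

In practice the step size would decay over epochs (or follow a schedule) to damp the oscillation around the optimum discussed above.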
Complexity comparison (big-O, conceptual):
- Closed-form: form $X^\top X$ — $O(n p^2)$; invert — $O(p^3)$. Total: $O(n p^2 + p^3)$.
- Batch GD: each iteration $O(n p)$; needs many iterations.
- Mini-batch SGD: each iteration $O(b p)$ (with $b\ll n$), can converge faster in wall time due to cheaper iterations and parallelism.
- Conjugate gradient (CG) for least squares (applied to normal equations or directly using Lanczos on $X$) costs $O(k np)$ for $k$ iterations; often $k\ll p$. Preconditioning reduces $k$.
- Sketching: reduce $n$ to $s$ with $s\approx O(p\log p/\varepsilon^2)$, solve sketched system in $O(s p^2)$ or cheaper depending on method.
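As a quick illustration of the sketch-and-solve idea, here is a minimal example with a dense Gaussian sketch; the problem sizes and sketch size are illustrative assumptions, and a dense Gaussian projection is itself $O(snp)$ to apply, so in practice one would use a structured sketch (SRHT, sparse embeddings) or row sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 20_000, 50, 1_000     # assumed sizes; s grows like p log p / eps^2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Gaussian sketch: compress n rows down to s sketched rows.
S = rng.normal(size=(s, n)) / np.sqrt(s)
w_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# Compare against the exact solution on this small instance.
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_sketch - w_exact) / np.linalg.norm(w_exact))
```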
Derivation: where $O(n p^2 + p^3)$ comes from
- Compute $X^\top X$ as a sum of outer products of the rows: cost $O(n p^2)$.
- Cholesky factorization or inversion of the $p\times p$ matrix: $O(p^3)$. Hence the total is $O(n p^2 + p^3)$; if $n\gg p$ the $n p^2$ term dominates, and if $p$ is large the $p^3$ term dominates.
- Numerical stability: forming $X^\top X$ squares the condition number, which worsens numerical stability. Solving via QR or SVD is more stable but costlier (SVD $\approx O(\min(np^2, n^2p))$).
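To put rough numbers on this, consider an illustrative back-of-the-envelope calculation assuming $n = 10^8$ rows, $p = 10^3$ features, and 8-byte doubles:
$$ n p^2 = 10^{8}\cdot 10^{6} = 10^{14}\ \text{flops}, \qquad p^3 = 10^{9}\ \text{flops}, \qquad \text{memory for } X \approx n p \cdot 8\ \text{bytes} = 800\ \text{GB}. $$
Here the bottleneck is forming $X^\top X$, not the $p^3$ solve, and $X$ itself does not fit in the RAM of a single commodity machine; this is exactly the regime where streaming and iterative methods pay off.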
⚖️ Strengths, Limitations & Trade-offs
Closed-form OLS
- Strengths
  - Exact solution (up to numerical error).
  - One pass over the data to compute $X^\top X$ (can be done online if $p$ is small), followed by a cheap $p\times p$ solve.
- Limitations
  - Memory: storing $X$, or even performing a full pass to compute $X^\top X$, is heavy when $n$ is huge.
  - Time: $O(n p^2 + p^3)$ may be prohibitive.
  - Numerical stability: forming the normal equations squares the condition number; unstable for ill-conditioned features.
- When it's OK
  - $p$ is small (say hundreds or a few thousand) and $n$ fits I/O constraints, or you can reliably stream over the data to form $X^\top X$ and $p$ is modest.
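For that small-$p$ regime, here is a minimal sketch of the streaming normal-equations approach: accumulate $X^\top X$ and $X^\top y$ chunk by chunk and then solve a single $p \times p$ system. The chunk generator and the small ridge term are assumptions added for illustration.

```python
import numpy as np

def streaming_normal_equations(chunks, p, ridge=1e-8):
    """Accumulate X^T X and X^T y over row chunks, then solve a p x p system.

    `chunks` is any iterable yielding (X_chunk, y_chunk) pairs, e.g. batches
    read from disk; beyond the current chunk only O(p^2) memory is needed.
    """
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    for X_chunk, y_chunk in chunks:
        XtX += X_chunk.T @ X_chunk
        Xty += X_chunk.T @ y_chunk
    # A tiny ridge term guards against the ill-conditioning of the normal equations.
    return np.linalg.solve(XtX + ridge * np.eye(p), Xty)

# Usage with an in-memory generator standing in for batches read from disk.
rng = np.random.default_rng(0)
p = 10
w_true = rng.normal(size=p)

def fake_chunks(num_chunks=100, rows=1_000):
    for _ in range(num_chunks):
        Xc = rng.normal(size=(rows, p))
        yield Xc, Xc @ w_true + 0.01 * rng.normal(size=rows)

w_hat = streaming_normal_equations(fake_chunks(), p)
```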
Iterative/Approximate methods (SGD, mini-batch, CG, L-BFGS, SVRG, sketching)
- Strengths
  - Low memory — streaming and out-of-core operation.
  - Cheap per-iteration cost (especially mini-batch SGD).
  - Parallelizable across examples (data parallel).
  - Methods like SVRG/SAGA give faster convergence via variance reduction.
  - Sketching gives provable worst-case approximation with massive speedups.
- Limitations
  - Approximate — requires stopping criteria and tolerance decisions.
  - Hyperparameters (learning rate, batch size, preconditioner) matter.
  - SGD suffers from gradient variance; it may oscillate around the optimum unless step sizes decay.
  - Some sophisticated methods require extra storage (e.g., SAGA stores per-example gradients) or coordination in distributed settings.
Trade-off summary: closed-form = accuracy + heavy resources; iterative = approximate + cheap resources + hyperparameter tuning.
🔍 Variants & Extensions
- Mini-batch SGD — trade between per-iteration variance and throughput; commonly used in practice for large datasets.
- Momentum / Nesterov accelerated GD — faster transient behavior, reduces oscillation.
- Variance-reduced methods: SVRG, SAGA, SAG — combine stochastic cheap steps with occasional correction for variance; faster asymptotic convergence.
- Quasi-Newton (L-BFGS) — maintains low-rank curvature estimate; good for high-accuracy solutions when $p$ moderate and you can make multiple passes.
- Conjugate Gradient (CG) — powerful for symmetric positive definite systems (solve the normal equations, or work directly with $X$ via matrix–vector products as in LSQR). In exact arithmetic it converges in $\le p$ iterations, and in practice far fewer when the eigenvalues cluster; see the sketch after this list.
- Randomized sketching (random projections, leverage-score sampling, SRHT) — reduce $n$ drastically, then solve smaller problem. Provable approximation bounds.
- Preconditioning — scale the system so iterative methods converge faster (diagonal/approximate inverse preconditioners).
- Streaming & reservoir sampling — for model updates in one pass with bounded memory.
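As promised above, a minimal sketch of the Krylov route using SciPy's LSQR, which needs only matrix–vector products with $X$ and $X^\top$ (so $X$ can be sparse or a streaming `LinearOperator`); the problem sizes, damping value, and tolerances are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n, p = 100_000, 100                     # assumed sizes for illustration
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# damp adds a small ridge term; atol/btol are the stopping tolerances.
result = lsqr(X, y, damp=1e-3, atol=1e-8, btol=1e-8)
w_hat, n_iters = result[0], result[2]
print(f"LSQR stopped after {n_iters} iterations")
```

The same call works unchanged if `X` is a `scipy.sparse` matrix, which is where Krylov methods shine for large sparse designs.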
🚧 Common Challenges & Pitfalls
- Thinking “more passes = always better”: SGD can overfit or waste time if learning rate not tuned. Use validation and early stopping.
- Using normal equations blindly: numerical instability for ill-conditioned $X$. Prefer QR or SVD (if feasible) or ridge regularization to stabilize.
- Ignoring data I/O and shuffling: in SGD, lack of randomization or poor ordering can slow convergence. Shuffle or use random sampling.
- Wrong batch size assumptions: tiny $b$ reduces computation but increases gradient variance; very large $b$ approaches batch GD with expensive iterations — choose $b$ to fit hardware (e.g., parallel workers).
- Poor conditioning and slow convergence: scale/standardize features; add ridge ($\lambda I$) to improve conditioning (a minimal sketch follows this list).
- Assuming SGD always wins: for small $n$ or when high accuracy is needed, deterministic or quasi-Newton methods often outperform SGD in number of passes.
- Distributed inconsistency: asynchronous updates (Hogwild!) can be fast but may degrade convergence for dense features; ensure algorithm is suited to sparsity pattern.
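A minimal sketch of those two conditioning fixes combined: standardize the features, then run full-batch gradient descent on the ridge objective $\frac{1}{2n}\|Xw-y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$. The step size and $\lambda$ are illustrative assumptions.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1e-2, lr=0.1, iters=500):
    """Full-batch gradient descent on the ridge-regularized least-squares loss."""
    # Standardize columns: zero mean, unit variance improves conditioning.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    Xs = (X - mu) / sigma
    n, p = Xs.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = Xs.T @ (Xs @ w - y) / n + lam * w   # gradient of the ridge loss
        w -= lr * grad                             # lr assumes well-scaled features
    return w, mu, sigma
```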
✅ Practical Recommendations (Short & Actionable)
- If $n$ is huge (e.g., 100M rows) and $p$ is moderate:
  - Do NOT compute closed-form OLS unless you can stream the data to accumulate $X^\top X$, $p$ is small, and you accept the numerical risks; otherwise use iterative solvers.
  - Prefer mini-batch SGD or L-BFGS with data streaming (multiple passes allowed) for good practical performance.
  - Consider randomized sketching to compress the rows to a smaller representative set, then solve the sketched problem.
  - Add ridge regularization to improve conditioning for iterative methods.
  - Use sensible feature scaling and shuffle the data for SGD.
- If $p$ is also very large (high-dimensional): look into sparse solvers, coordinate descent, or dimensionality reduction (feature hashing, random projections); a minimal coordinate-descent sketch follows below.
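For that high-dimensional case, here is a minimal sketch of cyclic coordinate descent for ridge-regularized least squares; the penalty $\lambda$ and the number of sweeps are illustrative assumptions, and with a sparse $X$ the per-column operations become correspondingly cheap.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam=1e-2, sweeps=50):
    """Cyclic coordinate descent for (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    n, p = X.shape
    w = np.zeros(p)
    r = y.astype(float).copy()         # residual y - X @ w (w starts at 0)
    col_sq = (X ** 2).sum(axis=0)      # precomputed ||X_j||^2 per column
    for _ in range(sweeps):
        for j in range(p):
            r += X[:, j] * w[j]        # remove coordinate j's contribution
            w[j] = X[:, j] @ r / (col_sq[j] + n * lam)
            r -= X[:, j] * w[j]        # add the updated contribution back
    return w
```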
Direct answer to probing question:
“If you have 100M rows, would you compute the closed-form OLS solution?” No. Practically you should avoid forming and inverting $X^\top X$ on 100M rows. Use streaming iterative methods (mini-batch SGD, L-BFGS, conjugate gradient with preconditioning) and/or randomized sketching to get accurate solutions with feasible compute and memory.
📚 Reference Pointers (for deeper reading)
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (chapters on linear methods and computational aspects). https://web.stanford.edu/~hastie/ElemStatLearn/
- Nocedal & Wright — Numerical Optimization (quasi-Newton, CG, preconditioning). https://link.springer.com/book/10.1007/978-0-387-40065-5
- Bottou, Curtis, Nocedal — “Optimization Methods for Large-Scale Machine Learning” (survey on SGD, variance reduction). arXiv:1606.04838. https://arxiv.org/abs/1606.04838
- Johnson & Zhang — “Accelerating Stochastic Gradient Descent using Predictive Variance Reduction” (SVRG), NIPS 2013.
- Defazio, Bach, & Lacoste-Julien — “SAGA: A Fast Incremental Gradient Method” (variance reduced method). https://arxiv.org/abs/1407.0202
- Tropp et al. — randomized sketching literature; see surveys on randomized numerical linear algebra (RandNLA). (Search terms: “randomized sketching least squares”, “SRHT”).