Scaling Solutions: Linear Regression


🎯 Core Idea

When the dataset is huge (e.g., millions of rows) you avoid the closed-form OLS solution $(X^\top X)^{-1} X^\top y$ because forming and inverting $X^\top X$ becomes computationally expensive and numerically fragile. Instead, use iterative / approximate solvers (stochastic gradient descent and its mini-batch variants, conjugate gradient, L-BFGS, variance-reduced methods, sketching) that (a) operate in a streaming or low-memory fashion, (b) exploit data sparsity and parallelism, and (c) give good approximate solutions much faster than exact inversion.


🌱 Intuition & Real-World Analogy

  • Think of closed-form OLS as solving a giant jigsaw by first gluing every piece together into one monstrous puzzle (compute $X^\top X$), then carving out the solution by expensive algebraic surgery (matrix inversion). If the puzzle has 100M pieces, you have neither the table nor the glue — you need to assemble the picture incrementally.
  • SGD / mini-batch GD are like painting the picture stroke by stroke: each stroke (a small batch) improves the image; you never need to hold all paints/pieces in memory at once.
  • Sketching is like taking a good snapshot (a compressed summary) of the puzzle that preserves the main structure so you can solve a much smaller problem and still get a decent match.

📐 Mathematical Foundation

Problem: ordinary least squares

$$ \min_{w\in\mathbb{R}^p} \; L(w) = \frac{1}{2n}\sum_{i=1}^{n} (x_i^\top w - y_i)^2 $$

Matrix form:

$$ L(w)=\frac{1}{2n}\|Xw-y\|_2^2. $$

Closed-form (normal equations):

$$ w^\star = (X^\top X)^{-1} X^\top y. $$

Breakdown:

  • $X$ is $n\times p$.
  • Forming $X^\top X$ costs $O(n p^2)$.
  • Inverting a $p\times p$ matrix costs $O(p^3)$ (a Cholesky factorization takes roughly $p^3/3$ flops for dense matrices).
  • Memory to hold $X$ is $O(np)$ (often impossible when $n$ large).
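
For concreteness, here is a minimal NumPy sketch (synthetic data; sizes and variable names are illustrative) contrasting a direct normal-equations solve with NumPy's `lstsq`, which uses an SVD-based solver and avoids squaring the condition number:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 20                      # small enough to solve exactly; illustrative sizes
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Normal equations: form X^T X (O(n p^2)) and solve the p x p system (O(p^3)).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least-squares solver: more stable, avoids squaring the condition number.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.max(np.abs(w_normal - w_lstsq)))  # tiny difference on this well-conditioned data
```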

Iterative solvers — gradient and stochastic gradient:

Full gradient:

$$ \nabla L(w) = \frac{1}{n} X^\top (Xw - y). $$

Gradient descent (batch):

$$ w_{t+1} = w_t - \eta_t \nabla L(w_t). $$

Stochastic gradient (single-sample):

$$ w_{t+1} = w_t - \eta_t\, (x_{i(t)}^\top w_t - y_{i(t)}) x_{i(t)}. $$

Mini-batch of size $b$:

$$ \nabla_b L(w) = \frac{1}{b}\sum_{i\in\mathcal{B}} (x_i^\top w - y_i)x_i. $$
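
A bare-bones mini-batch SGD loop for this objective might look like the following sketch (the learning rate, batch size, and epoch count are illustrative, untuned choices):

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=256, lr=0.01, epochs=5, seed=0):
    """Mini-batch SGD on the least-squares objective (1/2n)||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)    # mini-batch gradient
            w -= lr * grad
    return w

# Example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(50_000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50_000)
print(np.linalg.norm(minibatch_sgd(X, y) - w_true))   # should be close to the noise floor
```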

Convergence (high-level):

  • For strongly convex and smooth problems (e.g., ridge regression), gradient descent converges linearly: the error decays like $(1-c)^t$, where $c$ depends on the condition number $\kappa$.
  • SGD converges sublinearly in expectation; with diminishing step sizes it achieves an $O(1/t)$ rate for strongly convex objectives, and the variance of the stochastic gradients limits the accuracy reachable with a fixed step size. (Rigorous proofs can be found in standard stochastic optimization texts.)

Complexity comparison (big-O, conceptual):

  • Closed-form: form $X^\top X$ — $O(n p^2)$; invert — $O(p^3)$. Total: $O(n p^2 + p^3)$.
  • Batch GD: each iteration $O(n p)$; needs many iterations.
  • Mini-batch SGD: each iteration $O(b p)$ (with $b\ll n$), can converge faster in wall time due to cheaper iterations and parallelism.
  • Conjugate gradient (CG) for least squares (applied to the normal equations, or via LSQR, which works with $X$ only through matrix–vector products) costs $O(k np)$ for $k$ iterations; often $k\ll p$. Preconditioning reduces $k$.
  • Sketching: reduce $n$ to $s$ with $s\approx O(p\log p/\varepsilon^2)$, solve sketched system in $O(s p^2)$ or cheaper depending on method.

Derivation: where $O(n p^2 + p^3)$ comes from

  1. Compute $X^\top X$: sum outer products of rows: cost $O(n p^2)$.
  2. Cholesky factorization / inversion of the $p\times p$ system: $O(p^3)$. Hence the total is $O(n p^2 + p^3)$. If $n\gg p$ the $n p^2$ term dominates; if $p$ is large, the $p^3$ term dominates. Numerical stability: forming $X^\top X$ squares the condition number, so the normal equations are less stable; solving via QR or SVD is more stable but costlier (SVD costs roughly $O(\min(n p^2,\; n^2 p))$).
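
For illustration, take $n = 10^8$ and $p = 10^3$. Storing $X$ in 64-bit floats needs $np \cdot 8 \approx 800$ GB; forming $X^\top X$ costs on the order of $np^2 = 10^{14}$ multiply-adds, while the subsequent $p^3 = 10^9$ factorization is negligible by comparison. A single full-gradient (or SGD) pass over the data is $O(np) = 10^{11}$ work, roughly a thousand times cheaper than forming $X^\top X$ once.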

⚖️ Strengths, Limitations & Trade-offs

Closed-form OLS

  • Strengths

    • Exact solution (up to numerical error).
    • One pass over the data to compute $X^\top X$ (can be done online if $p$ is small), followed by a cheap $p\times p$ solve.
  • Limitations

    • Memory: storing $X$ is infeasible when $n$ is huge, and even a single pass to compute $X^\top X$ is heavy.
    • Time: $O(n p^2 + p^3)$ may be prohibitive.
    • Numerical stability: forming normal equations squares condition number; unstable for ill-conditioned features.
  • When it’s OK

    • $p$ is small (say, hundreds to a few thousand) and you can either hold $X$ in memory or stream over it once to accumulate $X^\top X$, accepting the numerical caveats above.

Iterative/Approximate methods (SGD, mini-batch, CG, L-BFGS, SVRG, sketching)

  • Strengths

    • Low memory — streaming and out-of-core.
    • Cheap per-iteration cost (especially mini-batch SGD).
    • Parallelizable across examples (data parallel).
    • Methods like SVRG/SAGA give faster convergence via variance reduction.
    • Sketching gives provable worst-case approximation with massive speedups.
  • Limitations

    • Approximate — requires stopping criteria and tolerance decisions.
    • Hyperparameters (learning rate, batch size, preconditioner) matter.
    • SGD suffers from gradient variance; it may oscillate around the optimum unless step sizes decay.
    • Some sophisticated methods require extra storage (e.g., SAGA stores gradients) or coordination in distributed settings.

Trade-off summary: closed-form = accuracy + heavy resources; iterative = approximate + cheap resources + hyperparameter tuning.
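
To make the trade-off concrete, the sketch below (scikit-learn with default settings, which for `SGDRegressor` include a small L2 penalty) compares an exact least-squares fit with an SGD fit on data that still fits in memory; at true scale only the SGD-style path remains feasible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 50))
y = X @ rng.normal(size=50) + 0.1 * rng.normal(size=200_000)
X = StandardScaler().fit_transform(X)        # feature scaling matters a lot for SGD

exact = LinearRegression().fit(X, y)         # exact least-squares solver
sgd = SGDRegressor(max_iter=20, tol=1e-5, random_state=0).fit(X, y)

# Coefficients should agree closely; SGD is approximate but each pass is cheap.
print(np.max(np.abs(exact.coef_ - sgd.coef_)))
```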


🔍 Variants & Extensions

  • Mini-batch SGD — trade between per-iteration variance and throughput; commonly used in practice for large datasets.
  • Momentum / Nesterov accelerated GD — faster transient behavior, reduces oscillation.
  • Variance-reduced methods: SVRG, SAGA, SAG — combine stochastic cheap steps with occasional correction for variance; faster asymptotic convergence.
  • Quasi-Newton (L-BFGS) — maintains low-rank curvature estimate; good for high-accuracy solutions when $p$ moderate and you can make multiple passes.
  • Conjugate Gradient (CG) — powerful for symmetric positive definite systems (solve the normal equations, or use CGLS/LSQR, which only need matrix–vector products with $X$ and $X^\top$). In exact arithmetic it converges in at most $p$ iterations; in practice far fewer when the eigenvalues cluster.
  • Randomized sketching (random projections, leverage-score sampling, SRHT) — reduce $n$ drastically, then solve the smaller problem; comes with provable approximation bounds (see the sketch-and-solve example after this list).
  • Preconditioning — scale the system so iterative methods converge faster (diagonal/approximate inverse preconditioners).
  • Streaming & reservoir sampling — for model updates in one pass with bounded memory.
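
As a rough illustration of sketch-and-solve, the following uses a dense Gaussian sketch. It is only a toy: applying a dense $s\times n$ sketch costs $O(snp)$, which defeats the purpose at scale, so practical systems use fast structured transforms (SRHT, sparse embeddings) instead; the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 10_000, 30, 1_000             # sketch size s is an illustrative choice
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Gaussian sketch: S X and S y with S an s x n random matrix (scaled so E[S^T S] = I).
S = rng.normal(size=(s, n)) / np.sqrt(s)
Xs, ys = S @ X, S @ y

w_full, *_ = np.linalg.lstsq(X, y, rcond=None)       # exact solution for comparison
w_sketch, *_ = np.linalg.lstsq(Xs, ys, rcond=None)   # solve the much smaller s x p problem

print(np.linalg.norm(w_sketch - w_full) / np.linalg.norm(w_full))  # small relative error
```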

🚧 Common Challenges & Pitfalls

  • Thinking “more passes = always better”: SGD can overfit or waste time if the learning rate is not tuned. Use validation and early stopping.
  • Using normal equations blindly: numerical instability for ill-conditioned $X$. Prefer QR or SVD (if feasible) or ridge regularization to stabilize.
  • Ignoring data I/O and shuffling: in SGD, lack of randomization or poor ordering can slow convergence. Shuffle or use random sampling.
  • Wrong batch size assumptions: tiny $b$ reduces computation but increases gradient variance; very large $b$ approaches batch GD with expensive iterations — choose $b$ to fit hardware (e.g., parallel workers).
  • Poor conditioning and slow convergence: scale/standardize features; add ridge ($\lambda I$) to improve conditioning (see the conditioning check after this list).
  • Assuming SGD always wins: for small $n$ or when high accuracy is needed, deterministic or quasi-Newton methods often outperform SGD in number of passes.
  • Distributed inconsistency: asynchronous updates (Hogwild!) can be fast but may degrade convergence for dense features; ensure algorithm is suited to sparsity pattern.
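
A quick diagnostic for the conditioning issues above: compare the condition number of $X^\top X$ before and after standardizing the features, and after adding a ridge term. The feature scales, the near-collinear feature, and $\lambda = 10$ below are made up purely for illustration.

```python
import numpy as np
from numpy.linalg import cond

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(0, 1000, n)                  # feature on a huge scale
x2 = rng.normal(0, 0.01, n)                  # feature on a tiny scale
x3 = x1 / 1000 + 1e-3 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

print(cond(X.T @ X))                         # astronomically large

Xs = (X - X.mean(0)) / X.std(0)              # standardization fixes the scale mismatch...
print(cond(Xs.T @ Xs))                       # ...but near-collinearity keeps it large

lam = 10.0
print(cond(Xs.T @ Xs + lam * np.eye(3)))     # ridge bounds the smallest eigenvalue
```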

✅ Practical Recommendations (Short & Actionable)

  • If $n$ is huge (e.g., 100M rows) and $p$ is moderate:

    • Do NOT compute the closed-form OLS solution unless $p$ is small, you can stream the data to accumulate $X^\top X$, and you accept the numerical risks; otherwise use iterative solvers.
    • Prefer mini-batch SGD or L-BFGS with data streaming (multiple passes allowed) for good practical performance.
    • Consider randomized sketching to compress rows to a smaller representative set, then solve the sketched problem.
    • Add ridge regularization to improve conditioning for iterative methods.
    • Use sensible feature scaling and shuffle data for SGD.
  • If $p$ is also very large (high-dimensional): look into sparse solvers, coordinate descent, or dimensionality reduction (feature hashing, random projections).

Direct answer to the probing question:

“If you have 100M rows, would you compute the closed-form OLS solution?” No. Practically you should avoid forming and inverting $X^\top X$ on 100M rows. Use streaming iterative methods (mini-batch SGD, L-BFGS, conjugate gradient with preconditioning) and/or randomized sketching to get accurate solutions with feasible compute and memory.
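
In code, “streaming iterative methods” often just means feeding mini-batches to an incremental learner. A minimal out-of-core sketch with scikit-learn's `partial_fit` is shown below; the chunk generator is a stand-in for your actual data reader, and `W_TRUE`, the chunk sizes, and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

P = 50
W_TRUE = np.linspace(-1.0, 1.0, P)            # illustrative "ground truth" weights

def iter_chunks(n_chunks=100, chunk_size=10_000, seed=0):
    """Placeholder for a real out-of-core reader (e.g., chunks streamed from Parquet/CSV)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, P))
        yield X, X @ W_TRUE + 0.1 * rng.normal(size=chunk_size)

model = SGDRegressor(alpha=1e-6, random_state=0)
for X_chunk, y_chunk in iter_chunks():
    model.partial_fit(X_chunk, y_chunk)       # one incremental update per chunk, bounded memory

print(np.max(np.abs(model.coef_ - W_TRUE)))   # should be small after enough chunks
```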


