Comparing Linear Regression Variants & Alternatives
L1 (Lasso) vs. L2 (Ridge) Regularization
- Choose L1 (Lasso) when: You suspect many features are irrelevant and want a sparse model with automatic feature selection.
- Choose L2 (Ridge) when: You believe features are all somewhat informative and you want to shrink coefficients to improve generalization while keeping them dense.
🅰️ L1 (Lasso)
- Encourage sparsity by penalizing absolute coefficient magnitude; goal is both regularization and variable selection.
🅱️ L2 (Ridge)
- Penalize squared coefficient magnitude to shrink weights smoothly and reduce variance without forcing exact zeros.
🅰️ L1 (Lasso)
- Introduces a non-differentiable kink at zero; optimization (coordinate descent, sub-gradient methods) can push some coefficients exactly to zero.
🅱️ L2 (Ridge)
- Adds a smooth convex quadratic penalty; gradients remain linear in coefficients so closed-form (or standard solvers) work cleanly and no exact zeros are produced.
🅰️ L1 (Lasso)
- Sparsity: Yes — encourages exact zeros because the absolute penalty has corners that align with axes.
- Use Case: High-dimensional problems (p ≫ n) or when model interpretability / feature selection matters.
🅱️ L2 (Ridge)
- Sparsity: No — produces dense solutions where weights are shrunk but typically non-zero.
- Use Case: When multicollinearity exists or you want stable, low-variance estimates and all features are expected to contribute.
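A minimal sketch of the contrast, assuming scikit-learn is available, on a synthetic dataset where only a handful of features carry signal; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, only 10 of which are informative.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: penalizes |theta|, can zero out coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: penalizes theta^2, only shrinks coefficients

# Lasso typically zeros out most irrelevant coefficients; Ridge keeps them all non-zero.
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
```

Inspecting the non-zero counts makes the sparsity difference concrete: the L1 fit doubles as feature selection, while the L2 fit stays dense.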
Normal Equation (Closed-Form) vs Gradient Descent (Iterative)
- Choose Normal Equation when: Feature count is small-to-moderate and you prefer an exact solution with minimal tuning.
- Choose Gradient Descent when: You have many features, lots of data, or need iterative/online updates and memory/performance constraints prevent matrix inversion.
🅰️ Normal Equation
- Compute the analytic minimizer for ordinary least squares; returns exact solution (within numerical precision) in one computation.
🅱️ Gradient Descent
- Iteratively search parameter space with updates following negative gradient to converge to a (local/global) minimum.
🅰️ Normal Equation
- Forms and inverts $X^T X$; no hyperparameters like learning rate but vulnerable to singular/ill-conditioned matrices.
🅱️ Gradient Descent
- Requires learning rate (and possibly momentum, scheduling); scales to large datasets via mini-batching and stochastic variants.
🅰️ Normal Equation
- Sparsity: Not applicable — provides dense solution unless paired with explicit regularization that yields sparsity.
- Use Case: Small-to-medium feature counts (up to a few thousand features, depending on memory); quick one-shot solution.
🅱️ Gradient Descent
- Sparsity: Not inherently sparse (unless combined with L1 penalty).
- Use Case: Very large numbers of features or samples, streaming data, or when you need to incorporate complex objectives and constraints incrementally.
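A minimal NumPy sketch of both routes on the same small problem, with the normal equation $\theta = (X^\top X)^{-1} X^\top y$ solved as a linear system rather than via an explicit inverse; the learning rate and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=500)

# Normal equation: solve (X^T X) theta = X^T y directly.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same least-squares objective.
theta_gd = np.zeros(3)
alpha, m = 0.1, len(y)
for _ in range(1000):
    grad = X.T @ (X @ theta_gd - y) / m
    theta_gd -= alpha * grad

print(theta_closed)  # exact minimizer (up to numerical precision)
print(theta_gd)      # converges to nearly the same values after enough iterations
```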
Batch Gradient Descent vs Stochastic / Mini-batch Gradient Descent
- Choose Batch GD when: You can afford full-batch computations and prefer stable, smooth convergence for small datasets.
- Choose Stochastic/Mini-batch GD when: You need scalability, faster per-update progress, or better ability to escape shallow local plateaus.
🅰️ Batch Gradient Descent
- Use the entire dataset each update to compute the exact gradient direction for stable descent.
🅱️ Stochastic / Mini-batch GD
- Use one sample (stochastic) or a small random subset (mini-batch) per update to get noisy but frequent gradient signals, often improving wall-clock speed and sometimes generalization.
🅰️ Batch Gradient Descent
- Produces low-variance, deterministic updates but each step is expensive: requires full pass over data.
🅱️ Stochastic / Mini-batch GD
- Noisy updates can help jump out of shallow basins and often reach good solutions faster in practice; requires learning-rate tuning and may need averaging / momentum.
🅰️ Batch Gradient Descent
- Sparsity: Not affected by batch choice.
- Use Case: Small datasets where exact gradient per step is affordable and deterministic convergence is preferred.
🅱️ Stochastic / Mini-batch GD
- Sparsity: Not affected directly.
- Use Case: Large-scale learning, online learning, or when you want faster iterations and to leverage hardware (GPU/TPU) via vectorized mini-batches.
🅰️ Batch Gradient Descent
- Core Update (batch): $\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^\top (X\theta - y)$
🅱️ Stochastic / Mini-batch GD
- Core Update (stochastic, for sample i): $\theta \leftarrow \theta - \alpha \cdot x^{(i)} (x^{(i)\top}\theta - y^{(i)})$
- Core Update (mini-batch, size b): average the same gradient over the b samples in the batch.
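A minimal sketch of both update rules, assuming synthetic data; the epoch count, batch size, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

def batch_gd(X, y, alpha=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        # Exact gradient over the full dataset: one expensive, low-variance step per epoch.
        theta -= alpha * X.T @ (X @ theta - y) / m
    return theta

def minibatch_gd(X, y, alpha=0.1, epochs=100, batch_size=32):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            # Noisy gradient from a small random subset: many cheap steps per epoch.
            theta -= alpha * X[b].T @ (X[b] @ theta - y[b]) / len(b)
    return theta

print(batch_gd(X, y))
print(minibatch_gd(X, y))
```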
Ordinary Least Squares (OLS) vs Robust Regression (Huber / RANSAC)
- Choose OLS when: Errors are roughly Gaussian and outliers are rare — OLS is efficient and unbiased under classic assumptions.
- Choose Robust methods when: The dataset contains outliers or heavy-tailed noise that would unduly skew the OLS estimator.
🅰️ OLS (Ordinary Least Squares)
- Minimize squared residuals to get the best linear unbiased estimator under Gauss–Markov assumptions.
🅱️ Robust Regression (Huber / RANSAC)
- Aim to reduce influence of outliers by using alternative loss functions (Huber) or consensus-based fits (RANSAC).
🅰️ OLS
- Penalizes squared errors; large residuals have quadratic influence, so outliers can dominate parameter estimates.
🅱️ Robust Regression
- Huber: quadratic for small residuals and linear for large ones (smoothly reduces outlier influence).
- RANSAC: fits repeatedly on random subsets and selects the model with the most inliers (resilient to many outliers).
🅰️ OLS
- Sparsity: No. OLS produces dense parameter estimates.
- Use Case: Clean datasets that satisfy linear model assumptions (homoscedasticity, no influential outliers).
🅱️ Robust Regression
- Sparsity: No (unless combined with L1).
- Use Case: Datasets with measurement errors, label noise, or adversarial/outlier points where OLS is unreliable.
🅰️ OLS
- Objective (MSE): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2$
🅱️ Robust Regression
- Huber Loss (quadratic for small residuals $r = y^{(i)} - x^{(i)\top}\theta$, linear for large ones):
$$L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta \\ \delta \left(|r| - \frac{1}{2}\delta\right) & |r| > \delta \end{cases}$$
- RANSAC (conceptual): Repeatedly fit model to random subsets and maximize inlier count (no simple closed-form loss).
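A minimal sketch, assuming scikit-learn, that contaminates a clean linear relationship with outliers and compares the three fits; the outlier fraction and Huber epsilon are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 4.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=200)
y[:20] += 30.0  # inject a block of large positive outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

# OLS is pulled toward the outliers; Huber and RANSAC stay close to the true slope of 4.
print("OLS slope:   ", ols.coef_[0])
print("Huber slope: ", huber.coef_[0])
print("RANSAC slope:", ransac.estimator_.coef_[0])
```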
Polynomial Feature Expansion vs Kernel Methods
- Choose Polynomial Expansion when: Dimensionality after expansion is manageable and you want explicit feature representations for inspection and downstream use.
- Choose Kernel Methods when: The feature space is implicitly huge (or infinite) and you want non-linear expressiveness without constructing explicit high-dimensional features.
🅰️ Polynomial Feature Expansion
- Build explicit higher-order features (e.g., $x, x^2, x^3, x_i x_j$) to allow a linear model to fit non-linear relationships.
🅱️ Kernel Methods
- Use a kernel function to compute inner products in an implicit feature space, enabling non-linear decision surfaces without explicit expansion.
🅰️ Polynomial Feature Expansion
- Changes the parameterization: you solve linear regression on the enlarged feature set; complexity grows combinatorially with the number of terms (degree and interactions).
🅱️ Kernel Methods
- Works through the kernel matrix $K$ (size m×m): training requires solving systems involving $K$, which gives high expressiveness at a computational cost tied to the number of samples rather than the explicit feature dimension.
🅰️ Polynomial Feature Expansion
- Sparsity: Not inherent; feature counts explode and may cause multicollinearity — use regularization to combat overfitting.
- Use Case: Low-dimensional inputs where degree can be kept small and explicit transformed features are useful for interpretation.
🅱️ Kernel Methods
- Sparsity: Not directly; can combine with sparse kernel approximations (e.g., Nyström, random features).
- Use Case: When implicit high-dimensional mappings are needed (e.g., smooth non-linearities) and sample count m is moderate.
🅰️ Polynomial Feature Expansion
- Example (degree d) features: build $[x, x^2, \dots, x^d]$ and interactions; the objective remains $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - \phi(x^{(i)})^\top \theta)^2$ where $\phi(x)$ is the explicit polynomial basis.
🅱️ Kernel Methods
- Kernel ridge (dual) objective (example): $\min_\alpha \; \frac{1}{m}\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$ with predictions $\hat{y}(x) = k(x)^\top \alpha$, where $k(x)$ is the vector of kernel evaluations against the training points.
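A minimal sketch, assuming scikit-learn, fitting the same non-linear 1-D target both ways; the degree, kernel, and regularization strengths are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Explicit expansion: build [x, x^2, x^3] and fit a regularized linear model on it.
poly_model = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1e-3)).fit(X, y)

# Implicit expansion: work with the m x m kernel matrix instead of explicit features.
kernel_model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5).fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(poly_model.predict(X_test))
print(kernel_model.predict(X_test))
```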
Bias–Variance Trade-off vs Regularization
- Think of Bias–Variance as: the conceptual diagnostic describing underfitting vs overfitting.
- Think of Regularization as: a practical control knob to shift that trade-off toward lower variance (at the cost of added bias).
🅰️ Bias–Variance Trade-off
- Explain the decomposition of expected error into bias² (error from wrong model assumptions), variance (sensitivity to training set noise), and irreducible noise.
🅱️ Regularization
- Use penalty terms to intentionally increase bias slightly while reducing variance, often improving generalization on unseen data.
🅰️ Bias–Variance Trade-off
- High-capacity models → low bias, high variance; low-capacity models → high bias, low variance. Goal: find sweet spot (model complexity + data).
🅱️ Regularization
- Mechanism: shrink coefficients (L2) or select features (L1) to reduce model complexity effectively, thus decreasing variance.
🅰️ Bias–Variance Trade-off
- Sparsity: Conceptual trade-off doesn’t directly impose sparsity.
- Use Case: Diagnostic framework to decide model complexity, data collection, and validation strategies.
🅱️ Regularization
- Sparsity: L1 can induce sparsity; L2 does not.
- Use Case: Regularization is the practical tool applied when variance (overfitting) is observed on validation data.
🅰️ Bias–Variance Trade-off
- Decomposition (expected squared error): $\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$ where $\sigma^2$ is irreducible noise.
🅱️ Regularization
- Example (Ridge-regularized objective): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2 + \lambda \|\theta\|_2^2$. The regularization strength $\lambda$ trades bias for variance in a controllable way.
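A minimal sketch, assuming scikit-learn, that sweeps the ridge penalty $\lambda$ (called alpha in the API) and watches training vs. validation error shift along the bias–variance trade-off; the alpha grid and dataset sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    # Small alpha -> low bias, high variance; large alpha -> higher bias, lower variance.
    print(f"alpha={alpha:7.2f}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):8.2f}  "
          f"val MSE={mean_squared_error(y_val, model.predict(X_val)):8.2f}")
```

Picking the alpha with the lowest validation error is exactly the practical "control knob" move this section describes.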