Comparing Linear Regression Variants & Alternatives


L1 (Lasso) vs. L2 (Ridge) Regularization

  • Choose L1 (Lasso) when: You suspect many features are irrelevant and want a sparse model with automatic feature selection.
  • Choose L2 (Ridge) when: You believe features are all somewhat informative and you want to shrink coefficients to improve generalization while keeping them dense.
🅰️ L1 (Lasso)
  • Encourage sparsity by penalizing absolute coefficient magnitude; goal is both regularization and variable selection.

🅱️ L2 (Ridge)
  • Penalize squared coefficient magnitude to shrink weights smoothly and reduce variance without forcing exact zeros.
🅰️ L1 (Lasso)
  • Introduces a non-differentiable kink at zero; optimization (coordinate descent, sub-gradient methods) can push some coefficients exactly to zero.

🅱️ L2 (Ridge)
  • Adds a smooth convex quadratic penalty; gradients remain linear in coefficients so closed-form (or standard solvers) work cleanly and no exact zeros are produced.
🅰️ L1 (Lasso)
  • Sparsity: Yes — encourages exact zeros because the absolute penalty has corners that align with axes.
  • Use Case: High-dimensional problems (p ≫ n) or when model interpretability / feature selection matters.

🅱️ L2 (Ridge)
  • Sparsity: No — produces dense solutions where weights are shrunk but typically non-zero.
  • Use Case: When multicollinearity exists or you want stable, low-variance estimates and all features are expected to contribute.
🅰️ L1 (Lasso)
  • Penalty Term: $\lambda \sum_{i=1}^{n} |\theta_i|$

🅱️ L2 (Ridge)
  • Penalty Term: $\lambda \sum_{i=1}^{n} \theta_i^2$
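
A minimal sketch of the practical difference, assuming synthetic data and scikit-learn's Lasso and Ridge estimators (the feature count and alpha values below are arbitrary illustrative choices):

```python
# Minimal sketch (assumed synthetic data): compare Lasso vs Ridge sparsity.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))          # 20 features, only 3 truly informative
true_w = np.zeros(20)
true_w[:3] = [4.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty -> some coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty -> small but dense weights

print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
```

With only a few truly informative features, Lasso typically reports far fewer non-zero coefficients than Ridge, which keeps all twenty but shrinks them.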

Normal Equation (Closed-Form) vs Gradient Descent (Iterative)

  • Choose Normal Equation when: Feature count is small-to-moderate and you prefer an exact solution with minimal tuning.
  • Choose Gradient Descent when: You have many features, lots of data, or need iterative/online updates and memory/performance constraints prevent matrix inversion.
🅰️ Normal Equation
  • Compute the analytic minimizer for ordinary least squares; returns exact solution (within numerical precision) in one computation.

🅱️ Gradient Descent
  • Iteratively search parameter space with updates following negative gradient to converge to a (local/global) minimum.
🅰️ Normal Equation
  • Forms and inverts $X^T X$; no hyperparameters like learning rate but vulnerable to singular/ill-conditioned matrices.

🅱️ Gradient Descent
  • Requires learning rate (and possibly momentum, scheduling); scales to large datasets via mini-batching and stochastic variants.
🅰️ Normal Equation
  • Sparsity: Not applicable — provides dense solution unless paired with explicit regularization that yields sparsity.
  • Use Case: Small-to-medium feature counts (up to a few thousand features, depending on memory); a quick one-shot solution.

🅱️ Gradient Descent
  • Sparsity: Not inherently sparse (unless combined with L1 penalty).
  • Use Case: Very large n or m, streaming data, or when you need to incorporate complex objectives and constraints incrementally.
🅰️ Normal Equation
  • Core Equation: $\hat{\theta} = (X^\top X)^{-1} X^\top y$ (use pseudo-inverse or regularization when $X^\top X$ is singular)

🅱️ Gradient Descent
  • Update Rule (batch GD): $\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^\top (X\theta - y)$ where $\alpha$ is the learning rate.
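
To make the contrast concrete, here is a minimal NumPy sketch on synthetic data comparing the closed-form solution with batch gradient descent; the learning rate and iteration count are illustrative, not tuned:

```python
# Minimal sketch: closed-form normal equation vs batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # add bias column
theta_true = rng.normal(size=n + 1)
y = X @ theta_true + rng.normal(scale=0.1, size=m)

# Normal equation: theta = (X^T X)^{-1} X^T y (pinv guards against singularity)
theta_closed = np.linalg.pinv(X.T @ X) @ X.T @ y

# Batch gradient descent: theta <- theta - alpha * (1/m) * X^T (X theta - y)
theta_gd = np.zeros(n + 1)
alpha = 0.1
for _ in range(2000):
    theta_gd -= alpha * (X.T @ (X @ theta_gd - y)) / m

print(np.allclose(theta_closed, theta_gd, atol=1e-3))  # both recover ~the same fit
```

Both routes land on essentially the same parameters; the difference is how they get there and how they scale with feature count and dataset size.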

Batch Gradient Descent vs Stochastic / Mini-batch Gradient Descent

  • Choose Batch GD when: You can afford full-batch computations and prefer stable, smooth convergence for small datasets.
  • Choose Stochastic/Mini-batch GD when: You need scalability, faster per-update progress, or better ability to escape shallow local plateaus.
🅰️ Batch Gradient Descent
  • Use the entire dataset each update to compute the exact gradient direction for stable descent.

🅱️ Stochastic / Mini-batch GD
  • Use one sample (stochastic) or a small random subset (mini-batch) per update to get noisy but frequent gradient signals, often improving wall-clock speed and sometimes generalization.
🅰️ Batch Gradient Descent
  • Produces low-variance, deterministic updates but each step is expensive: requires full pass over data.

🅱️ Stochastic / Mini-batch GD
  • Noisy updates can help jump out of shallow basins and often reach good solutions faster in practice; requires learning-rate tuning and may need averaging / momentum.
🅰️ Batch Gradient Descent
  • Sparsity: Not affected by batch choice.
  • Use Case: Small datasets where exact gradient per step is affordable and deterministic convergence is preferred.

🅱️ Stochastic / Mini-batch GD
  • Sparsity: Not affected directly.
  • Use Case: Large-scale learning, online learning, or when you want faster iterations and to leverage hardware (GPU/TPU) via vectorized mini-batches.
🅰️ Batch Gradient Descent
  • Core Update (batch): $\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^\top (X\theta - y)$

🅱️ Stochastic / Mini-batch GD
  • Core Update (stochastic, sample $i$): $\theta \leftarrow \theta - \alpha \cdot x^{(i)} (x^{(i)\top}\theta - y^{(i)})$
  • Mini-batch (size $b$): average the per-sample gradients over the $b$ samples in the batch.
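
A minimal sketch of mini-batch updates on synthetic data; the batch size, learning rate, and epoch count are arbitrary choices:

```python
# Minimal sketch: mini-batch gradient descent on the least-squares objective.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
y = X @ theta_true + rng.normal(scale=0.1, size=m)

def minibatch_gd(X, y, batch_size=32, alpha=0.05, epochs=20):
    theta = np.zeros(X.shape[1])
    idx = np.arange(len(y))
    for _ in range(epochs):
        rng.shuffle(idx)                       # fresh random order each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad              # noisy but cheap update
    return theta

theta_mb = minibatch_gd(X, y)
print(np.linalg.norm(theta_mb - theta_true))   # should be small
```

Each epoch makes many cheap, noisy updates instead of one exact full-batch step, which is what lets the method scale to datasets that do not fit a single gradient computation comfortably.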

Ordinary Least Squares (OLS) vs Robust Regression (Huber / RANSAC)

  • Choose OLS when: Errors are roughly Gaussian and outliers are rare — OLS is efficient and unbiased under classic assumptions.
  • Choose Robust methods when: The dataset contains outliers or heavy-tailed noise that would unduly skew the OLS estimator.
🅰️ OLS (Ordinary Least Squares)
  • Minimize squared residuals to get the best linear unbiased estimator under Gauss–Markov assumptions.

🅱️ Robust Regression (Huber / RANSAC)
  • Aim to reduce influence of outliers by using alternative loss functions (Huber) or consensus-based fits (RANSAC).
🅰️ OLS
  • Penalizes squared errors; large residuals have quadratic influence, so outliers can dominate parameter estimates.

🅱️ Robust Regression
  • Huber: quadratic for small residuals and linear for large ones (smoothly reduces outlier influence).
  • RANSAC: fits repeatedly on random subsets and selects the model with the most inliers (resilient to many outliers).
🅰️ OLS
  • Sparsity: No. OLS produces dense parameter estimates.
  • Use Case: Clean datasets that satisfy linear model assumptions (homoscedasticity, no influential outliers).

🅱️ Robust Regression
  • Sparsity: No (unless combined with L1).
  • Use Case: Datasets with measurement errors, label noise, or adversarial/outlier points where OLS is unreliable.
🅰️ OLS
  • Objective (MSE): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2$

🅱️ Robust Regression
  • Huber Loss:
$$L_\delta(r) = \begin{cases} \tfrac{1}{2}r^2 & \text{if } |r| \le \delta \\ \delta\,(|r| - \tfrac{1}{2}\delta) & \text{otherwise} \end{cases} \quad \text{where } r = y - x^\top\theta$$

  • RANSAC (conceptual): Repeatedly fit model to random subsets and maximize inlier count (no simple closed-form loss).
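
A minimal sketch contrasting the three fits on data with deliberately corrupted labels, using scikit-learn's LinearRegression, HuberRegressor, and RANSACRegressor with default settings (the data and outlier fraction are invented for illustration):

```python
# Minimal sketch: OLS vs robust fits on data with injected outliers.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(scale=0.3, size=200)
y[:20] += 25.0                                # corrupt 10% of labels as outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)            # quadratic near 0, linear in the tails
ransac = RANSACRegressor().fit(X, y)          # fits on random subsets, keeps inliers

print("OLS slope:   ", ols.coef_[0])
print("Huber slope: ", huber.coef_[0])
print("RANSAC slope:", ransac.estimator_.coef_[0])
```

On data like this, the OLS slope is typically dragged toward the corrupted points, while the Huber and RANSAC estimates stay much closer to the true value of 2.5.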

Polynomial Feature Expansion vs Kernel Methods

  • Choose Polynomial Expansion when: Dimensionality after expansion is manageable and you want explicit feature representations for inspection and downstream use.
  • Choose Kernel Methods when: The feature space is implicitly huge (or infinite) and you want non-linear expressiveness without constructing explicit high-dimensional features.
🅰️ Polynomial Feature Expansion
  • Build explicit higher-order features (e.g., $x, x^2, x^3, x_i x_j$) to allow a linear model to fit non-linear relationships.

🅱️ Kernel Methods
  • Use a kernel function to compute inner products in an implicit feature space, enabling non-linear decision surfaces without explicit expansion.
🅰️ Polynomial Feature Expansion
  • Changes parameterization: you solve linear regression on enlarged feature set; complexity grows with combinatorial number of terms (degree & interactions).

🅱️ Kernel Methods
  • Influence is through kernel matrix $K$ (size m×m). Training depends on solving systems involving $K$, enabling high expressiveness but with computational cost tied to samples, not explicit feature dimension.
🅰️ Polynomial Feature Expansion
  • Sparsity: Not inherent; feature counts explode and may cause multicollinearity — use regularization to combat overfitting.
  • Use Case: Low-dimensional inputs where degree can be kept small and explicit transformed features are useful for interpretation.

🅱️ Kernel Methods
  • Sparsity: Not directly; can combine with sparse kernel approximations (e.g., Nyström, random features).
  • Use Case: When implicit high-dimensional mappings are needed (e.g., smooth non-linearities) and sample count m is moderate.
🅰️ Polynomial Feature Expansion
  • Example (degree $d$) features: build $[x, x^2, \dots, x^d]$ and interactions; the objective remains $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - \phi(x^{(i)})^\top \theta)^2$, where $\phi(x)$ is the explicit polynomial basis.

🅱️ Kernel Methods
  • Kernel ridge (dual) objective (example): $\min_\alpha \; \frac{1}{m}\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$, with predictions $\hat{y}(x) = k(x)^\top \alpha$, where $k(x)$ is the vector of kernel evaluations against the training points.
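
A minimal sketch of the two routes to the same kind of non-linear fit, assuming a 1-D synthetic target; the degree, kernel, and penalty values are illustrative, not tuned:

```python
# Minimal sketch: explicit polynomial expansion vs implicit kernel ridge.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

# Explicit features: build [x, x^2, ..., x^5] and fit a regularized linear model
poly_model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1e-3)).fit(X, y)

# Implicit features: RBF kernel ridge works through the m x m kernel matrix instead
kernel_model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5).fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(np.sin(X_test).ravel())            # target values
print(poly_model.predict(X_test))        # polynomial approximation
print(kernel_model.predict(X_test))      # kernel approximation
```

The polynomial route produces inspectable explicit features; the kernel route never materializes them, trading that transparency for expressiveness whose cost scales with the number of samples.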

Bias–Variance Trade-off vs Regularization

  • Think of Bias–Variance as: the conceptual diagnostic describing underfitting vs overfitting.
  • Think of Regularization as: a practical control knob to shift that trade-off toward lower variance (at the cost of added bias).
🅰️ Bias–Variance Trade-off
  • Explain the decomposition of expected error into bias² (error from wrong model assumptions), variance (sensitivity to training set noise), and irreducible noise.

🅱️ Regularization
  • Use penalty terms to intentionally increase bias slightly while reducing variance, often improving generalization on unseen data.
🅰️ Bias–Variance Trade-off
  • High-capacity models → low bias, high variance; low-capacity models → high bias, low variance. Goal: find sweet spot (model complexity + data).

🅱️ Regularization
  • Mechanism: shrink coefficients (L2) or select features (L1) to reduce model complexity effectively, thus decreasing variance.
🅰️ Bias–Variance Trade-off
  • Sparsity: Conceptual trade-off doesn’t directly impose sparsity.
  • Use Case: Diagnostic framework to decide model complexity, data collection, and validation strategies.

🅱️ Regularization
  • Sparsity: L1 can induce sparsity; L2 does not.
  • Use Case: Regularization is the practical tool applied when variance (overfitting) is observed on validation data.
🅰️ Bias–Variance Trade-off
  • Decomposition (expected squared error): $\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$ where $\sigma^2$ is irreducible noise.

🅱️ Regularization
  • Example (Ridge-regularized objective): $\min_\theta \; \frac{1}{m}\sum_{i=1}^m (y^{(i)} - x^{(i)\top}\theta)^2 + \lambda \|\theta\|_2^2$. The regularization strength $\lambda$ trades a little bias for a controllable reduction in variance.
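
A minimal sketch of that knob in action, assuming a small synthetic problem with more features than the data comfortably supports; the alpha grid is arbitrary:

```python
# Minimal sketch: sweep the ridge penalty and watch train vs validation error.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))                  # few samples, many features -> high variance
theta_true = rng.normal(size=30)
y = X @ theta_true + rng.normal(scale=2.0, size=80)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:g}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.2f}  "
          f"val MSE={mean_squared_error(y_va, model.predict(X_va)):.2f}")
```

As alpha grows, training error rises (more bias) while validation error typically falls at first and then rises again once the model is over-shrunk, which is the bias-variance trade-off read directly off a validation curve.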