Cost Function and Optimization: Linear Regression
🎯 Core Idea
- Cost Function & Optimization: choose a scalar loss that measures how well your model's predictions $\hat{y}$ match true targets $y$ (e.g., MSE), then minimize that loss over model parameters using optimization algorithms (e.g., Gradient Descent). The cost function defines the geometry of the problem; the optimizer traverses that landscape to find (local/global) minima.
🌱 Intuition & Real-World Analogies
- Why a cost function? You need a single number that tells you whether your model is "good" or "bad"; minimizing that number guides learning.
- Analogy 1, target practice: the cost is the distance from the bullseye (squared distance = MSE). You adjust your aim (the parameters) iteratively; the size of each correction is the learning rate.
- Analogy 2, slope descent on a mountain: the loss landscape is terrain, and gradient descent is walking downhill using the local slope. Too large a stride (large learning rate) and you might tumble past the valley; too small and you crawl forever.
📐 Mathematical Foundation
Mean Squared Error (MSE)
$$ \text{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i(\theta))^2 $$
- $n$: number of training examples.
- $y_i$: true target for example $i$.
- $\hat{y}_i(\theta)$: model prediction for example $i$ parameterized by $\theta$.
- MSE is the average squared residual (error).
Interpretation of terms
- Squaring penalizes large errors more heavily (quadratic penalty).
- Dividing by $n$ gives an average that does not grow with dataset size; sometimes $\frac{1}{2n}$ is used so the factor of 2 cancels in the gradient.
Mean Absolute Error (MAE)
$$ \text{MAE}(\theta) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i(\theta)| $$
- Linear penalty for errors; less sensitive to outliers than MSE.
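To make both definitions concrete, here is a minimal numpy sketch; the function names and sample values are illustrative, not from the original:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

# Illustrative values.
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat))  # 0.375
print(mae(y, y_hat))  # 0.5
```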
Gradient Descent (GD): basic update (batch GD)
Given current parameters $\theta^{(t)}$, learning rate $\eta$:
$$ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \, J(\theta^{(t)}) $$
where $J(\theta)$ is the cost (e.g., MSE) and $\nabla_\theta J(\theta)$ is its gradient vector (a code sketch follows the stopping rules below).
For linear regression with model $\hat{y}_i = \mathbf{x}_i^\top \theta$ and $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top \theta)^2$:
$$ \nabla_\theta J(\theta) = -\frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (y_i - \mathbf{x}_i^\top \theta) $$
(derivation in the collapsible section below).
Convergence criteria
Common stopping rules:
- $\|\theta^{(t+1)} - \theta^{(t)}\| < \epsilon$ (parameter change small).
- $|J(\theta^{(t+1)}) - J(\theta^{(t)})| < \delta$ (loss change small).
- Gradient norm $\|\nabla_\theta J(\theta)\| < \tau$.
- Fixed maximum number of iterations/epochs.
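Combining the update rule above with two of these stopping rules, here is a minimal batch-GD sketch for linear-regression MSE, assuming numpy; the function name, synthetic data, and hyperparameter values are illustrative:

```python
import numpy as np

def batch_gd(X, y, eta=0.1, max_iters=10_000, tau=1e-6):
    """Batch gradient descent on MSE for a linear model y_hat = X @ theta.

    Stops on a small gradient norm or after a fixed number of iterations
    (two of the stopping rules listed above).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        residuals = y - X @ theta                # r_i = y_i - x_i^T theta
        grad = -(2.0 / n) * X.T @ residuals      # gradient of MSE (formula above)
        if np.linalg.norm(grad) < tau:           # stopping rule: small gradient norm
            break
        theta -= eta * grad                      # update: theta <- theta - eta * grad
    return theta

# Usage on synthetic data (the column of ones plays the role of an intercept).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
print(batch_gd(X, y))  # close to [1.0, 2.0]
```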
📖 Derivation: gradient of MSE for linear regression (short)
Start with $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top\theta)^2$. Differentiate w.r.t. $\theta$:
$$ \nabla_\theta J(\theta) = \frac{1}{n}\sum 2 (y_i - \mathbf{x}_i^\top\theta)(-\mathbf{x}_i) = -\frac{2}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta). $$
So the update is $\theta \leftarrow \theta + \frac{2\eta}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta)$.
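One way to sanity-check this derivation is to compare the analytic gradient against central finite differences; a small sketch, assuming numpy (all names and data are illustrative):

```python
import numpy as np

def loss(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

def analytic_grad(theta, X, y):
    return -(2.0 / len(y)) * X.T @ (y - X @ theta)

def numeric_grad(theta, X, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * h)
    return g

rng = np.random.default_rng(1)
X, y, theta = rng.normal(size=(50, 3)), rng.normal(size=50), rng.normal(size=3)
# The two gradients should agree up to finite-difference error (~1e-8 here).
print(np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y))))
```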
⚖️ Strengths, Limitations & Trade-offs
MSE
Strengths
- Smooth and differentiable everywhere, which is convenient for gradient-based methods.
- Penalizes large errors strongly; good when large errors are particularly bad.
- Leads to convex objective for linear models (globally optimizable).
Limitations
- Sensitive to outliers (squared term amplifies them).
- Not robust when error distribution has heavy tails.
MAE
Strengths
- Robust to outliers (linear penalty).
- Lends itself to median estimation: for a constant predictor, the MAE-minimizing value is a median of the targets (a numerical check follows below).
Limitations
- Non-differentiable at zero residual (subgradient methods needed); can slow optimization with naive gradient methods.
- Less smooth objective, so slightly harder to optimize with gradient-based optimizers that assume smoothness.
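The mean/median contrast is easy to verify numerically for a constant predictor; a small grid-search sketch, assuming numpy, with illustrative data containing one outlier:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])     # targets with one large outlier
c = np.linspace(0.0, 100.0, 100_001)     # candidate constant predictions
mse_c = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
mae_c = np.abs(y[:, None] - c[None, :]).mean(axis=0)
print(c[mse_c.argmin()], y.mean())       # 26.5 26.5 -> MSE is minimized by the mean
# MAE is flat on [2, 3] for this even-sized sample; the median 2.5 lies in
# that minimizing interval (argmin returns its left edge, 2.0).
print(c[mae_c.argmin()], np.median(y))   # 2.0 2.5
```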
Trade-offs
- Choose MSE when you want smooth gradients and penalize large deviations; choose MAE when robustness to outliers matters.
- Common compromise: the Huber loss, quadratic near 0 and linear in the tails (smooth + robust); see the sketch below.
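A minimal sketch of the Huber loss as just described, assuming numpy; the threshold `delta` and the sample residuals are illustrative:

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (smooth + robust)."""
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([0.5, 3.0])))  # [0.125 2.5]: small error squared, large error linear
```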
🔄 Variants & Extensions
- Loss variants
  - Huber loss (robust + differentiable).
  - Log-cosh loss (smooth; behaves like MSE near 0 and like MAE for large errors).
- Optimization algorithms
  - Batch GD (full dataset per update).
  - Stochastic GD (SGD): one sample per update (high noise, frequent updates).
  - Mini-batch GD: a trade-off between gradient variance and computational efficiency.
  - Momentum, Nesterov momentum: accelerate along consistent gradient directions.
  - Adaptive optimizers (AdaGrad, RMSProp, Adam): adjust per-parameter learning rates.
- Regularization (changes the objective; a combined sketch follows this list)
  - $L_2$ (Ridge): $J(\theta) + \lambda \|\theta\|_2^2$; reduces variance, useful with multicollinearity.
  - $L_1$ (Lasso): $J(\theta) + \lambda \|\theta\|_1$; induces sparsity.
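As an illustration of two entries in this list, here is a minimal mini-batch SGD sketch with an $L_2$ (ridge) penalty, assuming numpy; the function name and hyperparameter values are illustrative:

```python
import numpy as np

def minibatch_sgd_ridge(X, y, eta=0.05, lam=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch SGD on the ridge objective J(theta) + lam * ||theta||_2^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -(2.0 / len(idx)) * Xb.T @ (yb - Xb @ theta)  # MSE part
            grad += 2.0 * lam * theta                 # gradient of lam * ||theta||_2^2
            theta -= eta * grad
    return theta
```

With `lam=0` this reduces to plain mini-batch SGD on the MSE objective; `batch_size=1` gives SGD, and `batch_size=n` recovers batch GD.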
🚧 Common Challenges & Pitfalls
- Learning rate selection
  - Too large: divergence or oscillation.
  - Too small: extremely slow convergence, or getting trapped in shallow regions.
  - Practical: use learning rate schedules (decay), warm restarts, or adaptive optimizers (Adam), but be aware of their own pitfalls (e.g., Adam may generalize differently).
- Poor scaling / feature variance
  - Features with very different scales cause an ill-conditioned Hessian and slow convergence. Always normalize/standardize features for GD-based methods (see the sketch after this list).
- Local minima vs saddle points
  - For convex MSE + linear models, the global minimum is guaranteed. For nonconvex models (deep nets), saddle points and local minima matter.
- Overfitting
  - Minimizing training loss perfectly (very low MSE) can hurt test performance; use validation, regularization, and early stopping.
- Noisy gradients (SGD)
  - Variance in gradient estimates can slow convergence; use learning rate schedules, momentum, or larger mini-batches.
- Interpreting loss magnitude
  - The absolute MSE value depends on the target scale; compare relative improvements or use normalized metrics (e.g., RMSE, $R^2$).
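Following the scaling point above, a minimal standardization sketch, assuming numpy; the helper `standardize` is illustrative, and statistics are computed on training data only to avoid leakage:

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling; statistics come from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])  # wildly different scales
X_train_s, X_test_s = standardize(X[:80], X[80:])
print(X_train_s.std(axis=0))  # ~[1. 1.]: comparable scales, better conditioning
```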
✅ Practical Tips (concise)
- Start with a modest learning rate (e.g., $10^{-3}$; task dependent) and monitor the loss curve.
- Scale inputs (zero mean, unit variance) to improve conditioning.
- Use mini-batch GD (e.g., batch sizes of 32–512) for efficiency and stable gradients.
- Switch optimizers if tuning the learning rate is difficult: Adam is a robust default; consider SGD+momentum for better generalization in some settings.
- Visualize loss vs. iterations; watch for divergence (loss shoots up) or a plateau (learning rate too small, or stuck).
- If the loss diverges, reduce the learning rate by a factor of 10 and re-run; if progress is too slow, increase it by a factor of 2–10, carefully.
🤔 Probing Question: What happens if your learning rate is too high or too low?
- Too high ($\eta$ large): updates overshoot minima; parameters oscillate or diverge; loss can increase or blow up to NaNs. In nonconvex settings, you can jump out of wide basins and miss good minima.
- Too low ($\eta$ tiny): extremely slow progress; optimizer may appear stuck (tiny loss improvement per step); can get trapped in flat regions for a long time.
- Practical tuning: use learning rate schedules (exponential decay, step decay), cyclical LR, or adaptive methods (Adam, RMSProp). Combine with momentum to damp oscillations if $\eta$ is large.
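A toy demonstration of all three regimes on the 1-D quadratic $J(\theta) = \theta^2$, where GD has the closed form $\theta_t = (1 - 2\eta)^t \theta_0$; the rates and step counts below are illustrative:

```python
# GD on J(theta) = theta^2 (gradient 2*theta). The iterate is
# theta_t = (1 - 2*eta)**t * theta_0, so |1 - 2*eta| > 1 means divergence.
def run(eta, steps=20, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta   # theta <- theta - eta * grad J(theta)
    return theta

print(run(eta=1.1))    # |1 - 2.2| = 1.2 > 1: oscillates and diverges (~38.3)
print(run(eta=0.001))  # 0.998**20 ~ 0.96: barely moved after 20 steps
print(run(eta=0.4))    # 0.2**20 ~ 1e-14: essentially at the minimum
```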
📚 Reference Pointers
- MSE, MAE, and robust alternatives: Wikipedia, "Loss function". https://en.wikipedia.org/wiki/Loss_function
- Huber loss: Wikipedia, "Huber loss". https://en.wikipedia.org/wiki/Huber_loss
- Optimization basics & algorithms: Goodfellow, Bengio & Courville, Deep Learning (Ch. 8, Optimization). https://www.deeplearningbook.org/
- Convexity & linear regression theory: Bishop, C. M., Pattern Recognition and Machine Learning. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
- Practical SGD tricks (momentum, Adam): Reddi et al., "On the Convergence of Adam and Beyond". https://arxiv.org/abs/1904.09237
- Regularization effects: Hastie, Tibshirani & Friedman, The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn/