Cost Function and Optimization: Linear Regression


🎯 Core Idea

  • Cost Function & Optimization: choose a scalar loss that measures how well your model’s predictions $\hat{y}$ match true targets $y$ (e.g., MSE), then minimize that loss over model parameters using optimization algorithms (e.g., Gradient Descent). The cost function defines the geometry of the problem; the optimizer traverses that landscape to find (local/global) minima.

🌱 Intuition & Real-World Analogies

  • Why a cost function? You need a single number that tells you whether your model is "good" or "bad." Minimizing that number guides learning.
  • Analogy 1 (target practice): the cost is distance from the bullseye (squared distance = MSE). You adjust your aim (parameters) iteratively; how big a correction you make is the learning rate.
  • Analogy 2 (descending a mountain): the loss landscape is terrain; gradient descent is walking downhill using the local slope. Too large a stride (large learning rate) and you might tumble past the valley; too small and you crawl forever.

πŸ“ Mathematical Foundation

Mean Squared Error (MSE)

$$ \text{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i(\theta))^2 $$
  • $n$: number of training examples.
  • $y_i$: true target for example $i$.
  • $\hat{y}_i(\theta)$: model prediction for example $i$ parameterized by $\theta$.
  • MSE is the average squared residual (error).

Interpretation of terms

  • Squaring penalizes large errors more heavily (quadratic penalty).
  • Division by $n$ makes the loss an average that does not grow with dataset size; sometimes $\frac{1}{2n}$ is used so the factor of 2 from differentiation cancels.
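As a quick sanity check, here is a minimal NumPy sketch of this formula (the toy values are purely illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    residuals = y_true - y_pred
    return np.mean(residuals ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
```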

Mean Absolute Error (MAE)

$$ \text{MAE}(\theta) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i(\theta)| $$
  • Linear penalty for errors; less sensitive to outliers than MSE.
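A matching sketch for MAE, with one artificial outlier to show the robustness difference against MSE (all values are made up for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average absolute residual."""
    return np.mean(np.abs(y_true - y_pred))

# One large outlier residual dominates MSE but affects MAE only linearly.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.0, 2.0, 3.0, 10.0])
print(mae(y_true, y_pred))              # 22.5
print(np.mean((y_true - y_pred) ** 2))  # 2025.0 (MSE on the same data)
```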

Gradient Descent (GD): basic update (batch GD)

Given current parameters $\theta^{(t)}$, learning rate $\eta$:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \, J(\theta^{(t)}) $$

where $J(\theta)$ is the cost (e.g., MSE) and $\nabla_\theta J(\theta)$ is its gradient vector.
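To make the update rule concrete, here is a toy sketch minimizing the one-dimensional quadratic $J(\theta) = (\theta - 3)^2$ (the function, learning rate, and iteration count are arbitrary choices for illustration):

```python
# Toy example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, eta = 0.0, 0.1
for t in range(50):
    grad = 2.0 * (theta - 3.0)
    theta = theta - eta * grad  # theta^(t+1) = theta^(t) - eta * grad
print(theta)  # approaches the minimizer theta* = 3
```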

For linear regression with model $\hat{y}_i = \mathbf{x}_i^\top \theta$ and $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top \theta)^2$:

$$ \nabla_\theta J(\theta) = -\frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (y_i - \mathbf{x}_i^\top \theta) $$

(derivation in the section below).

Convergence criteria

Common stopping rules:

  • $\|\theta^{(t+1)} - \theta^{(t)}\| < \epsilon$ (parameter change small).
  • $|J(\theta^{(t+1)}) - J(\theta^{(t)})| < \delta$ (loss change small).
  • Gradient norm $\|\nabla_\theta J(\theta)\| < \tau$.
  • Fixed maximum number of iterations/epochs.
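Putting the gradient formula and a stopping rule together, here is a minimal batch-GD sketch for linear regression (the synthetic data, learning rate, and tolerance are illustrative assumptions, not prescriptions):

```python
import numpy as np

# Synthetic data: y = X @ true_theta + noise (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=n)

def grad_mse(X, y, theta):
    """Gradient of MSE for linear regression: -(2/n) X^T (y - X theta)."""
    return -(2.0 / len(y)) * X.T @ (y - X @ theta)

theta = np.zeros(d)
eta, tau, max_iters = 0.05, 1e-6, 10_000
for t in range(max_iters):
    g = grad_mse(X, y, theta)
    if np.linalg.norm(g) < tau:  # stopping rule: gradient norm small
        break
    theta = theta - eta * g

print(theta)  # close to true_theta
```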

🔎 Derivation: gradient of MSE for linear regression (short)

Start with $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top\theta)^2$. Differentiate w.r.t. $\theta$:

$$ \nabla_\theta J(\theta) = \frac{1}{n}\sum 2 (y_i - \mathbf{x}_i^\top\theta)(- \mathbf{x}_i) = -\frac{2}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta). $$

So update: $\theta \leftarrow \theta + \frac{2\eta}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta)$.
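One way to sanity-check this derivation is a numerical gradient check: compare the analytic gradient against central finite differences (the data, shapes, and eps below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
theta = rng.normal(size=2)

J = lambda th: np.mean((y - X @ th) ** 2)            # MSE as a function of theta
analytic = -(2.0 / len(y)) * X.T @ (y - X @ theta)   # the derived gradient

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-9): derivation checks out
```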


βš–οΈ Strengths, Limitations & Trade-offs

MSE

Strengths

  • Smooth and differentiable everywhere → convenient for gradient-based methods.
  • Penalizes large errors strongly; good when large errors are particularly bad.
  • Leads to convex objective for linear models (globally optimizable).

Limitations

  • Sensitive to outliers (squared term amplifies them).
  • Not robust when error distribution has heavy tails.

MAE

Strengths

  • Robust to outliers (linear penalty).
  • Corresponds to median estimation (the constant predictor that minimizes MAE is the median of the targets).

Limitations

  • Non-differentiable at zero residual (subgradient methods needed); can slow optimization with naive gradient methods.
  • Less smooth objective → slightly harder to optimize with gradient-based optimizers that assume smoothness.

Trade-offs

  • Choose MSE when you want smooth gradients and penalize large deviations; choose MAE when robustness to outliers matters.
  • Common compromise: Huber loss, which is quadratic near 0 and linear in the tails (smooth + robust); a sketch follows below.
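A minimal sketch of the Huber loss in its standard piecewise form (delta = 1.0 is just a common default; tune it to your residual scale):

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta is tunable)."""
    r = np.abs(residuals)
    return np.where(r <= delta,
                    0.5 * r ** 2,               # MSE-like near zero
                    delta * (r - 0.5 * delta))  # MAE-like in the tails

print(huber(np.array([0.5, 3.0])))  # [0.125 2.5]
```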

πŸ” Variants & Extensions

  • Loss variants

    • Huber loss (robust + differentiable).
    • Log-cosh loss (smooth, behaves like MSE near 0, like MAE for large errors).
  • Optimization algorithms

    • Batch GD (full dataset per update).
    • Stochastic GD (SGD): one sample per update (high noise, frequent updates).
    • Mini-batch GD: a trade-off between gradient variance and computational efficiency.
    • Momentum, Nesterov momentum: accelerate along consistent gradient directions.
    • Adaptive optimizers (AdaGrad, RMSProp, Adam): adjust per-parameter learning rates.
  • Regularization (changes objective)

    • $L_2$ (Ridge): $J(\theta) + \lambda \|\theta\|_2^2$; reduces variance, useful with multicollinearity.
    • $L_1$ (Lasso): $J(\theta) + \lambda \|\theta\|_1$; induces sparsity. (A minimal gradient sketch for the ridge case follows this list.)
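As an illustration of how $L_2$ regularization changes the objective's gradient, a minimal ridge-gradient sketch (the function name and shapes are assumptions for this example):

```python
import numpy as np

def grad_ridge(X, y, theta, lam):
    """Gradient of MSE + lam * ||theta||_2^2: the penalty contributes 2 * lam * theta."""
    n = len(y)
    return -(2.0 / n) * X.T @ (y - X @ theta) + 2.0 * lam * theta

# For L1 (Lasso), a subgradient would use lam * np.sign(theta) instead,
# since ||theta||_1 is non-differentiable at zero.
```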

🚧 Common Challenges & Pitfalls

  • Learning rate selection

    • Too large → divergence/oscillation.
    • Too small → extremely slow convergence or getting trapped in shallow regions.
    • Practical: use learning rate schedules (decay), warm restarts, or adaptive optimizers (Adam), but be aware of their own pitfalls (e.g., Adam may generalize differently).
  • Poor scaling / feature variance

    • Features with very different scales cause an ill-conditioned Hessian → slow convergence. Always normalize/standardize features for GD-based methods (see the standardization sketch after this list).
  • Local minima vs saddle points

    • For convex MSE with linear models, a global minimum is guaranteed. For nonconvex models (deep nets), saddle points and local minima matter.
  • Overfitting

    • Minimizing training loss perfectly (very low MSE) can hurt test performance; use validation, regularization, early stopping.
  • Noisy gradients (SGD)

    • Variance in gradient estimates can slow convergence; use learning rate schedules, momentum, or larger mini-batches.
  • Interpreting loss magnitude

    • Absolute MSE value depends on target scale; compare relative improvements or use normalized metrics (e.g., RMSE, RΒ²).
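A minimal standardization sketch for the feature-scaling point above (toy data; in practice, fit the statistics on training data only and reuse them at test time):

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit variance (better GD conditioning)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X_train = np.array([[1.0, 1000.0],
                    [2.0, 2000.0],
                    [3.0, 3000.0]])
X_std, mu, sigma = standardize(X_train)
print(X_std.std(axis=0))  # [1. 1.]
```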

✅ Practical Tips (concise)

  • Start with a modest learning rate (e.g., $10^{-3}$; task dependent). Monitor the loss curve.
  • Scale inputs (zero mean, unit variance) to improve conditioning.
  • Use mini-batch GD (e.g., batch sizes 32–512) for efficiency and stable gradients.
  • Switch optimizers if tuning the learning rate is difficult: Adam is a robust default; consider SGD + momentum for better generalization in some settings.
  • Visualize loss vs. iterations; watch for divergence (loss shoots up) or a plateau (learning rate too small, or stuck).
  • If the loss diverges, reduce the learning rate by a factor of 10 and re-run. If progress is too slow, increase it carefully by a factor of 2–10.

πŸ” Probing Question β€” What happens if your learning rate is too high or too low?

  • Too high ($\eta$ large): updates overshoot minima; parameters oscillate or diverge; loss can increase or blow up to NaNs. In nonconvex settings, you can jump out of wide basins and miss good minima.
  • Too low ($\eta$ tiny): extremely slow progress; optimizer may appear stuck (tiny loss improvement per step); can get trapped in flat regions for a long time.
  • Practical tuning: use learning rate schedules (exponential decay, step decay), cyclical LR, or adaptive methods (Adam, RMSProp). Combine with momentum to damp oscillations if $\eta$ is large.
