Cost Function and Optimization: Linear Regression
🎯 Core Idea
- Cost Function & Optimization: choose a scalar loss that measures how well your model's predictions $\hat{y}$ match true targets $y$ (e.g., MSE), then minimize that loss over model parameters using optimization algorithms (e.g., Gradient Descent). The cost function defines the geometry of the problem; the optimizer traverses that landscape to find (local/global) minima.
🌱 Intuition & Real-World Analogies
- Why a cost function? You need a single number that tells you whether your model is "good" or "bad"; minimizing that number guides learning.
- Analogy 1, target practice: the cost is the distance from the bullseye (squared distance = MSE). You adjust your aim (the parameters) iteratively; the size of each correction is the learning rate.
- Analogy 2, slope descent on a mountain: the loss landscape is terrain, and gradient descent is walking downhill using the local slope. Too large a stride (large learning rate) and you might tumble past the valley; too small and you crawl forever.
📐 Mathematical Foundation
Mean Squared Error (MSE)
$$ \text{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i(\theta))^2 $$
- $n$: number of training examples.
- $y_i$: true target for example $i$.
- $\hat{y}_i(\theta)$: model prediction for example $i$ parameterized by $\theta$.
- MSE is the average squared residual (error).
Interpretation of terms
- Squaring penalizes large errors more heavily (quadratic penalty).
- Dividing by $n$ gives an average that does not grow with dataset size; sometimes $\frac{1}{2n}$ is used so the factor of 2 cancels in the gradient.
Mean Absolute Error (MAE)
$$ \text{MAE}(\theta) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i(\theta)| $$
- Linear penalty for errors; less sensitive to outliers than MSE.
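To make both definitions concrete, here is a minimal numpy sketch; the function names and sample values are illustrative, not from the original:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

# Illustrative values.
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat))  # 0.375
print(mae(y, y_hat))  # 0.5
```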
Gradient Descent (GD): basic update (batch GD)
Given current parameters $\theta^{(t)}$, learning rate $\eta$:
$$ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \, J(\theta^{(t)}) $$
where $J(\theta)$ is the cost (e.g., MSE) and $\nabla_\theta J(\theta)$ is its gradient vector (a code sketch follows the stopping rules below).
For linear regression with model $\hat{y}_i = \mathbf{x}_i^\top \theta$ and $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top \theta)^2$:
$$ \nabla_\theta J(\theta) = -\frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (y_i - \mathbf{x}_i^\top \theta) $$
(derivation in the collapsible section below).
Convergence criteria
Common stopping rules:
- $\|\theta^{(t+1)} - \theta^{(t)}\| < \epsilon$ (parameter change small).
- $|J(\theta^{(t+1)}) - J(\theta^{(t)})| < \delta$ (loss change small).
- Gradient norm $\|\nabla_\theta J(\theta)\| < \tau$.
- Fixed maximum number of iterations/epochs.
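Combining the update rule above with two of these stopping rules, here is a minimal batch-GD sketch for linear-regression MSE, assuming numpy; the function name, synthetic data, and hyperparameter values are illustrative:

```python
import numpy as np

def batch_gd(X, y, eta=0.1, max_iters=10_000, tau=1e-6):
    """Batch gradient descent on MSE for a linear model y_hat = X @ theta.

    Stops on a small gradient norm or after a fixed number of iterations
    (two of the stopping rules listed above).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        residuals = y - X @ theta                # r_i = y_i - x_i^T theta
        grad = -(2.0 / n) * X.T @ residuals      # gradient of MSE (formula above)
        if np.linalg.norm(grad) < tau:           # stopping rule: small gradient norm
            break
        theta -= eta * grad                      # update: theta <- theta - eta * grad
    return theta

# Usage on synthetic data (the column of ones plays the role of an intercept).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
print(batch_gd(X, y))  # close to [1.0, 2.0]
```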
📖 Derivation: gradient of MSE for linear regression (short)
Start with $J(\theta)=\frac{1}{n}\sum (y_i - \mathbf{x}_i^\top\theta)^2$. Differentiate w.r.t. $\theta$:
$$ \nabla_\theta J(\theta) = \frac{1}{n}\sum 2 (y_i - \mathbf{x}_i^\top\theta)(-\mathbf{x}_i) = -\frac{2}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta). $$
So the update is $\theta \leftarrow \theta + \frac{2\eta}{n}\sum \mathbf{x}_i (y_i - \mathbf{x}_i^\top\theta)$.
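One way to sanity-check this derivation is to compare the analytic gradient against central finite differences; a small sketch, assuming numpy (all names and data are illustrative):

```python
import numpy as np

def loss(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

def analytic_grad(theta, X, y):
    return -(2.0 / len(y)) * X.T @ (y - X @ theta)

def numeric_grad(theta, X, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * h)
    return g

rng = np.random.default_rng(1)
X, y, theta = rng.normal(size=(50, 3)), rng.normal(size=50), rng.normal(size=3)
# The two gradients should agree up to finite-difference error (~1e-8 here).
print(np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y))))
```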
⚖️ Strengths, Limitations & Trade-offs
MSE
Strengths
- Smooth and differentiable everywhere, which is convenient for gradient-based methods.
- Penalizes large errors strongly; good when large errors are particularly bad.
- Leads to convex objective for linear models (globally optimizable).
Limitations
- Sensitive to outliers (squared term amplifies them).
- Not robust when error distribution has heavy tails.
MAE
Strengths
- Robust to outliers (linear penalty).
- Lends itself to median estimation: for a constant predictor, the MAE-minimizing value is a median of the targets (a numerical check follows below).
Limitations
- Non-differentiable at zero residual (subgradient methods needed); can slow optimization with naive gradient methods.
- Less smooth objective, so slightly harder to optimize with gradient-based optimizers that assume smoothness.
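The mean/median contrast is easy to verify numerically for a constant predictor; a small grid-search sketch, assuming numpy, with illustrative data containing one outlier:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])     # targets with one large outlier
c = np.linspace(0.0, 100.0, 100_001)     # candidate constant predictions
mse_c = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
mae_c = np.abs(y[:, None] - c[None, :]).mean(axis=0)
print(c[mse_c.argmin()], y.mean())       # 26.5 26.5 -> MSE is minimized by the mean
# MAE is flat on [2, 3] for this even-sized sample; the median 2.5 lies in
# that minimizing interval (argmin returns its left edge, 2.0).
print(c[mae_c.argmin()], np.median(y))   # 2.0 2.5
```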
Trade-offs
- Choose MSE when you want smooth gradients and penalize large deviations; choose MAE when robustness to outliers matters.
- Common compromise: the Huber loss, quadratic near 0 and linear in the tails (smooth + robust); see the sketch below.
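A minimal sketch of the Huber loss as just described, assuming numpy; the threshold `delta` and the sample residuals are illustrative:

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (smooth + robust)."""
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([0.5, 3.0])))  # [0.125 2.5]: small error squared, large error linear
```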
🔄 Variants & Extensions
- Loss variants
  - Huber loss (robust + differentiable).
  - Log-cosh loss (smooth; behaves like MSE near 0 and like MAE for large errors).
- Optimization algorithms
  - Batch GD (full dataset per update).
  - Stochastic GD (SGD): one sample per update (high noise, frequent updates).
  - Mini-batch GD: a trade-off between gradient variance and computational efficiency.
  - Momentum, Nesterov momentum: accelerate along consistent gradient directions.
  - Adaptive optimizers (AdaGrad, RMSProp, Adam): adjust per-parameter learning rates.
- Regularization (changes the objective; a combined sketch follows this list)
  - $L_2$ (Ridge): $J(\theta) + \lambda \|\theta\|_2^2$; reduces variance, useful with multicollinearity.
  - $L_1$ (Lasso): $J(\theta) + \lambda \|\theta\|_1$; induces sparsity.
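As an illustration of two entries in this list, here is a minimal mini-batch SGD sketch with an $L_2$ (ridge) penalty, assuming numpy; the function name and hyperparameter values are illustrative:

```python
import numpy as np

def minibatch_sgd_ridge(X, y, eta=0.05, lam=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch SGD on the ridge objective J(theta) + lam * ||theta||_2^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -(2.0 / len(idx)) * Xb.T @ (yb - Xb @ theta)  # MSE part
            grad += 2.0 * lam * theta                 # gradient of lam * ||theta||_2^2
            theta -= eta * grad
    return theta
```

With `lam=0` this reduces to plain mini-batch SGD on the MSE objective; `batch_size=1` gives SGD, and `batch_size=n` recovers batch GD.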
🚧 Common Challenges & Pitfalls
- Learning rate selection
  - Too large: divergence or oscillation.
  - Too small: extremely slow convergence, or getting trapped in shallow regions.
  - Practical: use learning rate schedules (decay), warm restarts, or adaptive optimizers (Adam), but be aware of their own pitfalls (e.g., Adam may generalize differently).
- Poor scaling / feature variance
  - Features with very different scales cause an ill-conditioned Hessian and slow convergence. Always normalize/standardize features for GD-based methods (see the sketch after this list).
- Local minima vs saddle points
  - For convex MSE + linear models, the global minimum is guaranteed. For nonconvex models (deep nets), saddle points and local minima matter.
- Overfitting
  - Minimizing training loss perfectly (very low MSE) can hurt test performance; use validation, regularization, and early stopping.
- Noisy gradients (SGD)
  - Variance in gradient estimates can slow convergence; use learning rate schedules, momentum, or larger mini-batches.
- Interpreting loss magnitude
  - The absolute MSE value depends on the target scale; compare relative improvements or use normalized metrics (e.g., RMSE, $R^2$).
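Following the scaling point above, a minimal standardization sketch, assuming numpy; the helper `standardize` is illustrative, and statistics are computed on training data only to avoid leakage:

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling; statistics come from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])  # wildly different scales
X_train_s, X_test_s = standardize(X[:80], X[80:])
print(X_train_s.std(axis=0))  # ~[1. 1.]: comparable scales, better conditioning
```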
✅ Practical Tips (concise)
- Start with a modest learning rate (e.g., $10^{-3}$; task dependent) and monitor the loss curve.
- Scale inputs (zero mean, unit variance) to improve conditioning.
- Use mini-batch GD (e.g., batch sizes of 32–512) for efficiency and stable gradients.
- Switch optimizers if tuning the learning rate is difficult: Adam is a robust default; consider SGD+momentum for better generalization in some settings.
- Visualize loss vs. iterations; watch for divergence (loss shoots up) or a plateau (learning rate too small, or stuck).
- If the loss diverges, reduce the learning rate by a factor of 10 and re-run; if progress is too slow, increase it by a factor of 2–10, carefully.
🤔 Probing Question: What happens if your learning rate is too high or too low?
- Too high ($\eta$ large): updates overshoot minima; parameters oscillate or diverge; loss can increase or blow up to NaNs. In nonconvex settings, you can jump out of wide basins and miss good minima.
- Too low ($\eta$ tiny): extremely slow progress; optimizer may appear stuck (tiny loss improvement per step); can get trapped in flat regions for a long time.
- Practical tuning: use learning rate schedules (exponential decay, step decay), cyclical LR, or adaptive methods (Adam, RMSProp). Combine with momentum to damp oscillations if $\eta$ is large.
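A toy demonstration of all three regimes on the 1-D quadratic $J(\theta) = \theta^2$, where GD has the closed form $\theta_t = (1 - 2\eta)^t \theta_0$; the rates and step counts below are illustrative:

```python
# GD on J(theta) = theta^2 (gradient 2*theta). The iterate is
# theta_t = (1 - 2*eta)**t * theta_0, so |1 - 2*eta| > 1 means divergence.
def run(eta, steps=20, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta   # theta <- theta - eta * grad J(theta)
    return theta

print(run(eta=1.1))    # |1 - 2.2| = 1.2 > 1: oscillates and diverges (~38.3)
print(run(eta=0.001))  # 0.998**20 ~ 0.96: barely moved after 20 steps
print(run(eta=0.4))    # 0.2**20 ~ 1e-14: essentially at the minimum
```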
📚 Reference Pointers
- MSE, MAE, and robust alternatives: Wikipedia, "Loss function". https://en.wikipedia.org/wiki/Loss_function
- Huber loss: Wikipedia, "Huber loss". https://en.wikipedia.org/wiki/Huber_loss
- Optimization basics & algorithms: Goodfellow, Bengio & Courville, Deep Learning (Ch. 8, Optimization). https://www.deeplearningbook.org/
- Convexity & linear regression theory: Bishop, C. M., Pattern Recognition and Machine Learning. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
- Practical SGD tricks (momentum, Adam): Reddi et al., "On the Convergence of Adam and Beyond". https://arxiv.org/abs/1904.09237
- Regularization effects: Hastie, Tibshirani & Friedman, The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn/