Linear Models - Loss Functions


Understanding loss functions is one of the most decisive indicators of whether a candidate truly grasps how machine learning models learn.

In interviews, questions around loss functions test your mathematical maturity, intuition about optimization, and ability to choose the right objective for the task (regression vs classification, robust vs sensitive, convex vs non-convex).

You’re expected to connect equations → gradients → decision boundaries → real-world behavior.

🤖 Core Regression Loss Functions

1️⃣ Mean Squared Error (MSE)

Note

The Top Tech Company Angle (MSE):
This is the default cost function for regression problems. Interviewers use it to evaluate your understanding of convex optimization, gradient computation, and why squaring errors provides differentiability advantages but poor robustness to outliers.

Learning Steps

  1. Understand the formulation:
    \( \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \)
  2. Derive the gradient of MSE with respect to model weights.
    Practice differentiating the loss with respect to \( \beta \):
    \( \frac{\partial L}{\partial \beta} = -\frac{2}{n} X^T(y - X\beta) \)
  3. Visualize the loss surface — note how convexity ensures a single global minimum.
  4. Implement from scratch using NumPy’s matrix operations. Confirm gradient descent convergence.
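Here is how steps 2–4 might look in NumPy. The synthetic dataset, learning rate, and iteration count below are illustrative choices, not prescriptions:

```python
import numpy as np

# Illustrative synthetic data: y = X @ true_beta + small Gaussian noise
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

def mse(beta):
    return np.mean((y - X @ beta) ** 2)

def mse_grad(beta):
    # dL/dbeta = -(2/n) * X^T (y - X beta), as derived above
    return -(2.0 / n) * X.T @ (y - X @ beta)

beta = np.zeros(d)
lr = 0.1
for _ in range(500):
    beta -= lr * mse_grad(beta)

print("recovered beta:", beta)  # converges toward true_beta (convex objective)
print("final MSE:", mse(beta))
```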

Deeper Insight:
MSE penalizes large errors more severely. Discuss why this is beneficial for normally distributed residuals but harmful when data includes extreme outliers.
Probing Question: “If your model overfits, how would regularization modify the MSE objective?”
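One standard answer, for reference: ridge (L2) regularization adds a squared-norm penalty to the MSE objective, shrinking weights and discouraging overfitting:
\( L(\beta) = \frac{1}{n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \)
Lasso (L1) swaps the penalty for \( \lambda \|\beta\|_1 \), which additionally drives some weights exactly to zero.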


2️⃣ Mean Absolute Error (MAE)

Note

The Top Tech Company Angle (MAE):
MAE questions probe robustness to outliers and the non-differentiability at zero, testing your understanding of optimization beyond smooth convex functions.

Learning Steps

  1. Write the loss expression:
    \( \text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \)
  2. Explore why the median is the optimal constant estimator under the L1 norm (it minimizes the mean absolute deviation, just as the mean minimizes MSE).
  3. Discuss subgradients and how optimizers handle non-differentiability.
  4. Compare convergence behavior vs. MSE under gradient-based updates.
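A sketch of steps 3–4 using the subgradient sign(y − ŷ): at a zero residual any value in [−1, 1] is a valid subgradient, and NumPy’s sign() conveniently returns 0 there. The data and step-size schedule are illustrative:

```python
import numpy as np

# Same illustrative setup as the MSE sketch above
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

def mae_subgrad(beta):
    # Subgradient of (1/n) * sum |y - X beta| is -(1/n) * X^T sign(y - X beta);
    # sign(0) = 0 is one valid choice from the subdifferential [-1, 1].
    return -(1.0 / n) * X.T @ np.sign(y - X @ beta)

beta = np.zeros(d)
for t in range(2000):
    lr = 0.5 / np.sqrt(t + 1)       # decaying steps: the subgradient magnitude
    beta -= lr * mae_subgrad(beta)  # never shrinks, unlike the MSE gradient

print("recovered beta:", beta)
```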

Deeper Insight:
MAE is resistant to outliers but converges more slowly, because its gradient magnitude stays constant and never shrinks as the optimum is approached.
Probing Question: “Why do some implementations use Smooth-L1 loss as a compromise?”


3️⃣ Root Mean Squared Error (RMSE)

Note

The Top Tech Company Angle (RMSE):
RMSE measures error in the same unit as the target variable, making it more interpretable. It’s often used in evaluation metrics discussions, testing whether you can reason about scale sensitivity.

Learning Steps

  1. Understand RMSE:
    \( \text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2} \)
  2. Recognize it as the square root of MSE and connect their gradients via the chain rule (see the snippet after this list).
  3. Discuss interpretability trade-offs — same metric units vs. differentiability concerns.
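Since \( \text{RMSE} = \sqrt{\text{MSE}} \), the chain rule gives \( \frac{\partial \text{RMSE}}{\partial \beta} = \frac{1}{2\,\text{RMSE}} \frac{\partial \text{MSE}}{\partial \beta} \), so the two share a minimizer and differ only in gradient scaling. A tiny illustrative check:

```python
import numpy as np

def rmse(y, y_hat):
    # Square root of MSE: same minimizer, but reported in the target's units
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Illustrative values
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print("MSE :", np.mean((y - y_hat) ** 2))  # squared target units
print("RMSE:", rmse(y, y_hat))             # original target units
```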

Deeper Insight:
RMSE inherits MSE’s emphasis on large errors from the squaring; the square root only rescales the result back to the target’s units and leaves the minimizer unchanged.
Probing Question: “Why is RMSE preferred in competitions like Kaggle, but not in parameter optimization?”


4️⃣ Huber Loss

Note

The Top Tech Company Angle (Huber Loss):
Huber Loss is a hybrid — quadratic near zero (like MSE) and linear beyond a threshold (like MAE). It tests whether you can reason about robust optimization and tunable sensitivity.

Learning Steps

  1. Study the formulation:
    \( L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2, & \text{if } |y - \hat{y}| \le \delta \\ \delta(|y - \hat{y}| - \frac{1}{2}\delta), & \text{otherwise} \end{cases} \)
  2. Visualize how the curve transitions smoothly from quadratic to linear.
  3. Implement and experiment with different δ values to balance robustness and sensitivity (see the sketch after this list).
  4. Relate to real-world use cases (e.g., noisy data with occasional large errors).
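A minimal implementation of the piecewise definition above; the residuals and δ values are illustrative:

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    residual = y - y_hat
    quadratic = 0.5 * residual ** 2                    # MSE-like near zero
    linear = delta * (np.abs(residual) - 0.5 * delta)  # MAE-like in the tails
    return np.where(np.abs(residual) <= delta, quadratic, linear)

errors = np.array([0.1, 0.5, 2.0, 10.0])  # last entry mimics an outlier
for delta in (0.5, 1.0, 5.0):             # smaller delta -> more MAE-like
    print(delta, huber(errors, np.zeros_like(errors), delta))
```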

Deeper Insight:
Huber combines smoothness with robustness — an excellent talking point for optimization trade-offs.
Probing Question: “How does δ act as a hyperparameter controlling bias-variance trade-off?”


🧠 Core Classification Loss Functions

5️⃣ Log Loss (Binary Cross-Entropy)

Note

The Top Tech Company Angle (Log Loss):
This is the lifeblood of logistic regression. Interviewers use it to test your understanding of probabilistic modeling, sigmoid transformation, and likelihood maximization.

Learning Steps

  1. Recall logistic regression output: \( \hat{y} = \sigma(X\beta) = \frac{1}{1 + e^{-X\beta}} \)
  2. Define the loss:
    \( L = -\frac{1}{n}\sum [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)] \)
  3. Derive gradients and show how minimizing the loss corresponds to maximizing the log-likelihood (a sketch follows this list).
  4. Plot loss vs. probability — observe how confident misclassifications are heavily penalized.
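A sketch of steps 1–3. Note how the gradient collapses to \( \frac{1}{n}X^T(\hat{y} - y) \), structurally similar to the MSE gradient. The dataset and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def log_loss_grad(X, y, beta):
    # dL/dbeta = (1/n) * X^T (sigmoid(X beta) - y)
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

# Illustrative linearly separable data
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

beta = np.zeros(2)
for _ in range(1000):
    beta -= 0.5 * log_loss_grad(X, y, beta)
print("log loss:", log_loss(y, sigmoid(X @ beta)))
```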

Deeper Insight:
Cross-entropy punishes overconfident wrong predictions, ensuring probabilistic calibration.
Probing Question: “Why can’t we use MSE for logistic regression?”


6️⃣ Categorical Cross-Entropy

Note

The Top Tech Company Angle (Categorical Cross-Entropy):
Evaluates your grasp of multiclass classification, softmax outputs, and information-theoretic principles.

Learning Steps

  1. Define softmax output:
    \( \hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \)
  2. Write the loss:
    \( L = -\sum y_i \log(\hat{y}_i) \)
  3. Explain why this measures divergence between true and predicted distributions.
  4. Implement from scratch to demonstrate numerical stability using log-sum-exp trick.
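A sketch of step 4: computing log-softmax via the log-sum-exp trick, so that large logits don’t overflow exp(). The logits below are deliberately extreme to show the point:

```python
import numpy as np

def log_softmax(z):
    # Subtracting max(z) leaves softmax unchanged but keeps exp() in range
    z_shifted = z - np.max(z)
    return z_shifted - np.log(np.sum(np.exp(z_shifted)))

def categorical_cross_entropy(z, y_onehot):
    # L = -sum_i y_i * log(softmax(z)_i), computed in log space
    return -np.sum(y_onehot * log_softmax(z))

z = np.array([1000.0, 1001.0, 1002.0])  # naive softmax overflows here
y_onehot = np.array([0.0, 0.0, 1.0])
print(categorical_cross_entropy(z, y_onehot))  # finite and stable
```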

Deeper Insight:
This connects to information theory (minimizing KL divergence).
Probing Question: “Why does naive softmax overflow for large logits, and how does the log-sum-exp trick preserve numerical stability without changing gradients?”


⚔️ Margin-Based Loss Function

7️⃣ Hinge Loss

Note

The Top Tech Company Angle (Hinge Loss):
Used in SVMs and linear classifiers, this loss tests your ability to reason about margin-based optimization and non-probabilistic classification.

Learning Steps

  1. Define hinge loss:
    \( L = \max(0, 1 - y_i \cdot (w^T x_i)) \), with labels \( y_i \in \{-1, +1\} \)
  2. Understand the geometric interpretation — encouraging a margin of at least 1.
  3. Visualize misclassified vs correctly classified samples and how the gradient behaves.
  4. Implement and compare to logistic loss — discuss why hinge loss doesn’t output probabilities.
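A sketch of steps 1 and 3–4 with {−1, +1} labels; samples outside the margin contribute zero subgradient. Data and step size are illustrative:

```python
import numpy as np

def hinge_loss(w, X, y):
    # y in {-1, +1}; a margin of at least 1 contributes zero loss
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def hinge_subgrad(w, X, y):
    # Subgradient: -(1/n) * sum of y_i * x_i over margin violators, else 0
    violating = y * (X @ w) < 1.0
    return -(X[violating] * y[violating, None]).sum(axis=0) / len(y)

# Illustrative separable data
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * hinge_subgrad(w, X, y)
print("hinge loss:", hinge_loss(w, X, y))  # drives margin violations to zero
```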

Deeper Insight:
Hinge loss is not differentiable at the margin but still convex — showing your grasp of subgradients is key.
Probing Question: “How would you modify hinge loss for multiclass SVMs?”


🧩 Integration and Interview Depth

Note

Final Insight:
The ability to map data type → task → loss → optimizer demonstrates interview-ready mastery.
Top interviewers expect you to explain not just what loss you use, but why — in terms of robustness, convexity, interpretability, and gradient behavior.

Recommended Capstone Steps

  1. Create a unified notebook comparing all regression losses on synthetic noisy data (a starting sketch follows this list).
  2. Create another comparing classification losses on separable vs overlapping datasets.
  3. Analyze loss landscapes and gradient magnitudes for each case.
  4. Be prepared to discuss practical criteria for loss function selection.
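As a starting point for step 1, one way to score all four regression losses on the same noisy predictions; the data and δ below are illustrative:

```python
import numpy as np

# Illustrative predictions with a handful of injected outliers
rng = np.random.default_rng(4)
y = rng.normal(size=200)
y_hat = y + rng.normal(scale=0.3, size=200)
y_hat[:5] += 10.0  # extreme outliers

residual = y - y_hat
delta = 1.0
losses = {
    "MSE":   np.mean(residual ** 2),
    "RMSE":  np.sqrt(np.mean(residual ** 2)),
    "MAE":   np.mean(np.abs(residual)),
    "Huber": np.mean(np.where(np.abs(residual) <= delta,
                              0.5 * residual ** 2,
                              delta * (np.abs(residual) - 0.5 * delta))),
}
for name, value in losses.items():
    print(f"{name:6s} {value:.4f}")  # outliers inflate MSE/RMSE the most
```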