2. Mean Absolute Error (MAE)
🪄 Step 1: Intuition & Motivation
Core Idea: The Mean Absolute Error (MAE) measures the average magnitude of your model’s mistakes — how far, on average, predictions are from the actual values — without worrying whether those errors are positive or negative.
Simple Analogy: Imagine you’re driving on a highway where each wrong turn costs you distance. Instead of squaring the mistakes (like MSE does), you just measure how many kilometers you’ve gone off route — no drama, just the distance. That’s MAE — calm, fair, and not easily panicked by outliers.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
MAE looks at every prediction, finds the difference between what was predicted ($\hat{y}_i$) and what actually happened ($y_i$), takes its absolute value (so we don’t care about direction), and then averages them all.
So, if your model’s predictions are consistently 2 units off — whether above or below — MAE will report 2. It weighs every unit of error equally, with no extra penalty for large misses and no discount for small ones.
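To make that concrete, here is a minimal NumPy sketch of exactly those steps: differences, absolute values, then the average. The array values are made up purely for illustration.

```python
import numpy as np

# Hypothetical predictions and ground-truth targets (illustrative only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

errors = y_pred - y_true      # signed differences
abs_errors = np.abs(errors)   # drop the sign: direction no longer matters
mae = abs_errors.mean()       # average the magnitudes

print(abs_errors)  # [0.5 0.5 0.  1. ]
print(mae)         # 0.5
```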
Why It Works This Way
Unlike MSE, which squares errors (making large mistakes explode), MAE is linear in the error size. That means:
- A 10-unit mistake hurts exactly twice as much as a 5-unit mistake.
- Big outliers don’t dominate the learning process.
This property makes MAE a robust loss — it’s much less sensitive to noisy or messy data.
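A quick way to see this robustness is to compare how MAE and MSE react when a single target is corrupted by an outlier; the numbers below are toy values chosen only to illustrate the effect.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0])   # every prediction is 1 unit off

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print(mae(y_true, y_pred), mse(y_true, y_pred))   # 1.0 1.0

# Corrupt one target so that its prediction is now 20 units off
y_outlier = y_true.copy()
y_outlier[0] = 31.0

print(mae(y_outlier, y_pred), mse(y_outlier, y_pred))  # 5.75 100.75
```

The single outlier multiplies MSE by roughly a hundred, while MAE only grows linearly — exactly the calm behavior described above.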
How It Fits in ML Thinking
MAE is the canonical L1 loss for regression. It sits alongside MSE (the L2 loss) as one of the two workhorse objectives, and it is the natural entry point to robust, hybrid losses such as Huber and Smooth-L1, which blend the behavior of both.
📐 Step 3: Mathematical Foundation
Mean Absolute Error Formula
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
- $n$ → Number of observations
- $y_i$ → Actual target value for the $i^{th}$ observation
- $\hat{y}_i$ → Model prediction for the same observation
- $|\cdot|$ → Absolute value, which ensures all errors count equally regardless of sign
MAE reports the average distance between predictions and ground truth — no exaggeration, just a straight count of how wrong you are.
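A tiny worked example (with made-up numbers): if the actual values are $2, 5, 9$ and the predictions are $3, 5, 7$, then
$$\text{MAE} = \frac{|2-3| + |5-5| + |9-7|}{3} = \frac{1 + 0 + 2}{3} = 1,$$
so the model is, on average, 1 unit away from the truth, in the same units as the target itself.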
The Median as the Optimal Estimator
Under the L1 norm, the best estimate that minimizes MAE is the median, not the mean.
- Why? Because the median minimizes total absolute deviation: with half the data on either side, shifting the estimate in either direction moves it away from at least as many points as it moves it toward.
- So any attempt to shift your model’s predictions off the median can only increase the total deviation, never decrease it.
This is why, when using MAE, your model tends to align with the median, a measure of central tendency that is far more robust to outliers than the mean.
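You can check this numerically with a brute-force sweep over candidate constant predictions; the data below are arbitrary toy values. The sum of absolute deviations bottoms out at the median, while the sum of squared deviations bottoms out at the mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # note the outlier at 100

candidates = np.linspace(0, 110, 11001)       # candidate constant predictions
abs_loss = np.array([np.sum(np.abs(y - c)) for c in candidates])
sq_loss  = np.array([np.sum((y - c) ** 2) for c in candidates])

best_abs = candidates[abs_loss.argmin()]
best_sq  = candidates[sq_loss.argmin()]

print(best_abs, np.median(y))   # ~3.0   3.0   -> L1 optimum is the median
print(best_sq,  np.mean(y))     # ~22.0  22.0  -> L2 optimum is the mean
```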
Subgradients and Optimization
MAE’s absolute value term makes it non-differentiable at 0 (where the error changes from positive to negative). But optimizers handle this using subgradients, which essentially say:
“If you’re exactly at 0, pick any slope between -1 and 1 — they all work fine.”
In practice, optimization still proceeds smoothly, though convergence is often slower than with MSE because the gradient magnitude stays constant (±1 per sample) rather than shrinking as the error shrinks.
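A minimal sketch of what that looks like in practice: plain (sub)gradient descent on the MAE of a one-parameter linear model, using np.sign as the subgradient of the absolute value (with sign(0) = 0 as the conventional choice at the kink). The data, learning rate, and step count are illustrative assumptions, not a tuned recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # true slope is roughly 3

w = 0.0      # single learnable weight
lr = 0.05    # learning rate

for step in range(500):
    residual = y - w * x
    # Subgradient of mean(|residual|) w.r.t. w; np.sign(0) = 0 handles the kink
    grad = -np.mean(np.sign(residual) * x)
    w -= lr * grad

print(w)   # ends up near the true slope of 3
```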
🧠 Step 4: Assumptions or Key Ideas
- Linear Relationship (approximate): When MAE is paired with a linear model, the underlying mapping between inputs and outputs is assumed to be roughly linear, though MAE is less strict about violations of this than MSE.
- Outlier Presence: MAE assumes data may contain noise or outliers and stays stable in such scenarios.
- Equal Error Weighting: Every unit of error is weighted equally, so each mistake contributes to the total loss in direct proportion to its size.
These assumptions make MAE particularly valuable for robust regression — where reliability beats sensitivity.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Robust to outliers — big mistakes don’t dominate.
- Easy to interpret — same units as the target variable.
- Reflects the median tendency → less skewed by extremes.
Limitations:
- Not differentiable at zero — optimization may be slower or less stable.
- Constant gradient magnitude → sluggish convergence.
- Can undervalue large errors that might matter in high-stakes predictions.
🚧 Step 6: Common Misunderstandings
- “MAE is just like MSE without squares” → Not quite! Removing the squares changes the geometry: MAE is minimized by the median of the targets, whereas MSE is minimized by their mean.
- “MAE is always better” → It depends. If the data has few outliers, MSE converges faster and gives smoother optimization.
- “MAE can’t be optimized” → It can, using subgradients or specialized algorithms like coordinate descent.
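One concrete illustration of that last point, sketched with scikit-learn (assuming scikit-learn ≥ 1.0 is available): QuantileRegressor with quantile=0.5 fits a linear model whose objective is proportional to MAE, solved as a linear program, and mean_absolute_error then reports the resulting loss. The data here are synthetic toy values.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=200)
y[:5] += 30.0                        # inject a handful of large outliers

# quantile=0.5 makes the pinball loss proportional to the absolute error;
# alpha=0 turns off the built-in L1 regularization
model = QuantileRegressor(quantile=0.5, alpha=0.0)
model.fit(X, y)

print(model.coef_)                               # slope stays near 2 despite outliers
print(mean_absolute_error(y, model.predict(X)))  # average absolute residual
```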
🧩 Step 7: Mini Summary
🧠 What You Learned: MAE measures the average absolute distance between predictions and actual values — making it more robust than MSE to noisy or extreme data.
⚙️ How It Works: By using the L1 norm, it treats every error equally and converges toward the median of the target distribution.
🎯 Why It Matters: MAE introduces the concept of robust optimization, a key stepping stone toward understanding hybrid loss functions like Huber Loss and Smooth-L1.