2. Mean Absolute Error (MAE)
🪄 Step 1: Intuition & Motivation
Core Idea: The Mean Absolute Error (MAE) measures the average magnitude of your model’s mistakes — how far, on average, predictions are from the actual values — without worrying whether those errors are positive or negative.
Simple Analogy: Imagine you’re driving on a highway where each wrong turn costs you distance. Instead of squaring the mistakes (like MSE does), you just measure how many kilometers you’ve gone off route — no drama, just the distance. That’s MAE — calm, fair, and not easily panicked by outliers.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
MAE looks at every prediction, finds the difference between what was predicted ($\hat{y}_i$) and what actually happened ($y_i$), takes its absolute value (so we don’t care about direction), and then averages them all.
So, if your model’s predictions are consistently 2 units off — whether above or below — MAE will report 2. It weighs every unit of error equally, with no extra penalty for large misses and no discount for small ones.
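To make that concrete, here is a minimal NumPy sketch of exactly those steps: differences, absolute values, then the average. The array values are made up purely for illustration.

```python
import numpy as np

# Hypothetical predictions and ground-truth targets (illustrative only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

errors = y_pred - y_true      # signed differences
abs_errors = np.abs(errors)   # drop the sign: direction no longer matters
mae = abs_errors.mean()       # average the magnitudes

print(abs_errors)  # [0.5 0.5 0.  1. ]
print(mae)         # 0.5
```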
Why It Works This Way
Unlike MSE, which squares errors (making large mistakes explode), MAE is linear in the error size. That means:
- A 10-unit mistake hurts exactly twice as much as a 5-unit mistake.
- Big outliers don’t dominate the learning process.
This property makes MAE a robust loss — it’s much less sensitive to noisy or messy data.
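A quick way to see this robustness is to compare how MAE and MSE react when a single target is corrupted by an outlier; the numbers below are toy values chosen only to illustrate the effect.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0])   # every prediction is 1 unit off

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print(mae(y_true, y_pred), mse(y_true, y_pred))   # 1.0 1.0

# Corrupt one target so that its prediction is now 20 units off
y_outlier = y_true.copy()
y_outlier[0] = 31.0

print(mae(y_outlier, y_pred), mse(y_outlier, y_pred))  # 5.75 100.75
```

The single outlier multiplies MSE by roughly a hundred, while MAE only grows linearly — exactly the calm behavior described above.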
How It Fits in ML Thinking
MAE is the canonical L1 loss for regression. It sits alongside MSE (the L2 loss) as one of the two workhorse objectives, and it is the natural entry point to robust, hybrid losses such as Huber and Smooth-L1, which blend the behavior of both.
📐 Step 3: Mathematical Foundation
Mean Absolute Error Formula
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
- $n$ → Number of observations
- $y_i$ → Actual target value for the $i^{th}$ observation
- $\hat{y}_i$ → Model prediction for the same observation
- $|\cdot|$ → Absolute value, which ensures all errors count equally regardless of sign
MAE reports the average distance between predictions and ground truth — no exaggeration, just a straight count of how wrong you are.
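A tiny worked example (with made-up numbers): if the actual values are $2, 5, 9$ and the predictions are $3, 5, 7$, then
$$\text{MAE} = \frac{|2-3| + |5-5| + |9-7|}{3} = \frac{1 + 0 + 2}{3} = 1,$$
so the model is, on average, 1 unit away from the truth, in the same units as the target itself.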
The Median as the Optimal Estimator
Under the L1 norm, the best estimate that minimizes MAE is the median, not the mean.
- Why? Because the median minimizes total absolute deviation: with half the data on either side, shifting the estimate in either direction moves it away from at least as many points as it moves it toward.
- So any attempt to shift your model’s predictions off the median can only increase the total deviation, never decrease it.
This is why, when using MAE, your model tends to align with the median, a measure of central tendency that is far more robust to outliers than the mean.
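You can check this numerically with a brute-force sweep over candidate constant predictions; the data below are arbitrary toy values. The sum of absolute deviations bottoms out at the median, while the sum of squared deviations bottoms out at the mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # note the outlier at 100

candidates = np.linspace(0, 110, 11001)       # candidate constant predictions
abs_loss = np.array([np.sum(np.abs(y - c)) for c in candidates])
sq_loss  = np.array([np.sum((y - c) ** 2) for c in candidates])

best_abs = candidates[abs_loss.argmin()]
best_sq  = candidates[sq_loss.argmin()]

print(best_abs, np.median(y))   # ~3.0   3.0   -> L1 optimum is the median
print(best_sq,  np.mean(y))     # ~22.0  22.0  -> L2 optimum is the mean
```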
Subgradients and Optimization
MAE’s absolute value term makes it non-differentiable at 0 (where the error changes from positive to negative). But optimizers handle this using subgradients, which essentially say:
“If you’re exactly at 0, pick any slope between -1 and 1 — they all work fine.”
In practice, optimization still proceeds smoothly, though convergence is often slower than with MSE because the gradient magnitude stays constant (±1 per sample) rather than shrinking as the error shrinks.
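A minimal sketch of what that looks like in practice: plain (sub)gradient descent on the MAE of a one-parameter linear model, using np.sign as the subgradient of the absolute value (with sign(0) = 0 as the conventional choice at the kink). The data, learning rate, and step count are illustrative assumptions, not a tuned recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # true slope is roughly 3

w = 0.0      # single learnable weight
lr = 0.05    # learning rate

for step in range(500):
    residual = y - w * x
    # Subgradient of mean(|residual|) w.r.t. w; np.sign(0) = 0 handles the kink
    grad = -np.mean(np.sign(residual) * x)
    w -= lr * grad

print(w)   # ends up near the true slope of 3
```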
🧠 Step 4: Assumptions or Key Ideas
- Linear Relationship (approximate): When MAE is paired with a linear model, the underlying mapping between inputs and outputs is assumed to be roughly linear, though MAE is less strict about violations of this than MSE.
- Outlier Presence: MAE assumes data may contain noise or outliers and stays stable in such scenarios.
- Equal Error Weighting: Every unit of error is weighted equally, so each mistake contributes to the total loss in direct proportion to its size.
These assumptions make MAE particularly valuable for robust regression — where reliability beats sensitivity.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Robust to outliers — big mistakes don’t dominate.
- Easy to interpret — same units as the target variable.
- Reflects the median tendency → less skewed by extremes.
Limitations:
- Not differentiable at zero — optimization may be slower or less stable.
- Constant gradient magnitude → sluggish convergence.
- Can undervalue large errors that might matter in high-stakes predictions.
🚧 Step 6: Common Misunderstandings
- “MAE is just like MSE without squares” → Not quite! Removing the squares changes the geometry: MAE is minimized by the median of the targets, whereas MSE is minimized by their mean.
- “MAE is always better” → It depends. If the data has few outliers, MSE converges faster and gives smoother optimization.
- “MAE can’t be optimized” → It can, using subgradients or specialized algorithms like coordinate descent.
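One concrete illustration of that last point, sketched with scikit-learn (assuming scikit-learn ≥ 1.0 is available): QuantileRegressor with quantile=0.5 fits a linear model whose objective is proportional to MAE, solved as a linear program, and mean_absolute_error then reports the resulting loss. The data here are synthetic toy values.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=200)
y[:5] += 30.0                        # inject a handful of large outliers

# quantile=0.5 makes the pinball loss proportional to the absolute error;
# alpha=0 turns off the built-in L1 regularization
model = QuantileRegressor(quantile=0.5, alpha=0.0)
model.fit(X, y)

print(model.coef_)                               # slope stays near 2 despite outliers
print(mean_absolute_error(y, model.predict(X)))  # average absolute residual
```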
🧩 Step 7: Mini Summary
🧠 What You Learned: MAE measures the average absolute distance between predictions and actual values — making it more robust than MSE to noisy or extreme data.
⚙️ How It Works: By using the L1 norm, it treats every error equally and converges toward the median of the target distribution.
🎯 Why It Matters: MAE introduces the concept of robust optimization, a key stepping stone toward understanding hybrid loss functions like Huber Loss and Smooth-L1.