3.2. Maximum Likelihood Estimation (MLE)
🪄 Step 1: Intuition & Motivation
Core Idea: Maximum Likelihood Estimation (MLE) is how we make data teach us the best parameters for a statistical model. Instead of guessing, MLE finds the parameter values that make our observed data most probable under the model.
Simple Analogy: Imagine you’re a detective trying to find the most likely cause of a crime. You have clues (data) and multiple suspects (possible parameter values). MLE is like asking:
“Which suspect makes the evidence I have most likely to occur?” The answer: the parameter that maximizes the likelihood of observing your data.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Given a model with unknown parameters $\theta$ and observed data $x_1, x_2, \dots, x_n$, we define the likelihood function as:
$$ L(\theta) = P(x_1, x_2, \dots, x_n | \theta) $$
It measures how probable our data is for each possible value of $\theta$. The Maximum Likelihood Estimate is the value $\hat{\theta}$ that maximizes $L(\theta)$:
$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) $$
Because probabilities multiply, we usually take the log (to make it additive and numerically stable):
$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x_i | \theta) $$
Then we find $\hat{\theta}$ by setting $\frac{d\ell(\theta)}{d\theta} = 0$.
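To make the "likelihood as a function of $\theta$" idea concrete, here is a minimal sketch (assuming NumPy/SciPy and a hypothetical count dataset modeled as Poisson, a model not derived in this section) that evaluates $\ell(\lambda)$ on a grid and picks the maximizer; for the Poisson model this lands on the sample mean, which is also its closed-form MLE.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical event counts, assumed to follow a Poisson(lambda) model
x = np.array([2, 3, 1, 4, 2, 5, 3, 2])

# Evaluate the log-likelihood on a grid of candidate lambda values
lambdas = np.linspace(0.1, 10, 1000)
log_lik = np.array([
    np.sum(x * np.log(lam) - lam - gammaln(x + 1))  # sum_i log P(x_i | lambda)
    for lam in lambdas
])

lam_hat = lambdas[np.argmax(log_lik)]
print(f"Grid MLE: {lam_hat:.2f}   sample mean: {x.mean():.2f}")  # both ≈ 2.75
```

Grid search only works for one or two parameters, but it makes the definition tangible: every candidate $\theta$ gets a score, and the MLE is the candidate with the highest score.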
Why It Works This Way
MLE chooses parameters that make the observed data most typical under the assumed model.
It doesn’t try to make predictions yet — it first says,
“If this model were true, what parameter values would have most likely produced what I actually saw?”
This is why it’s foundational for everything from regression to deep learning: training a probabilistic model amounts to maximizing the likelihood (or, equivalently, minimizing the negative log-likelihood).
How It Fits in ML Thinking
MLE underlies most modern machine learning:
- Linear regression → equivalent to MLE under Gaussian noise.
- Logistic regression → MLE under Bernoulli likelihood.
- Neural networks → trained by minimizing negative log-likelihood (cross-entropy).
Understanding MLE gives you the theoretical backbone for why loss functions look the way they do.
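As a small illustration of the first bullet above, the following sketch (assuming NumPy, with hypothetical targets and candidate predictions) shows that a Gaussian negative log-likelihood with fixed noise variance ranks predictions exactly as the sum of squared errors does, which is why least-squares regression corresponds to MLE under Gaussian noise.

```python
import numpy as np

# Hypothetical regression targets and two candidate prediction vectors
y = np.array([1.2, 0.7, 2.3, 1.9])
preds_a = np.array([1.0, 0.8, 2.0, 2.1])
preds_b = np.array([1.5, 0.2, 2.9, 1.0])

sigma2 = 1.0  # assume a fixed Gaussian noise variance

def gaussian_nll(y, pred, sigma2):
    """Negative log-likelihood of y under independent N(pred_i, sigma2) noise."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - pred) ** 2) / (2 * sigma2)

def sse(y, pred):
    """Sum of squared errors."""
    return np.sum((y - pred) ** 2)

# The two criteria always rank candidate predictions the same way
for name, pred in [("a", preds_a), ("b", preds_b)]:
    print(f"candidate {name}:  NLL = {gaussian_nll(y, pred, sigma2):.3f}   SSE = {sse(y, pred):.3f}")
```

The NLL and the SSE differ only by a constant offset and a positive scale factor, so they share the same minimizer.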
📐 Step 3: Mathematical Foundation
Let’s derive MLEs for the two most common cases.
🎯 1. MLE for Bernoulli Distribution
Derivation Step-by-Step
A Bernoulli random variable $X$ takes value 1 (success) with probability $p$, and 0 (failure) with probability $1 - p$:
$$ P(X = x | p) = p^x (1 - p)^{1 - x} $$
For $n$ independent samples $x_1, x_2, \dots, x_n$, the likelihood is:
$$ L(p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} $$
Take the log:
$$ \ell(p) = \sum_{i=1}^{n} [x_i \log p + (1 - x_i)\log(1 - p)] $$
Differentiate and set to zero:
$$ \frac{d\ell}{dp} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} = 0 $$
Solve for $p$:
$$ \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Result: The MLE for $p$ is simply the sample mean — intuitive and elegant.
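As a quick sanity check, here is a sketch (assuming NumPy and SciPy, with a hypothetical sample of ten Bernoulli trials) that numerically minimizes the negative log-likelihood and recovers the same answer as the closed form above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli sample: 7 successes in 10 trials
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def neg_log_lik(p):
    # Negative of the Bernoulli log-likelihood derived above
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"Numerical MLE: {res.x:.4f}   sample mean: {x.mean():.4f}")  # both ≈ 0.7
```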
🔔 2. MLE for Gaussian (Normal) Distribution
Derivation Step-by-Step
For a Gaussian random variable with mean $\mu$ and variance $\sigma^2$:
$$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$
For data $x_1, \dots, x_n$, the log-likelihood is:
$$ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 $$
Step 1: Maximize w.r.t. $\mu$
Take the derivative and set it to zero:
$$ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 $$
→ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Step 2: Maximize w.r.t. $\sigma^2$
Substitute $\hat{\mu}$ and set derivative to zero:
$$ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \bar{X})^2 = 0 $$
→ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2$
Result: The MLEs for a normal distribution are:
$$ \hat{\mu} = \bar{X}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2 $$
Note: MLE variance divides by $n$, not $n - 1$ (the latter is for unbiased estimation).
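A short numerical check of these formulas, sketched with NumPy on synthetic Gaussian data:

```python
import numpy as np

# Synthetic data: true mu = 5, true sigma = 2
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = x.mean()                         # MLE for mu
sigma2_mle = np.mean((x - mu_hat) ** 2)   # MLE for sigma^2 (divides by n)
sigma2_unbiased = x.var(ddof=1)           # unbiased estimator (divides by n - 1)

print(mu_hat, sigma2_mle, sigma2_unbiased)
# NumPy's default ddof=0 reproduces the MLE exactly:
assert np.isclose(sigma2_mle, x.var())
```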
⚙️ 3. Numerical Optimization for Complex Models
When No Closed Form Exists
For complex likelihoods (e.g., logistic regression, mixture models), setting the derivatives to zero does not yield a closed-form solution. We then turn to numerical optimization:
- Gradient Descent: Iteratively update parameters using gradients of $\ell(\theta)$.
- Newton–Raphson / Quasi-Newton: Use curvature (Hessian) info for faster convergence.
- Expectation-Maximization (EM): Alternate between estimating latent variables and optimizing parameters (used in GMMs).
Convergence Tip: Monitor the log-likelihood during optimization; it should increase steadily and level off at convergence (EM guarantees it never decreases from one iteration to the next).
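Below is a minimal sketch of numerical MLE when no closed form exists, using logistic regression as the example (assuming NumPy/SciPy and synthetic data): the Bernoulli negative log-likelihood and its gradient are handed to a quasi-Newton optimizer (BFGS).

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic logistic-regression data (hypothetical example)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
probs = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(200) < probs).astype(float)

def neg_log_lik(w):
    """Bernoulli negative log-likelihood, written in a numerically stable form."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    """Gradient of the negative log-likelihood: X^T (sigmoid(Xw) - y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

res = minimize(neg_log_lik, x0=np.zeros(2), jac=grad, method="BFGS")
print("Estimated weights:", res.x)   # should land roughly near true_w
print("True weights:     ", true_w)
```

Supplying the analytic gradient is optional but speeds up and stabilizes convergence; without it, the optimizer falls back on finite-difference approximations.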
💭 Probing Question Insight
“What if the likelihood is non-convex? How would you ensure convergence?”
Non-convex likelihoods can have multiple peaks (local maxima). To handle this:
- Use multiple random initializations (to escape poor local optima).
- Apply stochastic optimization (e.g., SGD with noise helps explore).
- Smooth the likelihood surface (regularization or annealing).
- Visualize or monitor likelihood changes to detect stuck optimization.
In short: don’t trust one peak — test the terrain.
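A compact sketch of the multiple-restart idea (assuming NumPy/SciPy and a synthetic two-component Gaussian mixture, estimating only the two means): each restart may converge to a different local optimum, so we keep the solution with the best log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data from a two-component mixture; we estimate only the two means
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

def neg_log_lik(mu):
    # Equal-weight mixture of two unit-variance Gaussians: non-convex in (mu1, mu2)
    mix = 0.5 * norm.pdf(x, loc=mu[0], scale=1) + 0.5 * norm.pdf(x, loc=mu[1], scale=1)
    return -np.sum(np.log(mix))

# Multiple random initializations; keep the lowest negative log-likelihood
restarts = [
    minimize(neg_log_lik, x0=rng.uniform(-5, 5, size=2), method="Nelder-Mead")
    for _ in range(10)
]
best = min(restarts, key=lambda r: r.fun)
print("Estimated means:", np.sort(best.x))  # roughly [-2, 3]
```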
🧠 Step 4: Assumptions or Key Ideas
- Data points are independent and identically distributed (i.i.d.).
- Model correctly specifies the data-generating process.
- Likelihood is differentiable for optimization.
- Under regularity conditions, MLE is consistent (the estimate converges to the true parameter value as $n \to \infty$) and asymptotically efficient (it attains the Cramér–Rao lower bound on variance); see the quick numerical check below.
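A quick numerical check of consistency (a sketch assuming NumPy and a hypothetical Bernoulli parameter): the MLE drifts toward the true value as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(3)
true_p = 0.3

# The Bernoulli MLE (the sample mean) gets closer to the true p as n grows
for n in [10, 100, 10_000, 1_000_000]:
    sample = rng.random(n) < true_p   # n Bernoulli(true_p) draws
    print(f"n = {n:>9,d}   p_hat = {sample.mean():.4f}")
```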
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Universally applicable framework for parameter estimation.
- Yields interpretable, asymptotically optimal estimates.
- Provides the theoretical link to most modern loss functions in ML.
Limitations:
- Sensitive to outliers — maximizing likelihood can overfit noisy data.
- Optimization can stall in local maxima when the likelihood surface is non-convex.
- Relies on strong assumptions about model correctness; misspecification degrades the estimates.
🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- “Likelihood is the same as probability.” → No — the likelihood treats the observed data as fixed and varies the parameters; it is not a probability distribution over $\theta$ and need not sum to 1.
- “MLE always gives unbiased estimates.” → Not always — it’s asymptotically unbiased but can be biased in small samples (the $\frac{1}{n}$ variance estimate above is the classic example).
- “MLE always converges.” → Only if the optimization surface is well-behaved and numerically stable.
🧩 Step 7: Mini Summary
🧠 What You Learned: MLE finds the parameter values that make observed data most probable under a model.
⚙️ How It Works: By maximizing the log-likelihood — analytically for simple models, numerically for complex ones.
🎯 Why It Matters: MLE is the backbone of almost every statistical and machine learning model — it’s how models “learn from data.”