3.2. Maximum Likelihood Estimation (MLE)


🪄 Step 1: Intuition & Motivation

  • Core Idea: Maximum Likelihood Estimation (MLE) is how we make data teach us the best parameters for a statistical model. Instead of guessing, MLE finds the parameter values that make our observed data most probable under the model.

  • Simple Analogy: Imagine you’re a detective trying to find the most likely cause of a crime. You have clues (data) and multiple suspects (possible parameter values). MLE is like asking:

    “Which suspect makes the evidence I have most likely to occur?” The answer: the parameter that maximizes the likelihood of observing your data.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

Given a model with unknown parameters $\theta$ and observed data $x_1, x_2, \dots, x_n$, we define the likelihood function as:

$$ L(\theta) = P(x_1, x_2, \dots, x_n | \theta) $$

It measures how probable our data is for each possible value of $\theta$. The Maximum Likelihood Estimate is the value $\hat{\theta}$ that maximizes $L(\theta)$:

$$ \hat{\theta}_{MLE} = \arg \max_{\theta} L(\theta) $$

Because probabilities multiply, we usually take the log (to make it additive and numerically stable):

$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x_i | \theta) $$

Then we find $\hat{\theta}$ by setting $\frac{d\ell(\theta)}{d\theta} = 0$.
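
To make the recipe concrete, here is a minimal sketch (it assumes NumPy is installed, and the ten coin flips are invented for illustration) that evaluates the log-likelihood of a Bernoulli sample on a grid of candidate parameter values and picks the maximizer:

```python
import numpy as np

# Hypothetical data: 10 coin flips, 1 = heads (invented for illustration).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Candidate values of the parameter (avoid 0 and 1, where log is undefined).
p_grid = np.linspace(0.01, 0.99, 99)

# Log-likelihood of the whole sample for each candidate p:
# ell(p) = sum_i [ x_i * log(p) + (1 - x_i) * log(1 - p) ]
log_lik = np.array([np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
                    for p in p_grid])

p_hat = p_grid[np.argmax(log_lik)]
print(f"grid-search MLE: {p_hat:.2f}, sample mean: {x.mean():.2f}")
```

The grid maximizer lands on the sample mean, which is exactly what the calculus-based Bernoulli derivation in Step 3 predicts.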

Why It Works This Way

MLE chooses parameters that make the observed data most typical under the assumed model.

It doesn’t try to make predictions yet — it first says,

“If this model were true, what parameter values would have most likely produced what I actually saw?”

This is why it’s foundational for everything from regression to deep learning: training a model usually amounts to maximizing a likelihood (equivalently, minimizing a negative log-likelihood).

How It Fits in ML Thinking

MLE underlies most modern machine learning:

  • Linear regression → equivalent to MLE under Gaussian noise.
  • Logistic regression → MLE under Bernoulli likelihood.
  • Neural networks → trained by minimizing negative log-likelihood (cross-entropy).

Understanding MLE gives you the theoretical backbone for why loss functions look the way they do.
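
As a quick check of the first bullet, assume the standard linear-regression noise model $y_i = f_\theta(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. The negative log-likelihood is then

$$ -\ell(\theta) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2 $$

With $\sigma^2$ held fixed, the first term is a constant, so maximizing the likelihood over $\theta$ is exactly minimizing the sum of squared errors, i.e., the familiar least-squares/MSE objective.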


📐 Step 3: Mathematical Foundation

Let’s derive MLEs for the two most common cases.


🎯 1. MLE for Bernoulli Distribution

Derivation Step-by-Step

A Bernoulli random variable $X$ takes value 1 (success) with probability $p$, and 0 (failure) with probability $1 - p$:

$$ P(X = x | p) = p^x (1 - p)^{1 - x} $$

For $n$ independent samples $x_1, x_2, …, x_n$, the likelihood is:

$$ L(p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} $$

Take the log:

$$ \ell(p) = \sum_{i=1}^{n} [x_i \log p + (1 - x_i)\log(1 - p)] $$

Differentiate and set to zero:

$$ \frac{d\ell}{dp} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} = 0 $$

Solve for $p$:

$$ \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Result: The MLE for $p$ is simply the sample mean — intuitive and elegant.

The proportion of successes observed is the most likely value of $p$. If half your trials were “1,” the best estimate of $p$ is 0.5 — what else could be more natural?
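
If you want to double-check the algebra symbolically, here is a small sketch (assuming SymPy is available) that redoes the derivation in terms of the success count $s = \sum_i x_i$:

```python
import sympy as sp

# p = success probability, n = number of trials, s = number of successes.
p, n, s = sp.symbols('p n s', positive=True)

# Bernoulli log-likelihood written in terms of the success count s.
ell = s * sp.log(p) + (n - s) * sp.log(1 - p)

# Set d(ell)/dp = 0 and solve for p.
print(sp.solve(sp.Eq(sp.diff(ell, p), 0), p))  # -> [s/n], the sample mean
```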

🔔 2. MLE for Gaussian (Normal) Distribution

Derivation Step-by-Step

For a Gaussian random variable with mean $\mu$ and variance $\sigma^2$:

$$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

For data $x_1, …, x_n$, the log-likelihood is:

$$ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 $$

Step 1: Maximize w.r.t. $\mu$

Take derivative:

$$ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 $$

→ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum x_i$

Step 2: Maximize w.r.t. $\sigma^2$

Substitute $\hat{\mu}$ and set derivative to zero:

$$ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^4)}\sum_{i=1}^{n}(x_i - \bar{X})^2 = 0 $$

→ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2$

Result: The MLEs for a normal distribution are:

$$ \hat{\mu} = \bar{X}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2 $$

Note: MLE variance divides by $n$, not $n-1$ (the latter is for unbiased estimation).

MLE for a Gaussian says: pick the mean and spread that make the observed data points as “expected” as possible under a bell curve.
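
Here is a minimal numerical sketch of these two formulas (assuming NumPy; the sample is simulated rather than taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)  # simulated Gaussian sample

mu_hat = x.mean()                        # MLE of the mean: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of the variance: divides by n

print(f"mu_hat     = {mu_hat:.3f}")
print(f"sigma2_hat = {sigma2_hat:.3f}  (same as np.var(x, ddof=0))")
print(f"unbiased   = {np.var(x, ddof=1):.3f}  (divides by n - 1)")
```

With $n = 1000$ the two variance estimates differ only slightly, which is the point of the note above: the bias of the MLE variance shrinks as $n$ grows.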

⚙️ 3. Numerical Optimization for Complex Models

When No Closed Form Exists

For complex likelihoods (e.g., logistic regression, mixture models), setting the derivatives to zero does not yield a closed-form solution. We then turn to numerical optimization:

  • Gradient Descent: Iteratively update parameters using gradients of $\ell(\theta)$.
  • Newton–Raphson / Quasi-Newton: Use curvature (Hessian) info for faster convergence.
  • Expectation-Maximization (EM): Alternate between estimating latent variables and optimizing parameters (used in GMMs).

Convergence Tip: Always monitor the log-likelihood. EM is guaranteed to increase it monotonically; gradient-based methods should trend upward, and persistent decreases usually signal a bug or a step size that is too large.

Numerical MLE is like “hill climbing” — you keep taking steps uphill on the likelihood landscape until there’s nowhere higher to go.
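
As an illustration of this hill climbing in practice, here is a sketch that fits a logistic regression by minimizing its negative log-likelihood with SciPy's BFGS optimizer (the design matrix and true weights are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated data: 200 points, 2 features, labels drawn from a sigmoid model.
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def neg_log_lik(w):
    z = X @ w
    # Bernoulli negative log-likelihood with a sigmoid link, written in a
    # numerically stable form: sum_i [log(1 + e^{z_i}) - y_i * z_i].
    return np.sum(np.logaddexp(0.0, z) - y * z)

# Minimizing the negative log-likelihood == maximizing the likelihood.
result = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print("estimated weights:", result.x)  # should land close to true_w
```

This negative log-likelihood is the same quantity ML libraries call the cross-entropy loss; the optimizer is simply climbing the likelihood surface from a starting point of all zeros.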

💭 Probing Question Insight

“What if the likelihood is non-convex? How would you ensure convergence?”

Non-convex likelihoods can have multiple peaks (local maxima). To handle this:

  • Use multiple random initializations (to escape poor local optima).
  • Apply stochastic optimization (e.g., SGD with noise helps explore).
  • Smooth the likelihood surface (regularization or annealing).
  • Visualize or monitor likelihood changes to detect stuck optimization.

In short: don’t trust one peak — test the terrain.
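
A concrete instance of the “multiple random initializations” advice is fitting a Gaussian mixture with EM from several starting points and keeping the fit with the highest log-likelihood. A short sketch, assuming scikit-learn is available and using simulated bimodal data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Bimodal data: the mixture log-likelihood is non-convex in the parameters,
# so EM can stall in a poor local maximum depending on where it starts.
X = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 1, 300)]).reshape(-1, 1)

# n_init=10 runs EM from ten random initializations and keeps the best fit.
gmm = GaussianMixture(n_components=2, n_init=10, init_params="random",
                      random_state=0).fit(X)

print("component means:", gmm.means_.ravel())
print("avg log-likelihood of the best restart:", gmm.score(X))
```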


🧠 Step 4: Assumptions or Key Ideas

  • Data points are independent and identically distributed (i.i.d.).
  • Model correctly specifies the data-generating process.
  • Likelihood is differentiable for optimization.
  • Under regularity conditions, MLE is consistent (converges to the true value as $n$ grows) and asymptotically efficient (attains the lowest achievable variance, the Cramér–Rao bound, in the large-sample limit).

⚖️ Step 5: Strengths, Limitations & Trade-offs

  • Universally applicable framework for parameter estimation.
  • Yields interpretable, asymptotically optimal estimates.
  • Theoretical link to most modern loss functions in ML.
  • Sensitive to outliers — maximizing likelihood can overfit noisy data.
  • May converge to a poor local optimum (or fail to converge) on non-convex likelihood surfaces.
  • Requires strong assumptions about model correctness.

MLE prioritizes low bias over low variance: it finds the parameters that best explain this data, not necessarily future data. Regularized or Bayesian versions often improve robustness.
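
To unpack that last sentence: MAP estimation adds a prior $p(\theta)$ to the objective, and with a zero-mean Gaussian prior it reduces to L2-regularized (ridge-style) MLE:

$$ \hat{\theta}_{MAP} = \arg \max_{\theta}\left[\ell(\theta) + \log p(\theta)\right], \qquad \theta \sim \mathcal{N}(0, \tau^2 I) \;\Rightarrow\; \log p(\theta) = -\frac{1}{2\tau^2}\lVert\theta\rVert^2 + \text{const} $$

The penalty strength $1/\tau^2$ plays the role of the usual regularization coefficient $\lambda$.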

🚧 Step 6: Common Misunderstandings

  • “Likelihood is the same as probability.” → No. Likelihood treats the data as fixed and varies the parameters, whereas probability fixes the parameters and varies the data; $L(\theta)$ need not sum or integrate to 1 over $\theta$.
  • “MLE always gives unbiased estimates.” → Not always — it’s asymptotically unbiased but can be biased for small samples.
  • “MLE always converges.” → Only if the optimization surface is well-behaved and numerically stable.

🧩 Step 7: Mini Summary

🧠 What You Learned: MLE finds the parameter values that make observed data most probable under a model.

⚙️ How It Works: By maximizing the log-likelihood — analytically for simple models, numerically for complex ones.

🎯 Why It Matters: MLE is the backbone of almost every statistical and machine learning model — it’s how models “learn from data.”
