3.2. Maximum Likelihood Estimation (MLE)
🪄 Step 1: Intuition & Motivation
Core Idea: Maximum Likelihood Estimation (MLE) is how we make data teach us the best parameters for a statistical model. Instead of guessing, MLE finds the parameter values that make our observed data most probable under the model.
Simple Analogy: Imagine you’re a detective trying to find the most likely cause of a crime. You have clues (data) and multiple suspects (possible parameter values). MLE is like asking:
“Which suspect makes the evidence I have most likely to occur?” The answer: the parameter that maximizes the likelihood of observing your data.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Given a model with unknown parameters $\theta$ and observed data $x_1, x_2, \dots, x_n$, we define the likelihood function as:
$$ L(\theta) = P(x_1, x_2, \dots, x_n | \theta) $$
It measures how probable our data is for each possible value of $\theta$. The Maximum Likelihood Estimate is the value $\hat{\theta}$ that maximizes $L(\theta)$:
$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) $$
Because probabilities multiply, we usually take the log (to make it additive and numerically stable):
$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x_i | \theta) $$
Then we find $\hat{\theta}$ by setting $\frac{d\ell(\theta)}{d\theta} = 0$.
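To make the "likelihood as a function of $\theta$" idea concrete, here is a minimal sketch (assuming NumPy/SciPy and a hypothetical count dataset modeled as Poisson, a model not derived in this section) that evaluates $\ell(\lambda)$ on a grid and picks the maximizer; for the Poisson model this lands on the sample mean, which is also its closed-form MLE.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical event counts, assumed to follow a Poisson(lambda) model
x = np.array([2, 3, 1, 4, 2, 5, 3, 2])

# Evaluate the log-likelihood on a grid of candidate lambda values
lambdas = np.linspace(0.1, 10, 1000)
log_lik = np.array([
    np.sum(x * np.log(lam) - lam - gammaln(x + 1))  # sum_i log P(x_i | lambda)
    for lam in lambdas
])

lam_hat = lambdas[np.argmax(log_lik)]
print(f"Grid MLE: {lam_hat:.2f}   sample mean: {x.mean():.2f}")  # both ≈ 2.75
```

Grid search only works for one or two parameters, but it makes the definition tangible: every candidate $\theta$ gets a score, and the MLE is the candidate with the highest score.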
Why It Works This Way
MLE chooses parameters that make the observed data most typical under the assumed model.
It doesn’t try to make predictions yet — it first says,
“If this model were true, what parameter values would have most likely produced what I actually saw?”
This is why it’s foundational for everything from regression to deep learning: training a probabilistic model amounts to maximizing the likelihood (or, equivalently, minimizing the negative log-likelihood).
How It Fits in ML Thinking
MLE underlies most modern machine learning:
- Linear regression → equivalent to MLE under Gaussian noise.
- Logistic regression → MLE under Bernoulli likelihood.
- Neural networks → trained by minimizing negative log-likelihood (cross-entropy).
Understanding MLE gives you the theoretical backbone for why loss functions look the way they do.
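As a small illustration of the first bullet above, the following sketch (assuming NumPy, with hypothetical targets and candidate predictions) shows that a Gaussian negative log-likelihood with fixed noise variance ranks predictions exactly as the sum of squared errors does, which is why least-squares regression corresponds to MLE under Gaussian noise.

```python
import numpy as np

# Hypothetical regression targets and two candidate prediction vectors
y = np.array([1.2, 0.7, 2.3, 1.9])
preds_a = np.array([1.0, 0.8, 2.0, 2.1])
preds_b = np.array([1.5, 0.2, 2.9, 1.0])

sigma2 = 1.0  # assume a fixed Gaussian noise variance

def gaussian_nll(y, pred, sigma2):
    """Negative log-likelihood of y under independent N(pred_i, sigma2) noise."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - pred) ** 2) / (2 * sigma2)

def sse(y, pred):
    """Sum of squared errors."""
    return np.sum((y - pred) ** 2)

# The two criteria always rank candidate predictions the same way
for name, pred in [("a", preds_a), ("b", preds_b)]:
    print(f"candidate {name}:  NLL = {gaussian_nll(y, pred, sigma2):.3f}   SSE = {sse(y, pred):.3f}")
```

The NLL and the SSE differ only by a constant offset and a positive scale factor, so they share the same minimizer.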
📐 Step 3: Mathematical Foundation
Let’s derive MLEs for the two most common cases.
🎯 1. MLE for Bernoulli Distribution
Derivation Step-by-Step
A Bernoulli random variable $X$ takes value 1 (success) with probability $p$, and 0 (failure) with probability $1 - p$:
$$ P(X = x | p) = p^x (1 - p)^{1 - x} $$
For $n$ independent samples $x_1, x_2, \dots, x_n$, the likelihood is:
$$ L(p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} $$
Take the log:
$$ \ell(p) = \sum_{i=1}^{n} [x_i \log p + (1 - x_i)\log(1 - p)] $$
Differentiate and set to zero:
$$ \frac{d\ell}{dp} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} = 0 $$
Solve for $p$:
$$ \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Result: The MLE for $p$ is simply the sample mean — intuitive and elegant.
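As a quick sanity check, here is a sketch (assuming NumPy and SciPy, with a hypothetical sample of ten Bernoulli trials) that numerically minimizes the negative log-likelihood and recovers the same answer as the closed form above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli sample: 7 successes in 10 trials
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def neg_log_lik(p):
    # Negative of the Bernoulli log-likelihood derived above
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"Numerical MLE: {res.x:.4f}   sample mean: {x.mean():.4f}")  # both ≈ 0.7
```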
🔔 2. MLE for Gaussian (Normal) Distribution
Derivation Step-by-Step
For a Gaussian random variable with mean $\mu$ and variance $\sigma^2$:
$$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$
For data $x_1, \dots, x_n$, the log-likelihood is:
$$ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 $$
Step 1: Maximize w.r.t. $\mu$
Take the derivative and set it to zero:
$$ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 $$
→ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Step 2: Maximize w.r.t. $\sigma^2$
Substitute $\hat{\mu}$ and set derivative to zero:
$$ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \bar{X})^2 = 0 $$
→ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2$
Result: The MLEs for a normal distribution are:
$$ \hat{\mu} = \bar{X}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2 $$
Note: MLE variance divides by $n$, not $n - 1$ (the latter is for unbiased estimation).
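A short numerical check of these formulas, sketched with NumPy on synthetic Gaussian data:

```python
import numpy as np

# Synthetic data: true mu = 5, true sigma = 2
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = x.mean()                         # MLE for mu
sigma2_mle = np.mean((x - mu_hat) ** 2)   # MLE for sigma^2 (divides by n)
sigma2_unbiased = x.var(ddof=1)           # unbiased estimator (divides by n - 1)

print(mu_hat, sigma2_mle, sigma2_unbiased)
# NumPy's default ddof=0 reproduces the MLE exactly:
assert np.isclose(sigma2_mle, x.var())
```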
⚙️ 3. Numerical Optimization for Complex Models
When No Closed Form Exists
For complex likelihoods (e.g., logistic regression, mixture models), setting the derivatives to zero does not yield a closed-form solution. We then turn to numerical optimization:
- Gradient Descent: Iteratively update parameters using gradients of $\ell(\theta)$.
- Newton–Raphson / Quasi-Newton: Use curvature (Hessian) info for faster convergence.
- Expectation-Maximization (EM): Alternate between estimating latent variables and optimizing parameters (used in GMMs).
Convergence Tip: Monitor the log-likelihood during optimization; it should increase steadily and level off at convergence (EM guarantees it never decreases from one iteration to the next).
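Below is a minimal sketch of numerical MLE when no closed form exists, using logistic regression as the example (assuming NumPy/SciPy and synthetic data): the Bernoulli negative log-likelihood and its gradient are handed to a quasi-Newton optimizer (BFGS).

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic logistic-regression data (hypothetical example)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
probs = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(200) < probs).astype(float)

def neg_log_lik(w):
    """Bernoulli negative log-likelihood, written in a numerically stable form."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    """Gradient of the negative log-likelihood: X^T (sigmoid(Xw) - y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

res = minimize(neg_log_lik, x0=np.zeros(2), jac=grad, method="BFGS")
print("Estimated weights:", res.x)   # should land roughly near true_w
print("True weights:     ", true_w)
```

Supplying the analytic gradient is optional but speeds up and stabilizes convergence; without it, the optimizer falls back on finite-difference approximations.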
💭 Probing Question Insight
“What if the likelihood is non-convex? How would you ensure convergence?”
Non-convex likelihoods can have multiple peaks (local maxima). To handle this:
- Use multiple random initializations (to escape poor local optima).
- Apply stochastic optimization (e.g., SGD with noise helps explore).
- Smooth the likelihood surface (regularization or annealing).
- Visualize or monitor likelihood changes to detect stuck optimization.
In short: don’t trust one peak — test the terrain.
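A compact sketch of the multiple-restart idea (assuming NumPy/SciPy and a synthetic two-component Gaussian mixture, estimating only the two means): each restart may converge to a different local optimum, so we keep the solution with the best log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data from a two-component mixture; we estimate only the two means
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

def neg_log_lik(mu):
    # Equal-weight mixture of two unit-variance Gaussians: non-convex in (mu1, mu2)
    mix = 0.5 * norm.pdf(x, loc=mu[0], scale=1) + 0.5 * norm.pdf(x, loc=mu[1], scale=1)
    return -np.sum(np.log(mix))

# Multiple random initializations; keep the lowest negative log-likelihood
restarts = [
    minimize(neg_log_lik, x0=rng.uniform(-5, 5, size=2), method="Nelder-Mead")
    for _ in range(10)
]
best = min(restarts, key=lambda r: r.fun)
print("Estimated means:", np.sort(best.x))  # roughly [-2, 3]
```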
🧠 Step 4: Assumptions or Key Ideas
- Data points are independent and identically distributed (i.i.d.).
- Model correctly specifies the data-generating process.
- Likelihood is differentiable for optimization.
- Under regularity conditions, MLE is consistent (the estimate converges to the true parameter value as $n \to \infty$) and asymptotically efficient (it attains the Cramér–Rao lower bound on variance); see the quick numerical check below.
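A quick numerical check of consistency (a sketch assuming NumPy and a hypothetical Bernoulli parameter): the MLE drifts toward the true value as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(3)
true_p = 0.3

# The Bernoulli MLE (the sample mean) gets closer to the true p as n grows
for n in [10, 100, 10_000, 1_000_000]:
    sample = rng.random(n) < true_p   # n Bernoulli(true_p) draws
    print(f"n = {n:>9,d}   p_hat = {sample.mean():.4f}")
```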
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Universally applicable framework for parameter estimation.
- Yields interpretable, asymptotically optimal estimates.
- Provides the theoretical link to most modern loss functions in ML.
Limitations:
- Sensitive to outliers — maximizing likelihood can overfit noisy data.
- Optimization can stall in local maxima when the likelihood surface is non-convex.
- Relies on strong assumptions about model correctness; misspecification degrades the estimates.
🚧 Step 6: Common Misunderstandings
🚨 Common Misunderstandings (Click to Expand)
- “Likelihood is the same as probability.” → No — the likelihood treats the observed data as fixed and varies the parameters; it is not a probability distribution over $\theta$ and need not sum to 1.
- “MLE always gives unbiased estimates.” → Not always — it’s asymptotically unbiased but can be biased in small samples (the $\frac{1}{n}$ variance estimate above is the classic example).
- “MLE always converges.” → Only if the optimization surface is well-behaved and numerically stable.
🧩 Step 7: Mini Summary
🧠 What You Learned: MLE finds the parameter values that make observed data most probable under a model.
⚙️ How It Works: By maximizing the log-likelihood — analytically for simple models, numerically for complex ones.
🎯 Why It Matters: MLE is the backbone of almost every statistical and machine learning model — it’s how models “learn from data.”