3.4. Sampling & Estimation
🪄 Step 1: Intuition & Motivation
Core Idea: In the real world, we almost never have all the data. We only get a sample — a small peek into a bigger population. Sampling and estimation are about using that limited sample to make our best, most honest guesses about the underlying truth.
Simple Analogy: Imagine you’re tasting a pot of soup. You take one spoonful — that’s your sample. If you mix the soup well (random sampling), the spoonful represents the whole pot. Your guess of how salty the soup is — that’s your estimator.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When we draw data samples (e.g., customer ratings, test scores), we treat each observation as a random variable from some true but unknown distribution.
Our job is to estimate parameters of that distribution — e.g., mean, variance, or model weights.
For example (a short numerical sketch follows this list):
- In a Gaussian distribution, estimate $\mu$ and $\sigma^2$.
- In a Bernoulli trial, estimate $p$ (probability of success).
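To make this concrete, here is a minimal sketch (using NumPy, with made-up "true" parameters we would never actually know) that draws samples and computes the estimates from the data alone:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "true" parameters, chosen only so we can check our estimates.
true_mu, true_sigma2, true_p = 5.0, 4.0, 0.3
n = 1_000

# Gaussian sample: estimate mu and sigma^2 from the sample alone.
x = rng.normal(loc=true_mu, scale=np.sqrt(true_sigma2), size=n)
mu_hat = x.mean()            # estimate of mu
sigma2_hat = x.var(ddof=1)   # unbiased estimate of sigma^2 (divides by n-1)

# Bernoulli sample: estimate p, the probability of success.
b = rng.binomial(n=1, p=true_p, size=n)
p_hat = b.mean()             # fraction of successes

print(f"mu_hat={mu_hat:.3f} (true {true_mu})")
print(f"sigma2_hat={sigma2_hat:.3f} (true {true_sigma2})")
print(f"p_hat={p_hat:.3f} (true {true_p})")
```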
But we can’t just take any guess — we want estimators that are:
- Unbiased (correct on average),
- Consistent (improves with more data), and
- Efficient (lowest possible variance).
Why It Works This Way
Every estimate involves uncertainty. The goal isn’t perfection — it’s balancing bias and variance.
- Bias = how far the average estimate is from the true value.
- Variance = how much estimates fluctuate from sample to sample.
Too little data → high variance. Too simple a model → high bias.
This is the same logic behind underfitting vs. overfitting in ML.
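Here is a rough simulation of that trade-off (the 0.5 shrinkage factor below is arbitrary, picked purely for illustration): on many tiny samples, a deliberately biased "shrunk" mean can beat the unbiased sample mean on total error, because its variance is so much lower.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 2.0, 5.0, 5, 20_000

# Draw many tiny samples and apply both estimators to each one.
samples = rng.normal(true_mu, sigma, size=(trials, n))
plain = samples.mean(axis=1)     # unbiased, but noisy when n is tiny
shrunk = 0.5 * plain             # biased toward 0, but much less noisy

def report(name, est):
    bias = est.mean() - true_mu
    var = est.var()
    print(f"{name:>6}: bias={bias:+.3f}  variance={var:.3f}  mse={bias**2 + var:.3f}")

report("plain", plain)
report("shrunk", shrunk)   # lower total error despite the bias
```

Which estimator wins depends on the true mean, the noise level, and n; the point is only that bias and variance trade off against each other.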
How It Fits in ML Thinking
Machine learning is built entirely on estimation:
- Fitting a model = estimating parameters from data.
- Loss functions = quantifying estimation error.
- Regularization = intentionally adding bias to reduce variance.
Even advanced models (like neural networks) secretly perform Maximum Likelihood Estimation (MLE) or its cousin, Maximum A Posteriori Estimation (MAP).
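One way to see that connection, as a toy sketch rather than how any library actually trains a model: for binary labels, the average cross-entropy loss is the negative Bernoulli log-likelihood scaled by 1/n, so the constant probability that minimizes it is the plain sample proportion (the Bernoulli MLE).

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.7, size=500)   # synthetic binary labels

# Average binary cross-entropy of predicting a constant probability p:
#   -(1/n) * sum_i [ y_i*log(p) + (1 - y_i)*log(1 - p) ]
#   = -( y_bar*log(p) + (1 - y_bar)*log(1 - p) )
grid = np.linspace(0.01, 0.99, 981)
cross_entropy = -(y.mean() * np.log(grid) + (1 - y.mean()) * np.log(1 - grid))

print(f"cross-entropy minimizer:  {grid[np.argmin(cross_entropy)]:.3f}")
print(f"sample proportion (MLE):  {y.mean():.3f}")   # the same value
```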
📐 Step 3: Mathematical Foundation
Sampling & Sample Statistics
Given random variables $X_1, X_2, …, X_n$ sampled i.i.d. from a distribution with unknown parameter $\theta$:
Sample mean:
$$ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i $$→ Estimator for population mean $\mu$.
Sample variance:
$$ s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 $$→ Estimator for population variance $\sigma^2$.
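The $1/(n-1)$ divisor (Bessel's correction) is what makes $s^2$ unbiased. A quick simulation sketch, assuming a known true variance purely so we can check both versions against it:

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 9.0, 5, 50_000

# Many small samples from the same distribution.
x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
var_div_n = x.var(axis=1, ddof=0)     # divide by n     (biased low)
var_div_n1 = x.var(axis=1, ddof=1)    # divide by n - 1 (unbiased)

print(f"true variance:           {true_var}")
print(f"average divide-by-n:     {var_div_n.mean():.3f}")    # ~ 9 * (n-1)/n = 7.2
print(f"average divide-by-(n-1): {var_div_n1.mean():.3f}")   # ~ 9.0
```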
Maximum Likelihood Estimation (MLE)
We choose parameter $\theta$ that maximizes the probability of observing our data:
$$ \hat{\theta}_{MLE} = \arg\max_{\theta} P(X_1, X_2, ..., X_n | \theta) $$Equivalently, maximize the log-likelihood (for easier math):
$$ \ell(\theta) = \sum_{i=1}^n \log P(X_i | \theta) $$Example: For Gaussian data with unknown mean $\mu$,
$$ \ell(\mu) = -\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2 + \text{const.} $$Maximizing $\ell(\mu)$ (the constant doesn’t depend on $\mu$) gives $\hat{\mu} = \bar{X}$ — the sample mean.
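You can verify this numerically with a brute-force sketch (a simple grid search; any optimizer would do): the $\mu$ that maximizes $\ell(\mu)$ lands on the sample mean, up to grid resolution.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                               # assume sigma is known
x = rng.normal(10.0, sigma, size=200)

def log_likelihood(mu):
    # Gaussian log-likelihood of mu, dropping terms that don't depend on mu.
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2)

grid = np.linspace(x.min(), x.max(), 10_001)
mu_mle = grid[np.argmax([log_likelihood(m) for m in grid])]

print(f"grid-search MLE: {mu_mle:.4f}")
print(f"sample mean:     {x.mean():.4f}")  # agrees up to the grid spacing
```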
Bias, Variance & Mean Squared Error
For an estimator $\hat{\theta}$ of a true parameter $\theta$:
- Bias: $$ Bias(\hat{\theta}) = E[\hat{\theta}] - \theta $$
- Variance: $$ Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$
- Mean Squared Error (MSE): $$ MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Bias(\hat{\theta})^2 + Var(\hat{\theta}) $$ (checked numerically in the sketch below)
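A quick sanity check of that decomposition (simulated, so expect a little Monte Carlo noise), using the divide-by-$n$ variance estimator so that both the bias and the variance terms are nonzero:

```python
import numpy as np

rng = np.random.default_rng(4)
true_var, n, trials = 4.0, 8, 100_000

# The divide-by-n variance estimator applied to many small samples.
est = rng.normal(0.0, np.sqrt(true_var), size=(trials, n)).var(axis=1, ddof=0)

bias = est.mean() - true_var
variance = est.var()
mse_direct = np.mean((est - true_var) ** 2)

print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"direct MSE        = {mse_direct:.4f}")   # the two agree
```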
Estimator Properties
- Consistency: $\hat{\theta}_n \to \theta$ as $n \to \infty$ (the estimate converges to the truth; see the sketch after this list).
- Efficiency: Among unbiased estimators, the one with lowest variance.
- Sufficiency: Uses all information in the sample about $\theta$.
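Consistency is easy to watch happen: keep enlarging the sample and the estimate settles onto the truth. A sketch with an arbitrary true mean of 3 and a deliberately large noise scale:

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu = 3.0
x = rng.normal(true_mu, 10.0, size=1_000_000)   # one big pool of noisy draws

# The sample mean over ever-larger prefixes of the data.
for n in (10, 100, 10_000, 1_000_000):
    estimate = x[:n].mean()
    print(f"n={n:>9,}  sample mean={estimate:.4f}  |error|={abs(estimate - true_mu):.4f}")
```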
Confidence Intervals
A confidence interval (CI) gives a range of plausible values for $\theta$:
$$ \hat{\theta} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$where $z_{\alpha/2}$ is the standard normal critical value (≈ 1.96 for 95%) and $\sigma$ is the population standard deviation (in practice usually replaced by the sample $s$). Example: 95% CI for the mean = “We’re 95% confident that the true mean lies within this range.”
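Here is a sketch of both halves of that statement, assuming $\sigma$ is known so it matches the formula above: build one 95% interval, then repeat the experiment many times to confirm that roughly 95% of such intervals cover the true mean.

```python
import numpy as np

rng = np.random.default_rng(6)
true_mu, sigma, n = 50.0, 12.0, 40
z = 1.96                                   # z_{alpha/2} for a 95% interval
half_width = z * sigma / np.sqrt(n)

# One interval from one sample.
x = rng.normal(true_mu, sigma, size=n)
print(f"95% CI: [{x.mean() - half_width:.2f}, {x.mean() + half_width:.2f}]")

# Coverage over many repeated samples.
means = rng.normal(true_mu, sigma, size=(100_000, n)).mean(axis=1)
coverage = np.mean(np.abs(means - true_mu) <= half_width)
print(f"fraction of intervals containing the true mean: {coverage:.3f}")   # ~0.95
```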
Hypothesis Testing (Brief Intro)
We test claims about population parameters.
- Null hypothesis ($H_0$): Default assumption (e.g., “no difference in means”).
- Alternative ($H_1$): Competing claim.
- p-value: Probability of observing data at least as extreme as what we actually saw, assuming $H_0$ is true.
If $p < \alpha$ (significance level), we reject $H_0$.
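A minimal worked example using scipy.stats (the data is simulated with a true mean of 105, so the test will typically reject $H_0: \mu = 100$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=105.0, scale=10.0, size=50)   # data whose true mean is 105

# H0: the population mean is 100.  H1: it is not.
result = stats.ttest_1samp(sample, popmean=100.0)
print(f"t-statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")

alpha = 0.05
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```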
🧠 Step 4: Key Ideas
- Sampling provides a practical window into the population.
- MLE finds parameters that make observed data most probable.
- Bias–variance tradeoff defines the sweet spot between under- and overfitting.
- Confidence intervals and p-values quantify uncertainty.
- Consistency ensures estimates improve with more data.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- MLE has desirable properties: consistent, efficient, asymptotically normal.
- Provides clear probabilistic interpretation of estimation.
- Links directly to most machine learning objectives (e.g., cross-entropy, logistic regression).
Limitations & Trade-offs:
- Sensitive to small samples and outliers.
- Confidence intervals assume approximate normality (can fail for skewed data).
- Biased estimators may perform better in high-variance regimes.
🚧 Step 6: Common Misunderstandings
- Myth: The sample mean always equals the true mean. → Truth: It’s only an unbiased estimator — true on average, not in every sample.
- Myth: Unbiased estimators are always better. → Truth: A small bias may drastically lower variance — total error matters more.
- Myth: 95% confidence = 95% probability that the true value lies in this particular interval. → Truth: It means that 95% of intervals constructed this way would contain the true value over repeated sampling; it’s a statement about the procedure, not about any single interval.
🧩 Step 7: Mini Summary
🧠 What You Learned: Sampling lets us infer truths from limited data; estimation formalizes that inference through methods like MLE.
⚙️ How It Works: Estimators balance bias and variance to minimize total error. Confidence intervals and hypothesis testing quantify how reliable our inferences are.
🎯 Why It Matters: All of data science is inference under uncertainty. Understanding sampling and estimation is understanding how your model trusts its own predictions.