3.4. Sampling & Estimation


🪄 Step 1: Intuition & Motivation

  • Core Idea: In the real world, we almost never have all the data. We only get a sample — a small peek into a bigger population. Sampling and estimation are about using that limited sample to make our best, most honest guesses about the underlying truth.

  • Simple Analogy: Imagine you’re tasting a pot of soup. You take one spoonful — that’s your sample. If you mix the soup well (random sampling), the spoonful represents the whole pot. Your guess of how salty the soup is — that’s your estimator.


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

When we draw data samples (e.g., customer ratings, test scores), we treat each observation as a random variable from some true but unknown distribution.

Our job is to estimate parameters of that distribution — e.g., mean, variance, or model weights.

For example:

  • In a Gaussian distribution, estimate $\mu$ and $\sigma^2$.
  • In a Bernoulli trial, estimate $p$ (probability of success).

But we can’t just take any guess — we want estimators that are:

  1. Unbiased (correct on average),
  2. Consistent (improves with more data), and
  3. Efficient (lowest possible variance).
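
As a tiny preview of what such estimation looks like in practice, here is a minimal sketch (assuming NumPy is available; the true p and the sample size are invented purely for the simulation) that estimates a Bernoulli success probability by averaging the observed 0/1 outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3                                   # unknown in real life; fixed here only to simulate data
flips = rng.binomial(n=1, p=true_p, size=500)  # 500 observed successes/failures

p_hat = flips.mean()                           # the estimator: fraction of observed successes
print(f"p_hat = {p_hat:.3f} (true p = {true_p})")
```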

Why It Works This Way

Every estimate involves uncertainty. The goal isn’t perfection — it’s balancing bias and variance.

  • Bias = how far the average estimate is from the true value.
  • Variance = how much estimates fluctuate from sample to sample.

Too little data → high variance. Too simple a model → high bias.

This is the same logic behind underfitting vs. overfitting in ML.


How It Fits in ML Thinking

Machine learning is built entirely on estimation:

  • Fitting a model = estimating parameters from data.
  • Loss functions = quantifying estimation error.
  • Regularization = intentionally adding bias to reduce variance.

Even advanced models (like neural networks) secretly perform Maximum Likelihood Estimation (MLE) or its cousin, Maximum A Posteriori Estimation (MAP).


📐 Step 3: Mathematical Foundation

Sampling & Sample Statistics

Given random variables $X_1, X_2, …, X_n$ sampled i.i.d. from a distribution with unknown parameter $\theta$:

  • Sample mean:

    $$ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i $$

    → Estimator for population mean $\mu$.

  • Sample variance:

    $$ s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 $$

    → Estimator for population variance $\sigma^2$.

We replace population quantities (unknown) with sample equivalents (observable).
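
In code, both estimators are one-liners. A minimal sketch, assuming NumPy (the simulated Gaussian data below simply stands in for real observations):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=3.0, size=200)  # pretend these are observed data

x_bar = x.mean()      # sample mean: estimator for mu
s2 = x.var(ddof=1)    # sample variance with the 1/(n-1) factor: estimator for sigma^2

# Note: np.var defaults to ddof=0, i.e. the 1/n version, which is slightly biased.
print(f"x_bar = {x_bar:.3f} (true mu = 5.0)")
print(f"s2    = {s2:.3f} (true sigma^2 = 9.0)")
```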

Maximum Likelihood Estimation (MLE)

We choose parameter $\theta$ that maximizes the probability of observing our data:

$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(X_1, X_2, \dots, X_n \mid \theta) $$

Equivalently, maximize the log-likelihood (for easier math):

$$ \ell(\theta) = \sum_{i=1}^n \log P(X_i | \theta) $$

Example: For Gaussian data with unknown mean $\mu$ and known variance $\sigma^2$, the log-likelihood is, up to an additive constant,

$$ \ell(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 $$

Maximizing $\ell(\mu)$ gives $\hat{\mu} = \bar{X}$ — the sample mean.

MLE “fits the curve” that makes your observed data most probable — like finding the peak of the likelihood mountain.
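
To make "finding the peak of the likelihood mountain" concrete, here is a minimal numerical sketch (assuming NumPy and SciPy; the data are simulated with a known $\sigma$ purely for illustration). It minimizes the negative log-likelihood over $\mu$ and checks that the optimum lands on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
sigma = 2.0                                  # assume sigma is known
x = rng.normal(loc=3.0, scale=sigma, size=100)

def neg_log_likelihood(mu):
    # Negative log-likelihood of i.i.d. Gaussian data (constants dropped);
    # minimizing this is the same as maximizing the log-likelihood.
    return np.sum((x - mu) ** 2) / (2 * sigma**2)

result = minimize_scalar(neg_log_likelihood)
print(f"MLE of mu  : {result.x:.4f}")
print(f"Sample mean: {x.mean():.4f}")        # matches the MLE, as the math predicts
```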

Bias, Variance & Mean Squared Error

For an estimator $\hat{\theta}$ of a true parameter $\theta$:

  • Bias: $$ Bias(\hat{\theta}) = E[\hat{\theta}] - \theta $$
  • Variance: $$ Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$
  • Mean Squared Error (MSE): $$ MSE(\hat{\theta}) = Bias(\hat{\theta})^2 + Var(\hat{\theta}) $$

MSE balances accuracy (low bias) and stability (low variance). In ML terms: regularization intentionally adds bias to tame variance.
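
A quick Monte Carlo sketch makes the decomposition tangible (assuming NumPy; the sample size and number of trials are arbitrary choices): compare the unbiased $1/(n-1)$ variance estimator with the biased $1/n$ version and watch bias, variance, and MSE trade off:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0           # sigma^2 of the simulated population
n, trials = 10, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
unbiased = samples.var(axis=1, ddof=1)   # 1/(n-1) estimator
biased = samples.var(axis=1, ddof=0)     # 1/n estimator (the MLE)

for name, est in [("unbiased", unbiased), ("biased", biased)]:
    bias = est.mean() - true_var
    var = est.var()
    mse = np.mean((est - true_var) ** 2)
    print(f"{name:8s}  bias={bias:+.3f}  variance={var:.3f}  MSE={mse:.3f}")
# Typically the biased estimator shows a small negative bias but a lower
# variance, and ends up with the smaller total MSE at this sample size.
```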

Estimator Properties

  • Consistency: $\hat{\theta}_n \to \theta$ as $n \to \infty$ (the estimate converges to the truth).
  • Efficiency: Among unbiased estimators, the one with the lowest variance.
  • Sufficiency: Uses all the information in the sample about $\theta$.

A good estimator doesn’t just hit the target sometimes — it gets closer, steadier, and sharper as you collect more data.
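
Consistency is easy to check empirically. A minimal sketch (assuming NumPy; the sample sizes are arbitrary) showing the sample mean homing in on the true mean as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = 2.5

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.normal(loc=true_mu, scale=1.0, size=n)
    print(f"n={n:>6d}  x_bar={x.mean():.4f}  |error|={abs(x.mean() - true_mu):.4f}")
# The absolute error tends to shrink roughly like 1/sqrt(n):
# the law of large numbers at work.
```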

Confidence Intervals

A confidence interval (CI) gives a range of plausible values for a parameter. For the population mean $\mu$ (with $\sigma$ known, or replaced by the sample standard deviation $s$ for large $n$):

$$ \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$

Example: 95% CI for the mean = “We’re 95% confident that the true mean lies within this range.”

CI doesn’t mean “95% chance the mean is here.” It means “if we repeated the experiment many times, 95% of such intervals would contain the true mean.”
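
A minimal sketch of computing such an interval (assuming NumPy and SciPy; this uses the normal approximation with the sample standard deviation, which is reasonable for moderately large $n$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=50.0, scale=8.0, size=200)   # stand-in for observed data

alpha = 0.05
x_bar = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))            # standard error of the mean
z = stats.norm.ppf(1 - alpha / 2)               # ~1.96 for a 95% interval

lower, upper = x_bar - z * se, x_bar + z * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
# If we regenerated the data many times, about 95% of such intervals
# would cover the true mean (50.0 in this simulation).
```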

Hypothesis Testing (Brief Intro)

We test claims about population parameters.

  • Null hypothesis ($H_0$): Default assumption (e.g., “no difference in means”).
  • Alternative ($H_1$): Competing claim.
  • p-value: Probability of observing data as extreme as (or more extreme than) what we actually saw, assuming $H_0$ is true.

If $p < \alpha$ (significance level), we reject $H_0$.

Hypothesis testing isn’t about proving something true — it’s about checking if the evidence is strong enough to doubt the null.
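
For instance, a minimal two-sample test sketch (assuming NumPy and SciPy; the simulated group scores and the 0.05 threshold are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(loc=100.0, scale=15.0, size=60)   # e.g., control scores
group_b = rng.normal(loc=108.0, scale=15.0, size=60)   # e.g., treatment scores

# H0: the two groups have equal means; H1: they do not.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the evidence suggests the means differ.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```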

🧠 Step 4: Key Ideas

  • Sampling provides a practical window into the population.
  • MLE finds parameters that make observed data most probable.
  • Bias–variance tradeoff defines the sweet spot between under- and overfitting.
  • Confidence intervals and p-values quantify uncertainty.
  • Consistency ensures estimates improve with more data.

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • MLE has desirable properties: consistency, efficiency, asymptotic normality.
  • Provides a clear probabilistic interpretation of estimation.
  • Links directly to most machine learning objectives (e.g., cross-entropy, logistic regression).

Limitations & trade-offs:

  • Sensitive to small samples and outliers.
  • Confidence intervals assume approximate normality (can fail for skewed data).
  • Biased estimators may perform better in high-variance regimes.

Sometimes adding controlled bias (regularization, shrinkage) reduces overall error — like Ridge regression. Bias isn’t always bad; it’s often strategic.
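
A minimal sketch of that idea (assuming NumPy; the synthetic collinear data, the $\lambda$ value, and the sample sizes are all made-up choices): compare ordinary least squares with a closed-form ridge fit on small, noisy training data.

```python
import numpy as np

rng = np.random.default_rng(9)
n_train, n_test, d = 30, 500, 10
true_w = rng.normal(size=d)

def make_data(n):
    # A shared latent factor plus small feature noise -> nearly collinear columns.
    latent = rng.normal(size=(n, 1))
    X = latent + 0.1 * rng.normal(size=(n, d))
    y = X @ true_w + rng.normal(scale=3.0, size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def fit(X, y, lam):
    # Closed-form ridge: (X^T X + lam * I)^{-1} X^T y; lam = 0 is plain least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0]:
    w_hat = fit(X_tr, y_tr, lam)
    test_mse = np.mean((X_te @ w_hat - y_te) ** 2)
    print(f"lambda={lam:.1f}  test MSE={test_mse:.3f}")
# On small, nearly collinear training data like this, the ridge fit (lambda > 0)
# usually ends up with the lower test error: a little bias buys a lot less variance.
```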

🚧 Step 6: Common Misunderstandings

  • Myth: The sample mean always equals the true mean. → Truth: It’s only an unbiased estimator — true on average, not in every sample.
  • Myth: Unbiased estimators are always better. → Truth: A small bias may drastically lower variance — total error matters more.
  • Myth: 95% confidence = 95% probability. → Truth: It’s about repeated sampling frequency, not individual belief.

🧩 Step 7: Mini Summary

🧠 What You Learned: Sampling lets us infer truths from limited data; estimation formalizes that inference through methods like MLE.

⚙️ How It Works: Estimators balance bias and variance to minimize total error. Confidence intervals and hypothesis testing quantify how reliable our inferences are.

🎯 Why It Matters: All of data science is inference under uncertainty. Understanding sampling and estimation is understanding how your model trusts its own predictions.
