3.4. Sampling & Estimation
🪄 Step 1: Intuition & Motivation
Core Idea: In the real world, we almost never have all the data. We only get a sample — a small peek into a bigger population. Sampling and estimation are about using that limited sample to make our best, most honest guesses about the underlying truth.
Simple Analogy: Imagine you’re tasting a pot of soup. You take one spoonful — that’s your sample. If you mix the soup well (random sampling), the spoonful represents the whole pot. Your guess of how salty the soup is — that’s your estimator.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
When we draw data samples (e.g., customer ratings, test scores), we treat each observation as a random variable from some true but unknown distribution.
Our job is to estimate parameters of that distribution — e.g., mean, variance, or model weights.
For example (a short numerical sketch follows this list):
- In a Gaussian distribution, estimate $\mu$ and $\sigma^2$.
- In a Bernoulli trial, estimate $p$ (probability of success).
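To make this concrete, here is a minimal sketch (using NumPy, with made-up "true" parameters we would never actually know) that draws samples and computes the estimates from the data alone:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "true" parameters, chosen only so we can check our estimates.
true_mu, true_sigma2, true_p = 5.0, 4.0, 0.3
n = 1_000

# Gaussian sample: estimate mu and sigma^2 from the sample alone.
x = rng.normal(loc=true_mu, scale=np.sqrt(true_sigma2), size=n)
mu_hat = x.mean()            # estimate of mu
sigma2_hat = x.var(ddof=1)   # unbiased estimate of sigma^2 (divides by n-1)

# Bernoulli sample: estimate p, the probability of success.
b = rng.binomial(n=1, p=true_p, size=n)
p_hat = b.mean()             # fraction of successes

print(f"mu_hat={mu_hat:.3f} (true {true_mu})")
print(f"sigma2_hat={sigma2_hat:.3f} (true {true_sigma2})")
print(f"p_hat={p_hat:.3f} (true {true_p})")
```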
But we can’t just take any guess — we want estimators that are:
- Unbiased (correct on average),
- Consistent (improves with more data), and
- Efficient (lowest possible variance).
Why It Works This Way
Every estimate involves uncertainty. The goal isn’t perfection — it’s balancing bias and variance.
- Bias = how far the average estimate is from the true value.
- Variance = how much estimates fluctuate from sample to sample.
Too little data → high variance. Too simple a model → high bias.
This is the same logic behind underfitting vs. overfitting in ML.
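Here is a rough simulation of that trade-off (the 0.5 shrinkage factor below is arbitrary, picked purely for illustration): on many tiny samples, a deliberately biased "shrunk" mean can beat the unbiased sample mean on total error, because its variance is so much lower.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 2.0, 5.0, 5, 20_000

# Draw many tiny samples and apply both estimators to each one.
samples = rng.normal(true_mu, sigma, size=(trials, n))
plain = samples.mean(axis=1)     # unbiased, but noisy when n is tiny
shrunk = 0.5 * plain             # biased toward 0, but much less noisy

def report(name, est):
    bias = est.mean() - true_mu
    var = est.var()
    print(f"{name:>6}: bias={bias:+.3f}  variance={var:.3f}  mse={bias**2 + var:.3f}")

report("plain", plain)
report("shrunk", shrunk)   # lower total error despite the bias
```

Which estimator wins depends on the true mean, the noise level, and n; the point is only that bias and variance trade off against each other.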
How It Fits in ML Thinking
Machine learning is built entirely on estimation:
- Fitting a model = estimating parameters from data.
- Loss functions = quantifying estimation error.
- Regularization = intentionally adding bias to reduce variance.
Even advanced models (like neural networks) secretly perform Maximum Likelihood Estimation (MLE) or its cousin, Maximum A Posteriori Estimation (MAP).
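One way to see that connection, as a toy sketch rather than how any library actually trains a model: for binary labels, the average cross-entropy loss is the negative Bernoulli log-likelihood scaled by 1/n, so the constant probability that minimizes it is the plain sample proportion (the Bernoulli MLE).

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.7, size=500)   # synthetic binary labels

# Average binary cross-entropy of predicting a constant probability p:
#   -(1/n) * sum_i [ y_i*log(p) + (1 - y_i)*log(1 - p) ]
#   = -( y_bar*log(p) + (1 - y_bar)*log(1 - p) )
grid = np.linspace(0.01, 0.99, 981)
cross_entropy = -(y.mean() * np.log(grid) + (1 - y.mean()) * np.log(1 - grid))

print(f"cross-entropy minimizer:  {grid[np.argmin(cross_entropy)]:.3f}")
print(f"sample proportion (MLE):  {y.mean():.3f}")   # the same value
```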
📐 Step 3: Mathematical Foundation
Sampling & Sample Statistics
Given random variables $X_1, X_2, …, X_n$ sampled i.i.d. from a distribution with unknown parameter $\theta$:
Sample mean:
$$ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i $$→ Estimator for population mean $\mu$.
Sample variance:
$$ s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 $$→ Estimator for population variance $\sigma^2$.
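The $1/(n-1)$ divisor (Bessel's correction) is what makes $s^2$ unbiased. A quick simulation sketch, assuming a known true variance purely so we can check both versions against it:

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 9.0, 5, 50_000

# Many small samples from the same distribution.
x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
var_div_n = x.var(axis=1, ddof=0)     # divide by n     (biased low)
var_div_n1 = x.var(axis=1, ddof=1)    # divide by n - 1 (unbiased)

print(f"true variance:           {true_var}")
print(f"average divide-by-n:     {var_div_n.mean():.3f}")    # ~ 9 * (n-1)/n = 7.2
print(f"average divide-by-(n-1): {var_div_n1.mean():.3f}")   # ~ 9.0
```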
Maximum Likelihood Estimation (MLE)
We choose parameter $\theta$ that maximizes the probability of observing our data:
$$ \hat{\theta}_{MLE} = \arg\max_{\theta} P(X_1, X_2, ..., X_n | \theta) $$Equivalently, maximize the log-likelihood (for easier math):
$$ \ell(\theta) = \sum_{i=1}^n \log P(X_i | \theta) $$Example: For Gaussian data with unknown mean $\mu$,
$$ \ell(\mu) = -\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2 + \text{const.} $$Maximizing $\ell(\mu)$ (the constant doesn’t depend on $\mu$) gives $\hat{\mu} = \bar{X}$ — the sample mean.
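You can verify this numerically with a brute-force sketch (a simple grid search; any optimizer would do): the $\mu$ that maximizes $\ell(\mu)$ lands on the sample mean, up to grid resolution.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                               # assume sigma is known
x = rng.normal(10.0, sigma, size=200)

def log_likelihood(mu):
    # Gaussian log-likelihood of mu, dropping terms that don't depend on mu.
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2)

grid = np.linspace(x.min(), x.max(), 10_001)
mu_mle = grid[np.argmax([log_likelihood(m) for m in grid])]

print(f"grid-search MLE: {mu_mle:.4f}")
print(f"sample mean:     {x.mean():.4f}")  # agrees up to the grid spacing
```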
Bias, Variance & Mean Squared Error
For an estimator $\hat{\theta}$ of a true parameter $\theta$:
- Bias: $$ Bias(\hat{\theta}) = E[\hat{\theta}] - \theta $$
- Variance: $$ Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$
- Mean Squared Error (MSE): $$ MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Bias(\hat{\theta})^2 + Var(\hat{\theta}) $$ (checked numerically in the sketch below)
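A quick sanity check of that decomposition (simulated, so expect a little Monte Carlo noise), using the divide-by-$n$ variance estimator so that both the bias and the variance terms are nonzero:

```python
import numpy as np

rng = np.random.default_rng(4)
true_var, n, trials = 4.0, 8, 100_000

# The divide-by-n variance estimator applied to many small samples.
est = rng.normal(0.0, np.sqrt(true_var), size=(trials, n)).var(axis=1, ddof=0)

bias = est.mean() - true_var
variance = est.var()
mse_direct = np.mean((est - true_var) ** 2)

print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"direct MSE        = {mse_direct:.4f}")   # the two agree
```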
Estimator Properties
- Consistency: $\hat{\theta}_n \to \theta$ as $n \to \infty$ (the estimate converges to the truth; see the sketch after this list).
- Efficiency: Among unbiased estimators, the one with lowest variance.
- Sufficiency: Uses all information in the sample about $\theta$.
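Consistency is easy to watch happen: keep enlarging the sample and the estimate settles onto the truth. A sketch with an arbitrary true mean of 3 and a deliberately large noise scale:

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu = 3.0
x = rng.normal(true_mu, 10.0, size=1_000_000)   # one big pool of noisy draws

# The sample mean over ever-larger prefixes of the data.
for n in (10, 100, 10_000, 1_000_000):
    estimate = x[:n].mean()
    print(f"n={n:>9,}  sample mean={estimate:.4f}  |error|={abs(estimate - true_mu):.4f}")
```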
Confidence Intervals
A confidence interval (CI) gives a range of plausible values for $\theta$:
$$ \hat{\theta} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$where $z_{\alpha/2}$ is the standard normal critical value (≈ 1.96 for 95%) and $\sigma$ is the population standard deviation (in practice usually replaced by the sample $s$). Example: 95% CI for the mean = “We’re 95% confident that the true mean lies within this range.”
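Here is a sketch of both halves of that statement, assuming $\sigma$ is known so it matches the formula above: build one 95% interval, then repeat the experiment many times to confirm that roughly 95% of such intervals cover the true mean.

```python
import numpy as np

rng = np.random.default_rng(6)
true_mu, sigma, n = 50.0, 12.0, 40
z = 1.96                                   # z_{alpha/2} for a 95% interval
half_width = z * sigma / np.sqrt(n)

# One interval from one sample.
x = rng.normal(true_mu, sigma, size=n)
print(f"95% CI: [{x.mean() - half_width:.2f}, {x.mean() + half_width:.2f}]")

# Coverage over many repeated samples.
means = rng.normal(true_mu, sigma, size=(100_000, n)).mean(axis=1)
coverage = np.mean(np.abs(means - true_mu) <= half_width)
print(f"fraction of intervals containing the true mean: {coverage:.3f}")   # ~0.95
```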
Hypothesis Testing (Brief Intro)
We test claims about population parameters.
- Null hypothesis ($H_0$): Default assumption (e.g., “no difference in means”).
- Alternative ($H_1$): Competing claim.
- p-value: Probability of observing data at least as extreme as what we actually saw, assuming $H_0$ is true.
If $p < \alpha$ (significance level), we reject $H_0$.
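A minimal worked example using scipy.stats (the data is simulated with a true mean of 105, so the test will typically reject $H_0: \mu = 100$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=105.0, scale=10.0, size=50)   # data whose true mean is 105

# H0: the population mean is 100.  H1: it is not.
result = stats.ttest_1samp(sample, popmean=100.0)
print(f"t-statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")

alpha = 0.05
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```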
🧠 Step 4: Key Ideas
- Sampling provides a practical window into the population.
- MLE finds parameters that make observed data most probable.
- Bias–variance tradeoff defines the sweet spot between under- and overfitting.
- Confidence intervals and p-values quantify uncertainty.
- Consistency ensures estimates improve with more data.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- MLE has desirable properties: consistent, efficient, asymptotically normal.
- Provides clear probabilistic interpretation of estimation.
- Links directly to most machine learning objectives (e.g., cross-entropy, logistic regression).
Limitations & Trade-offs:
- Sensitive to small samples and outliers.
- Confidence intervals assume approximate normality (can fail for skewed data).
- Biased estimators may perform better in high-variance regimes.
🚧 Step 6: Common Misunderstandings
- Myth: The sample mean always equals the true mean. → Truth: It’s only an unbiased estimator — true on average, not in every sample.
- Myth: Unbiased estimators are always better. → Truth: A small bias may drastically lower variance — total error matters more.
- Myth: 95% confidence = 95% probability that the true value lies in this particular interval. → Truth: It means that 95% of intervals constructed this way would contain the true value over repeated sampling; it’s a statement about the procedure, not about any single interval.
🧩 Step 7: Mini Summary
🧠 What You Learned: Sampling lets us infer truths from limited data; estimation formalizes that inference through methods like MLE.
⚙️ How It Works: Estimators balance bias and variance to minimize total error. Confidence intervals and hypothesis testing quantify how reliable our inferences are.
🎯 Why It Matters: All of data science is inference under uncertainty. Understanding sampling and estimation is understanding how your model trusts its own predictions.