5.1. Bayesian Inference & Priors
🪄 Step 1: Intuition & Motivation
Core Idea: Bayesian inference is about learning from evidence — mathematically updating your beliefs when new data arrives. It answers:
“Given what I believed before, and what I just observed, what should I believe now?”
Simple Analogy: Imagine you’re guessing whether a coin is fair.
- Before flipping — you believe it’s probably fair (your prior).
- After a few flips — you update that belief using the data (your posterior).
- If you flip many times, your belief increasingly depends on data, not your prior.
Bayesian inference is like combining experience (prior) with evidence (data) to form updated wisdom (posterior).
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
At the heart of Bayesian inference lies Bayes’ theorem:
$$ P(\theta | D) = \frac{P(D | \theta)\, P(\theta)}{P(D)} $$
where:
- $P(\theta)$ = Prior (what we believe before seeing data)
- $P(D | \theta)$ = Likelihood (how likely the data is, given our assumption)
- $P(\theta | D)$ = Posterior (updated belief after seeing data)
- $P(D)$ = Evidence (a normalizing constant ensuring probabilities sum to 1)
The posterior represents a compromise between what we thought and what we saw.
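To make that compromise concrete, here is a minimal sketch (Python with NumPy) of Bayes’ theorem applied on a discrete grid of candidate coin biases. The grid size, the uniform prior, and the observed counts are illustrative assumptions, not values taken from anywhere above.

```python
import numpy as np

# A discrete-grid version of Bayes' theorem for a coin's bias p(heads).
# The grid resolution, uniform prior, and observed counts are illustrative.
theta = np.linspace(0, 1, 101)                  # candidate values of the bias
prior = np.ones_like(theta)
prior /= prior.sum()                            # uniform prior over the grid

heads, tails = 7, 3                             # assumed observations
likelihood = theta**heads * (1 - theta)**tails

unnormalized = likelihood * prior               # prior x likelihood
posterior = unnormalized / unnormalized.sum()   # dividing by the evidence P(D)

print("Posterior mean of the bias:", (theta * posterior).sum())   # ~0.67
```

Notice that the final division is the evidence $P(D)$ doing its job: it only rescales prior × likelihood so the posterior sums to 1.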
Why It Works This Way
The Bayesian approach treats everything as uncertain — even parameters! Instead of fixed values, parameters are random variables whose distributions evolve with evidence.
Frequentists say: “The parameter is fixed, data is random.” Bayesians say: “The data is fixed, but the parameter is uncertain.”
This shift allows us to reason probabilistically about our uncertainty in models themselves — a powerful perspective in data science and ML.
How It Fits in ML Thinking
Bayesian inference underpins many modern algorithms:
- Naïve Bayes: Uses priors over classes.
- Bayesian Networks: Graphical models built on conditional probabilities.
- Regularization: Can be seen as imposing a prior belief (e.g., “weights should be small”); see the sketch after this list.
- Bayesian Optimization: Updates belief about a function’s shape using observed samples.
In ML terms, Bayesian reasoning = “belief updating with data.”
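To see the regularization bullet in action, here is a sketch (Python with NumPy, all numbers invented for illustration) showing that the MAP estimate of linear-regression weights under a zero-mean Gaussian prior coincides with ridge regression with penalty $\lambda = \sigma^2 / \tau^2$:

```python
import numpy as np

# Illustrative link between L2 regularization and a Gaussian prior on the weights.
# Assumes Gaussian noise with variance sigma2 and a zero-mean Gaussian prior with
# variance tau2 on each weight; the data below are simulated for the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=1.0, size=50)

sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2                          # implied L2 penalty strength

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP estimate: mode of the posterior over w under the Gaussian prior
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))           # True: same estimate, two viewpoints
```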
📐 Step 3: Mathematical Foundation
Let’s make the abstract concrete by working through examples.
🧩 1. Conjugate Priors
What They Are & Why They Matter
A conjugate prior is a prior distribution that, when combined with a likelihood, yields a posterior of the same family. This makes Bayesian updating analytically simple and elegant.
Example 1: Beta-Binomial Model (Coin Flips)
If the data comes from a Binomial likelihood (e.g., the number of heads in coin flips), and the prior on the probability $p$ is Beta$(\alpha, \beta)$, then the posterior is also Beta:
$$ p | D \sim \text{Beta}(\alpha + k,\ \beta + n - k) $$
where:
- $k$ = number of successes (heads)
- $n$ = number of trials
Interpretation: You start with “pseudo-counts” of $\alpha - 1$ heads and $\beta - 1$ tails. Each new observation updates these counts.
If $\alpha = \beta = 1$, that’s a uniform prior — no initial bias.
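A minimal sketch of this update with SciPy; the Beta(1, 1) prior and the 7-heads-in-10-flips data are assumptions chosen purely for illustration:

```python
from scipy import stats

# Conjugate Beta-Binomial update: prior Beta(a, b), data k heads in n flips.
a, b = 1.0, 1.0                            # uniform prior, no initial bias
k, n = 7, 10                               # assumed data: 7 heads in 10 flips

posterior = stats.beta(a + k, b + n - k)   # still a Beta distribution

print("Posterior mean:", posterior.mean())               # (a+k)/(a+b+n) ~ 0.667
print("95% credible interval:", posterior.interval(0.95))
```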
Example 2: Normal-Normal Model (Mean Estimation)
For data with Gaussian noise, if the prior on the mean $\mu$ is $N(\mu_0, \sigma_0^2)$ and the likelihood is $Y_i \sim N(\mu, \sigma^2)$ with known $\sigma^2$, then the posterior is also normal:
$$ \mu | D \sim N\left( \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{Y}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},\; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} \right) $$
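A short sketch of this update in Python; the prior parameters, the known noise variance, and the simulated data are all illustrative assumptions:

```python
import numpy as np

# Normal-Normal update for an unknown mean with known noise variance sigma_sq.
rng = np.random.default_rng(1)
mu0, sigma0_sq = 0.0, 4.0        # prior belief about the mean
sigma_sq = 1.0                   # known observation noise variance
y = rng.normal(loc=2.0, scale=np.sqrt(sigma_sq), size=20)   # simulated data
n, y_bar = len(y), y.mean()

precision = 1 / sigma0_sq + n / sigma_sq          # precisions add
post_var = 1 / precision
post_mean = post_var * (mu0 / sigma0_sq + n * y_bar / sigma_sq)

print("Posterior mean:", post_mean)   # pulled from mu0 toward the sample mean
print("Posterior variance:", post_var)
```

The posterior precision is the sum of the prior precision and the data precision, which is why more data shrinks the posterior variance.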
🎯 2. Posterior Predictive Distribution
The Next Step in Bayesian Thinking
After updating your posterior belief about $\theta$, you might ask:
“Given my new belief, what do I expect to see next?”
This is the posterior predictive distribution:
$$ P(x_{new} | D) = \int P(x_{new} | \theta)\, P(\theta | D)\, d\theta $$
It averages predictions across all possible parameter values, weighted by how plausible they are (from the posterior).
Interpretation: It’s a probabilistic “forecast” — accounting for both model uncertainty and data uncertainty.
Example: In the coin-flip Beta-Binomial case:
$$ P(\text{next flip = head} | D) = \frac{\alpha + k}{\alpha + \beta + n} $$
So your “updated” belief about the next event directly reflects your new counts of heads and tails.
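The sketch below (SciPy, with an assumed Beta(1, 1) prior and 7 heads in 10 flips) checks this closed form against a Monte Carlo average of $\theta$ over posterior draws, which is exactly what the integral computes:

```python
from scipy import stats

# Posterior predictive for the Beta-Binomial coin: closed form vs. a Monte Carlo
# average over posterior draws. Prior and data values are illustrative.
a, b, k, n = 1.0, 1.0, 7, 10

closed_form = (a + k) / (a + b + n)       # (alpha + k) / (alpha + beta + n)

# Average P(head | theta) = theta over draws from the posterior Beta(a+k, b+n-k)
theta_draws = stats.beta(a + k, b + n - k).rvs(size=100_000, random_state=2)
monte_carlo = theta_draws.mean()

print(closed_form, monte_carlo)           # both ~ 0.667
```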
🔄 3. Bayesian Updating by Hand
Example Walkthrough
Let’s revisit the coin-flip example:
Start with prior: $p \sim \text{Beta}(2, 2)$ (you think the coin is roughly fair).
Observe data: 3 heads, 1 tail → $k = 3, n = 4$.
Update posterior:
$$ p | D \sim \text{Beta}(2 + 3,\ 2 + 1) = \text{Beta}(5, 3) $$
Posterior mean:
$$ E[p|D] = \frac{5}{5 + 3} = 0.625 $$
So after 3 heads and 1 tail, your belief shifts — you now think the coin lands heads ~62.5% of the time.
If you had started with a stronger prior (say Beta(20,20)), the same data would move your belief much less.
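The same walkthrough in code, including the stronger Beta(20, 20) prior mentioned above (SciPy sketch):

```python
from scipy import stats

# The walkthrough above, in code: 3 heads and 1 tail under two different priors.
k, n = 3, 4

weak = stats.beta(2 + k, 2 + (n - k))       # Beta(2, 2) prior  -> Beta(5, 3)
strong = stats.beta(20 + k, 20 + (n - k))   # Beta(20, 20) prior -> Beta(23, 21)

print("Weak prior posterior mean:  ", weak.mean())    # 5/8 = 0.625
print("Strong prior posterior mean:", strong.mean())  # 23/44 ~ 0.523, barely moved
```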
💭 Probing Question:
“If you have little data, how does your choice of prior affect the posterior?”
Answer: When data is scarce, the prior dominates the posterior — your beliefs heavily shape your conclusions. As data grows, the likelihood (data evidence) overwhelms the prior, and everyone’s posterior beliefs converge.
Rule of Thumb:
- Little data → Prior matters a lot (be cautious, or use weakly informative priors).
- Plenty of data → Posterior concentrates near the frequentist (maximum-likelihood) estimate.
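A small simulation makes the rule of thumb visible; the true bias of 0.7 and the two priors are illustrative assumptions:

```python
import numpy as np

# Posterior means under two different Beta priors as the sample size grows.
rng = np.random.default_rng(3)
priors = {"Beta(2, 2)": (2, 2), "Beta(20, 20)": (20, 20)}

for n in (5, 50, 500):
    flips = rng.random(n) < 0.7              # simulate n coin flips
    k = flips.sum()                          # number of heads
    means = {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}
    print(n, means)                          # the posterior means converge as n grows
```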
🧠 Step 4: Assumptions or Key Ideas
- Parameters are treated as random variables.
- The prior should reflect genuine knowledge (or neutrality if unknown).
- Posterior balances prior beliefs and observed data.
- Conjugate priors simplify the math but are not required — modern methods use MCMC to sample from the posterior under any prior (see the sketch below).
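To illustrate that last bullet, here is a minimal random-walk Metropolis sketch for the coin bias under a deliberately non-conjugate prior. The prior shape, proposal scale, and iteration counts are arbitrary choices for illustration; in practice you would typically use a library such as PyMC or Stan.

```python
import numpy as np

# Minimal random-walk Metropolis sampler for a coin bias p with a non-conjugate
# prior (a Gaussian bump centred at 0.5, restricted to (0, 1)). All settings
# here are arbitrary illustrative choices.
rng = np.random.default_rng(4)
k, n = 7, 10                                    # assumed data: 7 heads in 10 flips

def log_prior(p):
    return -((p - 0.5) ** 2) / (2 * 0.1 ** 2)   # any (log) prior works here

def log_post(p):
    if not (0.0 < p < 1.0):
        return -np.inf                          # zero posterior outside (0, 1)
    return k * np.log(p) + (n - k) * np.log(1 - p) + log_prior(p)

samples, p = [], 0.5
for _ in range(20_000):
    proposal = p + rng.normal(scale=0.1)        # symmetric random-walk proposal
    if np.log(rng.random()) < log_post(proposal) - log_post(p):
        p = proposal                            # accept; otherwise keep current p
    samples.append(p)

print("Posterior mean of p:", np.mean(samples[2_000:]))   # discard burn-in
```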
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Naturally incorporates prior knowledge.
- Provides full distributions, not just point estimates.
- Adapts gracefully as new data arrives.
- Choice of prior can bias results (especially with limited data).
- Computation can be complex for high-dimensional models.
- Interpretation requires careful probabilistic reasoning.
🚧 Step 6: Common Misunderstandings
- “The prior is just guessing.” → No — it encodes prior evidence, expert knowledge, or symmetry.
- “Bayesian = subjective.” → Priors can be subjective, but the updating process is objective.
- “Frequentist vs Bayesian is a competition.” → They’re complementary lenses — Bayesians quantify uncertainty explicitly, Frequentists rely on long-run frequency logic.
🧩 Step 7: Mini Summary
🧠 What You Learned: Bayesian inference updates beliefs about parameters as new data arrives, balancing prior assumptions with observed evidence.
⚙️ How It Works: Through Bayes’ theorem — the posterior ∝ prior × likelihood. Conjugate priors make this update mathematically elegant.
🎯 Why It Matters: This framework powers adaptive, uncertainty-aware models — vital in ML, from Naïve Bayes to Bayesian Neural Networks.