5.1. Bayesian Inference & Priors

🪄 Step 1: Intuition & Motivation

  • Core Idea: Bayesian inference is about learning from evidence — mathematically updating your beliefs when new data arrives. It answers:

    “Given what I believed before, and what I just observed, what should I believe now?”

  • Simple Analogy: Imagine you’re guessing whether a coin is fair.

    • Before flipping — you believe it’s probably fair (your prior).
    • After a few flips — you update that belief using the data (your posterior).
    • If you flip many times, your belief increasingly depends on data, not your prior.

    Bayesian inference is like combining experience (prior) with evidence (data) to form updated wisdom (posterior).


🌱 Step 2: Core Concept

What’s Happening Under the Hood?

At the heart of Bayesian inference lies Bayes’ theorem:

$$ P(\theta | D) = \frac{P(D | \theta)\, P(\theta)}{P(D)} $$

where:

  • $P(\theta)$ = Prior (what we believe before seeing data)
  • $P(D | \theta)$ = Likelihood (how likely the data is, given our assumption)
  • $P(\theta | D)$ = Posterior (updated belief after seeing data)
  • $P(D)$ = Evidence (a normalizing constant ensuring the posterior integrates to 1)

The posterior represents a compromise between what we thought and what we saw.
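
To make the formula concrete, here is a tiny numerical sketch in Python. The setup is an illustrative assumption (not from the text above): two candidate coins, "fair" with $p = 0.5$ and "biased" with $p = 0.8$, each believed equally likely beforehand, and an observation of 3 heads in 4 flips.

```python
# A toy, discrete version of Bayes' theorem (all numbers are illustrative):
# two candidate coins -- "fair" (p = 0.5) and "biased" (p = 0.8) -- and an
# observation of 3 heads in 4 flips.
from scipy.stats import binom

priors = {"fair": 0.5, "biased": 0.5}          # P(theta): belief before data
p_heads = {"fair": 0.5, "biased": 0.8}         # head probability under each hypothesis

k, n = 3, 4                                    # observed: 3 heads in 4 flips
likelihood = {h: binom.pmf(k, n, p) for h, p in p_heads.items()}       # P(D | theta)

evidence = sum(priors[h] * likelihood[h] for h in priors)              # P(D)
posterior = {h: priors[h] * likelihood[h] / evidence for h in priors}  # P(theta | D)

print(posterior)   # roughly {'fair': 0.38, 'biased': 0.62}
```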

Why It Works This Way

The Bayesian approach treats everything as uncertain — even parameters! Instead of fixed values, parameters are random variables whose distributions evolve with evidence.

Frequentists say: “The parameter is fixed, data is random.” Bayesians say: “The data is fixed, but the parameter is uncertain.”

This shift allows us to reason probabilistically about our uncertainty in models themselves — a powerful perspective in data science and ML.

How It Fits in ML Thinking

Bayesian inference underpins many modern algorithms:

  • Naïve Bayes: Combines class priors with conditionally independent feature likelihoods via Bayes’ theorem.
  • Bayesian Networks: Graphical models built on conditional probabilities.
  • Regularization: Can be seen as imposing a prior belief (e.g., “weights should be small”); a short sketch of this appears after this list.
  • Bayesian Optimization: Updates belief about a function’s shape using observed samples.

In ML terms, Bayesian reasoning = “belief updating with data.”
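
To make the regularization bullet above concrete: for linear regression with Gaussian noise and a zero-mean Gaussian prior on the weights, the MAP estimate coincides with ridge regression. The sketch below is a minimal illustration of that correspondence, using synthetic data and illustrative variance values.

```python
# Sketch: MAP estimation for linear regression with a zero-mean Gaussian
# prior on the weights reduces to ridge regression (L2 regularization).
# Synthetic data; sigma2 and tau2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=50)

sigma2 = 0.25          # noise variance (assumed known)
tau2 = 1.0             # prior variance of each weight
lam = sigma2 / tau2    # equivalent ridge penalty lambda

# MAP / ridge closed form: w = (X^T X + lam * I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_map)           # close to true_w, shrunk slightly toward zero
```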


📐 Step 3: Mathematical Foundation

Let’s make the abstract concrete by working through examples.


🧩 1. Conjugate Priors

What They Are & Why They Matter

A conjugate prior is a prior distribution that, when combined with a likelihood, yields a posterior of the same family. This makes Bayesian updating analytically simple and elegant.

Example 1: Beta-Binomial Model (Coin Flips)

If data comes from a Binomial likelihood (e.g., number of heads in coin flips), and the prior on the probability $p$ is Beta$(\alpha, \beta)$, then the posterior is also Beta:

$$ p | D \sim \text{Beta}(\alpha + k,\ \beta + n - k) $$

where:

  • $k$ = number of successes (heads)
  • $n$ = number of trials

Interpretation: You start with “pseudo-counts” of $\alpha - 1$ heads and $\beta - 1$ tails. Each new observation updates these counts.

If $\alpha = \beta = 1$, that’s a uniform prior — no initial bias.
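
A minimal sketch of this update in Python, assuming an illustrative Beta(2, 2) prior and 7 heads in 10 flips (these numbers are not from the text above):

```python
# Beta-Binomial update with illustrative numbers: Beta(2, 2) prior,
# then 7 heads out of 10 flips.
from scipy.stats import beta

alpha0, beta0 = 2, 2             # prior "pseudo-count" parameters
k, n = 7, 10                     # observed heads, total flips

posterior = beta(alpha0 + k, beta0 + (n - k))   # Beta(alpha + k, beta + n - k)

print(posterior.mean())          # posterior mean of p: 9/14 ≈ 0.643
print(posterior.interval(0.95))  # central 95% credible interval for p
```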

Example 2: Normal-Normal Model (Mean Estimation)

For data with Gaussian noise, if the prior on the mean $\mu$ is $N(\mu_0, \sigma_0^2)$ and the likelihood is $Y_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, then the posterior is also normal:

$$ \mu | D \sim N\left( \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{Y}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} \right) $$

Conjugate priors are like plug-and-play beliefs — your posterior stays in the same “family,” just with updated parameters.
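
The same Normal-Normal update, coded directly from the formula above; the prior, noise level, and synthetic data are illustrative assumptions:

```python
# Normal-Normal update: code version of the posterior mean/variance formula
# above. Prior, noise level, and data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                    # known noise standard deviation
y = rng.normal(loc=5.0, scale=sigma, size=20)  # synthetic data, true mean 5
n, ybar = len(y), y.mean()

mu0, sigma0 = 0.0, 10.0                        # weak prior on the mean

post_precision = 1 / sigma0**2 + n / sigma**2
post_var = 1 / post_precision
post_mean = post_var * (mu0 / sigma0**2 + n * ybar / sigma**2)

print(post_mean, post_var)                     # posterior is N(post_mean, post_var)
```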

🎯 2. Posterior Predictive Distribution

The Next-Step of Bayesian Thinking

After updating your posterior belief about $\theta$, you might ask:

“Given my new belief, what do I expect to see next?”

This is the posterior predictive distribution:

$$ P(x_{new} | D) = \int P(x_{new} | \theta)\, P(\theta | D)\, d\theta $$

It averages predictions across all possible parameter values weighted by how plausible they are (from the posterior).

Interpretation: It’s a probabilistic “forecast” — accounting for both model uncertainty and data uncertainty.

Example: In the coin-flip Beta-Binomial case:

$$ P(\text{next flip = head} | D) = \frac{\alpha + k}{\alpha + \beta + n} $$

So your “updated” belief about the next event directly reflects your new counts of heads and tails.

Frequentists estimate one parameter and plug it in; Bayesians average over all possible parameters, weighted by belief.
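
A tiny numerical contrast of the two approaches, with counts chosen purely for illustration (a uniform Beta(1, 1) prior and 2 heads in 2 flips) so the difference is easy to see:

```python
# Plug-in vs posterior predictive, with illustrative counts chosen to make
# the contrast visible: uniform Beta(1, 1) prior, then 2 heads in 2 flips.
alpha0, beta0 = 1, 1
k, n = 2, 2

p_plugin = k / n                                    # frequentist MLE, plugged in
p_predictive = (alpha0 + k) / (alpha0 + beta0 + n)  # Bayesian posterior predictive

print(p_plugin)      # 1.0  -- "the next flip is certainly heads"
print(p_predictive)  # 0.75 -- uncertainty about p is still reflected
```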

🔄 3. Bayesian Updating by Hand

Example Walkthrough

Let’s revisit the coin-flip example:

  1. Start with prior: $p \sim \text{Beta}(2, 2)$ (you think the coin is roughly fair).

  2. Observe data: 3 heads, 1 tail → $k = 3, n = 4$.

  3. Update posterior:

    $$ p | D \sim \text{Beta}(2 + 3, 2 + 1) = \text{Beta}(5, 3) $$

  4. Posterior mean:

    $$ E[p|D] = \frac{5}{5 + 3} = 0.625 $$

So after 3 heads and 1 tail, your belief shifts — you now think the coin lands heads ~62.5% of the time.

If you had started with a stronger prior (say Beta(20,20)), the same data would move your belief much less.

The more confident your prior, the more “stubborn” your belief; the more data, the more persuasive the evidence.
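
The same walkthrough in code, including the stronger Beta(20, 20) prior mentioned above:

```python
# The walkthrough above in code: Beta(2, 2) prior, 3 heads and 1 tail,
# plus the stronger Beta(20, 20) prior for comparison.
from scipy.stats import beta

k, n = 3, 4                                    # 3 heads, 1 tail

weak = beta(2 + k, 2 + (n - k))                # Beta(5, 3)
strong = beta(20 + k, 20 + (n - k))            # Beta(23, 21)

print(weak.mean())     # 0.625  -- the data moves a mild prior a lot
print(strong.mean())   # ≈ 0.523 -- the same data barely moves a confident prior
```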

💭 Probing Question:

“If you have little data, how does your choice of prior affect the posterior?”

Answer: When data is scarce, the prior dominates the posterior — your beliefs heavily shape your conclusions. As data grows, the likelihood (data evidence) overwhelms the prior, and everyone’s posterior beliefs converge.

Rule of Thumb:

  • Little data → Prior matters a lot (be cautious, or use weakly informative priors).
  • Plenty of data → Posterior concentrates around the maximum-likelihood (frequentist) estimate.
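
A quick numerical check of this rule of thumb, simulating a coin with true $p = 0.7$ under two illustrative priors:

```python
# Same coin (true p = 0.7), two priors, little vs lots of data.
# The priors and sample sizes here are illustrative.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)
true_p = 0.7

for n in (5, 5000):
    k = rng.binomial(n, true_p)
    skeptic = beta(20 + k, 20 + n - k).mean()   # strong prior centered on 0.5
    agnostic = beta(1 + k, 1 + n - k).mean()    # uniform (weakly informative) prior
    print(n, round(skeptic, 3), round(agnostic, 3))
# With n = 5 the two posterior means differ noticeably;
# with n = 5000 both sit near 0.7 -- the likelihood dominates.
```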

🧠 Step 4: Assumptions or Key Ideas

  • Parameters are treated as random variables.
  • The prior should reflect genuine knowledge (or neutrality if unknown).
  • Posterior balances prior beliefs and observed data.
  • Conjugate priors simplify math but are not required — modern methods use MCMC for any prior.
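
To illustrate the last bullet, here is a deliberately minimal random-walk Metropolis sketch for the coin's $p$ under a non-conjugate "triangle" prior peaked at 0.5. Every choice here (the prior, proposal width, iteration count, burn-in) is an illustrative assumption, not a production recipe.

```python
# Minimal random-walk Metropolis sampler for the coin's p under a
# non-conjugate "triangle" prior peaked at 0.5. All settings (prior,
# proposal width, iteration count, burn-in) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, n = 7, 10                                    # observed heads / flips

def log_posterior(p):
    if not 0.0 < p < 1.0:
        return -np.inf                          # outside the support
    log_prior = np.log(1.0 - abs(2.0 * p - 1.0))          # triangle prior on (0, 1)
    log_lik = k * np.log(p) + (n - k) * np.log(1.0 - p)   # Binomial likelihood (up to a constant)
    return log_prior + log_lik

samples, p = [], 0.5
for _ in range(20_000):
    proposal = p + rng.normal(scale=0.1)        # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal                            # accept; otherwise keep current p
    samples.append(p)

print(np.mean(samples[2_000:]))                 # posterior mean after burn-in
```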

⚖️ Step 5: Strengths, Limitations & Trade-offs

Strengths:

  • Naturally incorporates prior knowledge.
  • Provides full distributions, not just point estimates.
  • Adapts gracefully as new data arrives.

Limitations:

  • Choice of prior can bias results (especially with limited data).
  • Computation can be complex for high-dimensional models.
  • Interpretation requires careful probabilistic reasoning.

Bayesian inference trades simplicity for flexibility — it models uncertainty holistically but requires thoughtful priors and computational tools.

🚧 Step 6: Common Misunderstandings

  • “The prior is just guessing.” → No — it encodes prior evidence, expert knowledge, or symmetry.
  • “Bayesian = subjective.” → Priors can be subjective, but the updating process is objective.
  • “Frequentist vs Bayesian is a competition.” → They’re complementary lenses — Bayesians quantify uncertainty explicitly, Frequentists rely on long-run frequency logic.

🧩 Step 7: Mini Summary

🧠 What You Learned: Bayesian inference updates beliefs about parameters as new data arrives, balancing prior assumptions with observed evidence.

⚙️ How It Works: Through Bayes’ theorem — the posterior ∝ prior × likelihood. Conjugate priors make this update mathematically elegant.

🎯 Why It Matters: This framework powers adaptive, uncertainty-aware models — vital in ML, from Naïve Bayes to Bayesian Neural Networks.
