5.1. Bayesian Inference & Priors
🪄 Step 1: Intuition & Motivation
Core Idea: Bayesian inference is about learning from evidence — mathematically updating your beliefs when new data arrives. It answers:
“Given what I believed before, and what I just observed, what should I believe now?”
Simple Analogy: Imagine you’re guessing whether a coin is fair.
- Before flipping — you believe it’s probably fair (your prior).
- After a few flips — you update that belief using the data (your posterior).
- If you flip many times, your belief increasingly depends on data, not your prior.
Bayesian inference is like combining experience (prior) with evidence (data) to form updated wisdom (posterior).
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
At the heart of Bayesian inference lies Bayes’ theorem:
$$ P(\theta | D) = \frac{P(D | \theta)\, P(\theta)}{P(D)} $$
where:
- $P(\theta)$ = Prior (what we believe before seeing data)
- $P(D | \theta)$ = Likelihood (how likely the data is, given our assumption)
- $P(\theta | D)$ = Posterior (updated belief after seeing data)
- $P(D)$ = Evidence (a normalizing constant ensuring probabilities sum to 1)
The posterior represents a compromise between what we thought and what we saw.
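To make that compromise concrete, here is a minimal sketch (Python with NumPy) of Bayes’ theorem applied on a discrete grid of candidate coin biases. The grid size, the uniform prior, and the observed counts are illustrative assumptions, not values taken from anywhere above.

```python
import numpy as np

# A discrete-grid version of Bayes' theorem for a coin's bias p(heads).
# The grid resolution, uniform prior, and observed counts are illustrative.
theta = np.linspace(0, 1, 101)                  # candidate values of the bias
prior = np.ones_like(theta)
prior /= prior.sum()                            # uniform prior over the grid

heads, tails = 7, 3                             # assumed observations
likelihood = theta**heads * (1 - theta)**tails

unnormalized = likelihood * prior               # prior x likelihood
posterior = unnormalized / unnormalized.sum()   # dividing by the evidence P(D)

print("Posterior mean of the bias:", (theta * posterior).sum())   # ~0.67
```

Notice that the final division is the evidence $P(D)$ doing its job: it only rescales prior × likelihood so the posterior sums to 1.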
Why It Works This Way
The Bayesian approach treats everything as uncertain — even parameters! Instead of fixed values, parameters are random variables whose distributions evolve with evidence.
Frequentists say: “The parameter is fixed, data is random.” Bayesians say: “The data is fixed, but the parameter is uncertain.”
This shift allows us to reason probabilistically about our uncertainty in models themselves — a powerful perspective in data science and ML.
How It Fits in ML Thinking
Bayesian inference underpins many modern algorithms:
- Naïve Bayes: Uses priors over classes.
- Bayesian Networks: Graphical models built on conditional probabilities.
- Regularization: Can be seen as imposing a prior belief (e.g., “weights should be small”); see the sketch after this list.
- Bayesian Optimization: Updates belief about a function’s shape using observed samples.
In ML terms, Bayesian reasoning = “belief updating with data.”
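To see the regularization bullet in action, here is a sketch (Python with NumPy, all numbers invented for illustration) showing that the MAP estimate of linear-regression weights under a zero-mean Gaussian prior coincides with ridge regression with penalty $\lambda = \sigma^2 / \tau^2$:

```python
import numpy as np

# Illustrative link between L2 regularization and a Gaussian prior on the weights.
# Assumes Gaussian noise with variance sigma2 and a zero-mean Gaussian prior with
# variance tau2 on each weight; the data below are simulated for the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=1.0, size=50)

sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2                          # implied L2 penalty strength

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP estimate: mode of the posterior over w under the Gaussian prior
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))           # True: same estimate, two viewpoints
```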
📐 Step 3: Mathematical Foundation
Let’s make the abstract concrete by working through examples.
🧩 1. Conjugate Priors
What They Are & Why They Matter
A conjugate prior is a prior distribution that, when combined with a likelihood, yields a posterior of the same family. This makes Bayesian updating analytically simple and elegant.
Example 1: Beta-Binomial Model (Coin Flips)
If the data comes from a Binomial likelihood (e.g., the number of heads in coin flips), and the prior on the probability $p$ is Beta$(\alpha, \beta)$, then the posterior is also Beta:
$$ p | D \sim \text{Beta}(\alpha + k,\ \beta + n - k) $$
where:
- $k$ = number of successes (heads)
- $n$ = number of trials
Interpretation: You start with “pseudo-counts” of $\alpha - 1$ heads and $\beta - 1$ tails. Each new observation updates these counts.
If $\alpha = \beta = 1$, that’s a uniform prior — no initial bias.
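A minimal sketch of this update with SciPy; the Beta(1, 1) prior and the 7-heads-in-10-flips data are assumptions chosen purely for illustration:

```python
from scipy import stats

# Conjugate Beta-Binomial update: prior Beta(a, b), data k heads in n flips.
a, b = 1.0, 1.0                            # uniform prior, no initial bias
k, n = 7, 10                               # assumed data: 7 heads in 10 flips

posterior = stats.beta(a + k, b + n - k)   # still a Beta distribution

print("Posterior mean:", posterior.mean())               # (a+k)/(a+b+n) ~ 0.667
print("95% credible interval:", posterior.interval(0.95))
```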
Example 2: Normal-Normal Model (Mean Estimation)
For data with Gaussian noise, if the prior on the mean $\mu$ is $N(\mu_0, \sigma_0^2)$ and the likelihood is $Y_i \sim N(\mu, \sigma^2)$ with known $\sigma^2$, then the posterior is also normal:
$$ \mu | D \sim N\left( \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{Y}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},\; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} \right) $$
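A short sketch of this update in Python; the prior parameters, the known noise variance, and the simulated data are all illustrative assumptions:

```python
import numpy as np

# Normal-Normal update for an unknown mean with known noise variance sigma_sq.
rng = np.random.default_rng(1)
mu0, sigma0_sq = 0.0, 4.0        # prior belief about the mean
sigma_sq = 1.0                   # known observation noise variance
y = rng.normal(loc=2.0, scale=np.sqrt(sigma_sq), size=20)   # simulated data
n, y_bar = len(y), y.mean()

precision = 1 / sigma0_sq + n / sigma_sq          # precisions add
post_var = 1 / precision
post_mean = post_var * (mu0 / sigma0_sq + n * y_bar / sigma_sq)

print("Posterior mean:", post_mean)   # pulled from mu0 toward the sample mean
print("Posterior variance:", post_var)
```

The posterior precision is the sum of the prior precision and the data precision, which is why more data shrinks the posterior variance.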
🎯 2. Posterior Predictive Distribution
The Next Step in Bayesian Thinking
After updating your posterior belief about $\theta$, you might ask:
“Given my new belief, what do I expect to see next?”
This is the posterior predictive distribution:
$$ P(x_{new} | D) = \int P(x_{new} | \theta)\, P(\theta | D)\, d\theta $$
It averages predictions across all possible parameter values, weighted by how plausible they are (from the posterior).
Interpretation: It’s a probabilistic “forecast” — accounting for both model uncertainty and data uncertainty.
Example: In the coin-flip Beta-Binomial case:
$$ P(\text{next flip = head} | D) = \frac{\alpha + k}{\alpha + \beta + n} $$
So your “updated” belief about the next event directly reflects your new counts of heads and tails.
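The sketch below (SciPy, with an assumed Beta(1, 1) prior and 7 heads in 10 flips) checks this closed form against a Monte Carlo average of $\theta$ over posterior draws, which is exactly what the integral computes:

```python
from scipy import stats

# Posterior predictive for the Beta-Binomial coin: closed form vs. a Monte Carlo
# average over posterior draws. Prior and data values are illustrative.
a, b, k, n = 1.0, 1.0, 7, 10

closed_form = (a + k) / (a + b + n)       # (alpha + k) / (alpha + beta + n)

# Average P(head | theta) = theta over draws from the posterior Beta(a+k, b+n-k)
theta_draws = stats.beta(a + k, b + n - k).rvs(size=100_000, random_state=2)
monte_carlo = theta_draws.mean()

print(closed_form, monte_carlo)           # both ~ 0.667
```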
🔄 3. Bayesian Updating by Hand
Example Walkthrough
Let’s revisit the coin-flip example:
Start with prior: $p \sim \text{Beta}(2, 2)$ (you think the coin is roughly fair).
Observe data: 3 heads, 1 tail → $k = 3, n = 4$.
Update posterior:
$$ p | D \sim \text{Beta}(2 + 3,\ 2 + 1) = \text{Beta}(5, 3) $$
Posterior mean:
$$ E[p|D] = \frac{5}{5 + 3} = 0.625 $$
So after 3 heads and 1 tail, your belief shifts — you now think the coin lands heads ~62.5% of the time.
If you had started with a stronger prior (say Beta(20,20)), the same data would move your belief much less.
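The same walkthrough in code, including the stronger Beta(20, 20) prior mentioned above (SciPy sketch):

```python
from scipy import stats

# The walkthrough above, in code: 3 heads and 1 tail under two different priors.
k, n = 3, 4

weak = stats.beta(2 + k, 2 + (n - k))       # Beta(2, 2) prior  -> Beta(5, 3)
strong = stats.beta(20 + k, 20 + (n - k))   # Beta(20, 20) prior -> Beta(23, 21)

print("Weak prior posterior mean:  ", weak.mean())    # 5/8 = 0.625
print("Strong prior posterior mean:", strong.mean())  # 23/44 ~ 0.523, barely moved
```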
💭 Probing Question:
“If you have little data, how does your choice of prior affect the posterior?”
Answer: When data is scarce, the prior dominates the posterior — your beliefs heavily shape your conclusions. As data grows, the likelihood (data evidence) overwhelms the prior, and everyone’s posterior beliefs converge.
Rule of Thumb:
- Little data → Prior matters a lot (be cautious, or use weakly informative priors).
- Plenty of data → Posterior concentrates near the frequentist (maximum-likelihood) estimate.
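A small simulation makes the rule of thumb visible; the true bias of 0.7 and the two priors are illustrative assumptions:

```python
import numpy as np

# Posterior means under two different Beta priors as the sample size grows.
rng = np.random.default_rng(3)
priors = {"Beta(2, 2)": (2, 2), "Beta(20, 20)": (20, 20)}

for n in (5, 50, 500):
    flips = rng.random(n) < 0.7              # simulate n coin flips
    k = flips.sum()                          # number of heads
    means = {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}
    print(n, means)                          # the posterior means converge as n grows
```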
🧠 Step 4: Assumptions or Key Ideas
- Parameters are treated as random variables.
- The prior should reflect genuine knowledge (or neutrality if unknown).
- Posterior balances prior beliefs and observed data.
- Conjugate priors simplify the math but are not required — modern methods use MCMC to sample from the posterior under any prior (see the sketch below).
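To illustrate that last bullet, here is a minimal random-walk Metropolis sketch for the coin bias under a deliberately non-conjugate prior. The prior shape, proposal scale, and iteration counts are arbitrary choices for illustration; in practice you would typically use a library such as PyMC or Stan.

```python
import numpy as np

# Minimal random-walk Metropolis sampler for a coin bias p with a non-conjugate
# prior (a Gaussian bump centred at 0.5, restricted to (0, 1)). All settings
# here are arbitrary illustrative choices.
rng = np.random.default_rng(4)
k, n = 7, 10                                    # assumed data: 7 heads in 10 flips

def log_prior(p):
    return -((p - 0.5) ** 2) / (2 * 0.1 ** 2)   # any (log) prior works here

def log_post(p):
    if not (0.0 < p < 1.0):
        return -np.inf                          # zero posterior outside (0, 1)
    return k * np.log(p) + (n - k) * np.log(1 - p) + log_prior(p)

samples, p = [], 0.5
for _ in range(20_000):
    proposal = p + rng.normal(scale=0.1)        # symmetric random-walk proposal
    if np.log(rng.random()) < log_post(proposal) - log_post(p):
        p = proposal                            # accept; otherwise keep current p
    samples.append(p)

print("Posterior mean of p:", np.mean(samples[2_000:]))   # discard burn-in
```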
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Naturally incorporates prior knowledge.
- Provides full distributions, not just point estimates.
- Adapts gracefully as new data arrives.
- Choice of prior can bias results (especially with limited data).
- Computation can be complex for high-dimensional models.
- Interpretation requires careful probabilistic reasoning.
🚧 Step 6: Common Misunderstandings
- “The prior is just guessing.” → No — it encodes prior evidence, expert knowledge, or symmetry.
- “Bayesian = subjective.” → Priors can be subjective, but the updating process is objective.
- “Frequentist vs Bayesian is a competition.” → They’re complementary lenses — Bayesians quantify uncertainty explicitly, Frequentists rely on long-run frequency logic.
🧩 Step 7: Mini Summary
🧠 What You Learned: Bayesian inference updates beliefs about parameters as new data arrives, balancing prior assumptions with observed evidence.
⚙️ How It Works: Through Bayes’ theorem — the posterior ∝ prior × likelihood. Conjugate priors make this update mathematically elegant.
🎯 Why It Matters: This framework powers adaptive, uncertainty-aware models — vital in ML, from Naïve Bayes to Bayesian Neural Networks.