Probability & Information Theory: Complete Study Guide for Interviews

1) Random variables & basic probability

🎯 Core Idea

Random variables (RVs) map outcomes of uncertain experiments to numbers. Probability measures how likely events (sets of outcomes) are. This is the language used to model uncertainty in ML.

🌱 Intuition & Real-World Analogy

  • Think of a random variable as a labeled spinner: the spinner (experiment) yields a number (RV value).
  • Probability is the β€œweight” you assign to each spinner wedge: larger wedge β†’ more likely value.

πŸ“ Mathematical Foundation

  • A discrete RV $X$ takes values $x\in\mathcal{X}$ with probability mass function (pmf) $P(X=x)=p(x)$, $p(x)\ge0$, $\sum_{x} p(x)=1$.

  • For continuous RVs, probability density function (pdf) $p(x)$ with $p(x)\ge0$ and $\int p(x)\,dx=1$. (Note: densities are not probabilities; integrals over sets are.)

  • Expectation:

    $$ \mathbb{E}[X] = \begin{cases} \sum_x x\,p(x) & (\text{discrete})\\[4pt] \int x\,p(x)\,dx & (\text{continuous}) \end{cases} $$
  • Variance:

    $$ \mathrm{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-\mathbb{E}[X]^2. $$

Assumptions: choice of discrete vs continuous model; existence of integrals/sums; measurability implicit.
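
A minimal NumPy sketch of these two formulas for a discrete RV (the support and pmf values below are illustrative):

```python
import numpy as np

# Illustrative discrete RV: a loaded six-sided die.
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
assert np.isclose(pmf.sum(), 1.0) and (pmf >= 0).all()  # valid pmf

mean = np.sum(values * pmf)              # E[X] = sum_x x p(x)
second_moment = np.sum(values**2 * pmf)  # E[X^2]
var = second_moment - mean**2            # Var(X) = E[X^2] - E[X]^2

print(f"E[X] = {mean:.3f}, Var(X) = {var:.3f}")
```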

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: compact, rigorous framework for uncertainty.
  • Limitation: real data may violate modeling choices (e.g., discrete approximation for continuous data).
  • Trade-off: model simplicity vs fidelity (e.g., Gaussian assumption is tractable but may misrepresent heavy tails).

πŸ” Variants & Extensions

  • Mixed discrete–continuous distributions, multivariate distributions, stochastic processes.

🚧 Common Challenges & Pitfalls

  • Confusing pdf values with probabilities.
  • Assuming independence incorrectly.
  • Not checking existence of moments (expectation/variance).

πŸ“š Reference Pointers


2) Joint, marginal, conditional probabilities & independence

🎯 Core Idea

Joint $p(x,y)$ describes two RVs together; marginal $p(x)$ is the distribution of one RV irrespective of others; conditional $p(y\mid x)$ describes one RV given another. Independence means $p(x,y)=p(x)p(y)$.

🌱 Intuition & Real-World Analogy

  • Joint = full seating chart for a party (who sits with whom).
  • Marginal = how many people are at the party irrespective of seating.
  • Conditional = probability someone sits at table A given they like jazz.

πŸ“ Mathematical Foundation

  • Joint: $p(x,y)$.

  • Marginalization:

    $$ p(x) = \sum_y p(x,y)\quad(\text{discrete}),\qquad p(x)=\int p(x,y)\,dy\quad(\text{continuous}). $$
  • Conditional:

    $$ p(y\mid x)=\frac{p(x,y)}{p(x)}\quad\text{if } p(x)>0. $$
  • Chain rule (factorization):

    $$ p(x_1,\dots,x_n)=p(x_1)\,p(x_2\mid x_1)\,p(x_3\mid x_1,x_2)\cdots p(x_n\mid x_1,\dots,x_{n-1}). $$
  • Independence: $X\perp Y \iff p(x,y)=p(x)p(y)$.

Assumptions: denominators nonzero for conditioning; factorization valid for any joint distribution.
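
A minimal NumPy sketch of marginalization, conditioning, and an independence check on a 2×2 joint table (the probabilities are illustrative):

```python
import numpy as np

# Joint table p(x, y) for two binary RVs; rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)             # marginalize out y: p(x) = sum_y p(x,y)
p_y = p_xy.sum(axis=0)             # marginalize out x
p_y_given_x = p_xy / p_x[:, None]  # conditional: p(y|x) = p(x,y) / p(x)

independent = np.allclose(p_xy, np.outer(p_x, p_y))  # X ⊥ Y iff p(x,y)=p(x)p(y)
print("p(x) =", p_x, " p(y) =", p_y)
print("p(y|x) =\n", p_y_given_x)
print("independent?", independent)
```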

βš–οΈ Strengths, Limitations & Trade-offs

  • Factorizations let you build models (PGMs).
  • But factorization choices encode assumptions β€” e.g., conditional independence assumptions can massively simplify models but risk misspecification.

πŸ” Variants & Extensions

  • Conditional independence, d-separation (in graphical models), Markov properties.

🚧 Common Challenges & Pitfalls

  • Mistaking uncorrelated for independent (the two coincide only in special cases, e.g., jointly Gaussian variables).
  • Marginalizing over the wrong variable, or in the wrong order for structured models.

πŸ“š Reference Pointers

  • Koller & Friedman (PGM) β€” conditional independence and factorization.
  • Wikipedia entries for marginalization and conditional probability.

3) Entropy

🎯 Core Idea

Entropy $H(p)$ measures the average uncertainty (surprise) of a distribution $p$. Higher entropy β†’ more uncertainty.

🌱 Intuition & Real-World Analogy

  • Entropy β‰ˆ average number of yes/no questions needed to identify an outcome.
  • Analogy: a perfectly fair coin (max entropy for binary) vs a loaded coin (lower entropy).

πŸ“ Mathematical Foundation

For a discrete distribution $p(x)$:

$$ H(p) = -\sum_{x} p(x)\log p(x). $$

For continuous RVs, differential entropy $h(p)= -\int p(x)\log p(x)\,dx$ (has different properties).

Breakdown:

  • $p(x)$: probability of outcome $x$.
  • $-\log p(x)$: the surprise of outcome $x$ (rarer outcomes are more surprising).
  • Taking the expectation of the surprise under $p$ gives the average surprise, i.e., the entropy.

Properties:

  • $H(p)\ge0$ (discrete).
  • Maximum entropy for fixed support is achieved by uniform $p$.
  • Additivity for independent RVs: $H(X,Y)=H(X)+H(Y)$ if independent.

Assumptions: base of log (bits: base 2; nats: natural log) β€” convert accordingly.
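
A minimal NumPy sketch of the discrete entropy formula, illustrating the fair-coin maximum and the choice of log base:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete pmf; 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([0.5, 0.5]))             # fair coin: 1 bit (max for binary)
print(entropy([0.9, 0.1]))             # loaded coin: ~0.469 bits
print(entropy([0.25] * 4))             # uniform over 4 outcomes: 2 bits
print(entropy([0.5, 0.5], base=np.e))  # same fair coin in nats: ~0.693
```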

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: fundamental scalar summary of uncertainty, used in coding and ML.
  • Limitation: single number β€” loses structure (doesn’t show which events cause uncertainty).
  • Trade-off: easy to compute for discrete small supports; in high-dimensional continuous settings, estimation is hard.

πŸ” Variants & Extensions

  • Conditional entropy $H(Y\mid X)$, joint entropy $H(X,Y)$, mutual information $I(X;Y)=H(X)-H(X\mid Y)$.

🚧 Common Challenges & Pitfalls

  • Confusing differential entropy with discrete entropy (differential can be negative).
  • Misinterpreting entropy as ‘complexity’ of data distribution without context.

πŸ“š Reference Pointers


4) Cross-entropy

🎯 Core Idea

Cross-entropy $H(p,q)$ measures the average number of bits needed to encode samples from true distribution $p$ using a code optimized for distribution $q$. It’s the expected negative log-likelihood under model $q$ when data comes from $p$.

🌱 Intuition & Real-World Analogy

  • Using a wrong spelling dictionary (model $q$) to compress a language produced by someone with a different dialect (true $p$) β€” you pay extra bits on average.

πŸ“ Mathematical Foundation

Discrete case:

$$ H(p,q) = -\sum_x p(x)\log q(x). $$

Relation to entropy and KL:

$$ H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q). $$

Breakdown:

  • $p(x)$: true distribution of data.
  • $q(x)$: model distribution (parameterized by network).
  • $-\log q(x)$: code length assigned by model $q$ to outcome $x$.
  • Expectation under $p$ gives average code length (cross-entropy).

Assumptions: $q(x)>0$ whenever $p(x)>0$ to avoid infinite cross-entropy.
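
A minimal NumPy sketch of discrete cross-entropy and the identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ (the distributions are illustrative):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p,q) = -sum_x p(x) log q(x), in nats; assumes q>0 wherever p>0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

H_p = cross_entropy(p, p)      # entropy H(p) = H(p,p)
H_pq = cross_entropy(p, q)
print(f"H(p) = {H_p:.4f} nats, H(p,q) = {H_pq:.4f} nats")
print(f"KL(p||q) = H(p,q) - H(p) = {H_pq - H_p:.4f} >= 0")
```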

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: directly ties to average negative log-likelihood (ML objective).
  • Limitation: depends on true entropy $H(p)$ which is unknown; training minimizes $H(p,q)$ via samples, equivalently minimizing KL.
  • Trade-off: optimizing cross-entropy doesn’t guarantee calibration (models can be overconfident).

πŸ” Variants & Extensions

  • Categorical cross-entropy (multiclass), binary cross-entropy (Bernoulli), negative log-likelihood for continuous distributions (e.g., Gaussian NLL).

🚧 Common Challenges & Pitfalls

  • Numerical instability when $q(x)$ is extremely small β†’ use log-sum-exp tricks.
  • Interpreting cross-entropy improvements without considering class imbalance or baseline entropy.

πŸ“š Reference Pointers


5) Kullback–Leibler (KL) divergence

🎯 Core Idea

KL divergence $D_{\mathrm{KL}}(p\|q)$ measures how a distribution $p$ differs from a reference distribution $q$. It is the expected extra log-loss incurred when using $q$ in place of the true $p$.

🌱 Intuition & Real-World Analogy

  • KL is the extra number of bits per message you pay when compressing samples from $p$ using code for $q$.
  • Analogy: driving route $q$ vs optimal route $p$ β€” KL measures extra travel time expected if you follow $q$.

πŸ“ Mathematical Foundation

Discrete:

$$ D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)} = \mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right]. $$

Properties:

  • $D_{\mathrm{KL}}(p\|q)\ge0$ (Gibbs’ inequality), with equality iff $p=q$ almost everywhere.
  • Asymmetric: $D_{\mathrm{KL}}(p\|q)\ne D_{\mathrm{KL}}(q\|p)$ in general.

Relation to entropy:

$$ D_{\mathrm{KL}}(p\|q) = H(p,q) - H(p). $$

Assumptions: $q(x)>0$ wherever $p(x)>0$; otherwise $D_{\mathrm{KL}}(p\|q)=\infty$.
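
A minimal NumPy sketch showing the asymmetry of KL and the support-mismatch infinity (the distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """D_KL(p||q) in nats; infinite if q=0 where p>0 (support mismatch)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return np.inf
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.8, 0.15, 0.05])
q = np.array([0.4, 0.4, 0.2])

print(f"KL(p||q) = {kl(p, q):.4f}")  # forward
print(f"KL(q||p) = {kl(q, p):.4f}")  # reverse: generally a different number
print("support mismatch:", kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf
```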

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: interpretable expected log-loss; tractable gradients for ML.
  • Limitation: not a metric (asymmetric), infinite when supports mismatch.
  • Trade-off: choosing direction matters (mode-covering vs mode-seeking behaviors in approximations).

πŸ” Variants & Extensions

  • Reverse KL $D_{\mathrm{KL}}(q\|p)$.
  • Jensen–Shannon divergence (symmetrized, finite).
  • RΓ©nyi divergence family.

🚧 Common Challenges & Pitfalls

  • Using $D_{\mathrm{KL}}(q\|p)$ vs $D_{\mathrm{KL}}(p\|q)$ leads to very different behavior in approximation tasks: the reverse KL $D_{\mathrm{KL}}(q\|p)$ minimized in variational inference is mode-seeking (zero-forcing), while the forward KL $D_{\mathrm{KL}}(p\|q)$ is mode-covering (mass-spreading); expectation propagation uses other divergences.
  • Ignoring support mismatch leads to infinities.

πŸ“š Reference Pointers

  • Cover & Thomas (information theory).
  • Wikipedia: Kullback–Leibler divergence.

6) Why cross-entropy / negative log-likelihood (NLL) is the natural loss for classification

🎯 Core Idea

Minimizing cross-entropy (or equivalently KL divergence from empirical data to model) equals maximizing the likelihood of observed labels under the model β€” which is the statistically natural objective if we assume data are i.i.d. samples from an unknown true distribution.

🌱 Intuition & Real-World Analogy

  • If your model $q$ assigns high probability to the labels that actually occur, the average negative log probability (cross-entropy) is low β€” meaning your model predicts what really happens.
  • Analogy: training is like teaching a weather forecaster (model) to assign high probability to events that actually occur; you punish them proportionally to how unlikely they made the event.

πŸ“ Mathematical Foundation

Let dataset

$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^N$$

be i.i.d. from true joint $p_{\text{data}}(x,y)$. We model $q_\theta(y\mid x)$ (classification). The (empirical) negative log-likelihood:

$$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log q_\theta(y_i\mid x_i). $$

As $N\to\infty$, empirical average approximates expectation under $p_{\text{data}}$:

$$ \mathcal{L}(\theta) \xrightarrow{N\to\infty} \mathbb{E}_{(x,y)\sim p_{\text{data}}}[-\log q_\theta(y\mid x)] = H\big(p_{\text{data}}(Y\mid X), q_\theta(Y\mid X)\big). $$

Using identity:

$$ H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q), $$

we get

$$ \mathbb{E}[-\log q_\theta] = H\big(p_{\text{data}}(Y\mid X)\big) + \mathbb{E}_{x}\big[D_{\mathrm{KL}}(p_{\text{data}}(Y\mid x)\,\|\,q_\theta(Y\mid x))\big]. $$

Since $H(p_{\text{data}})$ is fixed (independent of $\theta$), minimizing $\mathcal{L}(\theta)$ is equivalent to minimizing the expected KL divergence between true conditional distribution and model β€” i.e., making $q_\theta$ match the true conditional distribution. That is the statistical justification.

Derivation steps (compact):

  1. Empirical NLL:
$$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_i \log q_\theta(y_i\mid x_i). $$
  2. Take the expectation under the true distribution (population loss):
$$ \mathcal{L}(\theta) \approx \mathbb{E}_{p_{\text{data}}(x,y)}[-\log q_\theta(y\mid x)]. $$
  3. Recognize this as the conditional cross-entropy $H(p_{\text{data}}(Y\mid X), q_\theta(Y\mid X))$.

  4. Use the identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ applied per $x$, averaged over $x$.

Therefore minimizing $\mathcal{L}$ minimizes the expected KL divergence; maximum likelihood is therefore the right objective if you want $q_\theta$ close to the true conditional distribution.
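
A minimal NumPy sketch of the empirical NLL above for a toy classifier (the labels and predicted probabilities are illustrative, not from a trained model):

```python
import numpy as np

y = np.array([0, 2, 1, 0])             # true class indices, N=4, K=3
q = np.array([[0.7, 0.2, 0.1],         # q_theta(y|x_i): one row per example
              [0.1, 0.3, 0.6],
              [0.2, 0.6, 0.2],
              [0.9, 0.05, 0.05]])

# L(theta) = -(1/N) sum_i log q_theta(y_i | x_i)
nll = -np.mean(np.log(q[np.arange(len(y)), y]))
print(f"empirical NLL (mean cross-entropy) = {nll:.4f} nats")
```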

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: principled (MLE), convex in linear models with exponential family outputs (e.g., logistic regression), yields probabilistic outputs.
  • Limitation: MLE can overfit (hence regularization needed); cross-entropy does not directly optimize calibrated probabilities unless model is well-specified.
  • Trade-off: optimizing NLL focuses on probabilistic correctness (minimizing KL), while other objectives (e.g., accuracy, F1) optimize different downstream metrics β€” you may need to trade proper probabilistic loss for task-specific metrics.

πŸ” Variants & Extensions

  • For binary classification: binary cross-entropy = Bernoulli NLL.
  • For multiclass: categorical cross-entropy with softmax output.
  • Label smoothing: replace one-hot targets with smoothed targets to regularize (reduces overconfidence).
  • Focal loss: reweighting for class imbalance β€” not ML-consistent but practical.

🚧 Common Challenges & Pitfalls

  • Using cross-entropy but evaluating only accuracy can hide calibration issues.
  • Softmax + NLL encourages probability mass on the true class but can be overconfident; label smoothing helps.
  • Support mismatch: if model assigns zero prob to true label, loss is infinite (numerical issues).
  • Class imbalance: minimizing NLL may focus on dominant classes β†’ consider reweighting or alternative objectives.

πŸ“š Reference Pointers

  • Bishop, PRML: connection between likelihood and loss.
  • Goodfellow et al., Deep Learning: training deep models and loss functions. https://www.deeplearningbook.org

7) Softmax + Categorical cross-entropy: exact derivation used in classifiers

🎯 Core Idea

For $K$-class classification, model outputs logits $z_k$; softmax converts logits to probabilities $q(k\mid x)=\frac{e^{z_k}}{\sum_j e^{z_j}}$. The categorical cross-entropy (CCE) is the negative log-probability assigned to the true class.

🌱 Intuition & Real-World Analogy

  • Softmax is a smooth β€œargmax” that produces a probability distribution; CCE penalizes how unlikely the model made the actual class.

πŸ“ Mathematical Foundation

Softmax:

$$ q_\theta(y=k\mid x)=\mathrm{softmax}(z)_k=\frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}. $$

If true label is represented as one-hot vector $y$ (with 1 at $k^*$), categorical cross-entropy per example:

$$ \ell(\theta) = -\sum_{k=1}^K y_k \log q_\theta(k\mid x) = -\log q_\theta(k^*\mid x). $$

Gradient w.r.t. logits $z$ (useful to know intuitively):

$$ \frac{\partial \ell}{\partial z_j} = q_\theta(j\mid x) - y_j. $$

This clean gradient explains why softmax+CCE is convenient for backpropagation.

Assumptions: one-hot true labels (or one can use soft labels).
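
A minimal NumPy sketch that verifies the gradient formula $\partial\ell/\partial z_j = q_j - y_j$ against a central-difference numerical gradient (the logits are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stable: subtract max logit first
    e = np.exp(z)
    return e / e.sum()

def cce_loss(z, k_star):
    """-log softmax(z)[k_star], computed stably via log-sum-exp."""
    m = z.max()
    return -(z[k_star] - m - np.log(np.exp(z - m).sum()))

z = np.array([2.0, -1.0, 0.5])  # logits, K=3
k_star = 0                      # true class
y = np.eye(3)[k_star]           # one-hot target

analytic = softmax(z) - y       # gradient: dl/dz = q - y
numeric = np.array([(cce_loss(z + 1e-5 * np.eye(3)[j], k_star)
                     - cce_loss(z - 1e-5 * np.eye(3)[j], k_star)) / 2e-5
                    for j in range(3)])
print("analytic:", analytic)
print("numeric: ", numeric)     # should agree to high precision
```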

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: numerically/analytically convenient gradients; yields probabilistic outputs.
  • Limitation: requires a numerically stable implementation (use log-sum-exp); softmax outputs are dense probability vectors (no exact zeros) unless numerical underflow occurs.
  • Trade-off: softmax mixes logits β€” large margin alternatives (e.g., hinge loss) behave differently.

πŸ” Variants & Extensions

  • Temperature scaling on softmax: $\mathrm{softmax}(z/T)$ for calibration.
  • Sparsemax (produces sparse probabilities).
  • Margin softmax (ArcFace, CosFace) for face recognition tasks (add margin to logits).

🚧 Common Challenges & Pitfalls

  • Not using stable softmax implementations β†’ overflow/underflow.
  • Ignoring temperature or label smoothing effects on calibration.

πŸ“š Reference Pointers

  • Goodfellow et al., Deep Learning β€” softmax and cross-entropy section.

8) Mutual information & conditional variants (brief)

🎯 Core Idea

Mutual information $I(X;Y)$ quantifies the reduction in uncertainty of $X$ given knowledge of $Y$: $I(X;Y)=H(X)-H(X\mid Y)$.

🌱 Intuition & Real-World Analogy

  • How much knowing the weather forecast reduces your uncertainty about the chance of carrying an umbrella.

πŸ“ Mathematical Foundation

$$ I(X;Y)=\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = D_{\mathrm{KL}}(p(x,y)\,\|\,p(x)p(y)). $$
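
A minimal NumPy sketch computing $I(X;Y)$ from a joint table (the same illustrative 2×2 table as in section 2):

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x)p(y)) ] = KL(p(x,y) || p(x)p(y))
ratio = p_xy / np.outer(p_x, p_y)
mi = np.sum(p_xy * np.log(ratio))
print(f"I(X;Y) = {mi:.4f} nats")  # 0 iff X and Y are independent
```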

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: model-agnostic dependency measure; symmetric.
  • Limitation: hard to estimate reliably in high dimensions from finite data.

πŸ” Variants & Extensions

  • Conditional mutual information $I(X;Y\mid Z)$.
  • Use in representation learning (InfoMax, contrastive learning objectives approximate MI).

πŸ“š Reference Pointers

  • Cover & Thomas; InfoMax literature.

9) Practical considerations, numerical stability & ML engineering tips

  • Always compute losses in log-space where possible; use the log-sum-exp trick (a stable implementation sketch follows this list):

    $$ \log\sum_j e^{z_j} = m + \log\sum_j e^{z_j-m},\quad m=\max_j z_j. $$
  • Compute gradients: for softmax + CCE, $\nabla_z \ell = q - y$ β€” implement stable softmax then apply this gradient formula.

  • Use label smoothing to avoid extreme confidence; it is equivalent to mixing the one-hot target with a uniform distribution, which regularizes the model and reduces overconfidence.

  • For class imbalance, consider class weights or focal loss.

  • For calibration, consider temperature scaling (post-hoc) or regularization/meta-losses.
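
Putting several of these tips together, here is a minimal NumPy sketch of a numerically stable batched softmax cross-entropy with optional label smoothing (the smoothing convention used, mixing the one-hot target with the uniform distribution, is one common choice):

```python
import numpy as np

def stable_cross_entropy(logits, labels, smoothing=0.0):
    """Batch CCE from logits via log-sum-exp, with optional label smoothing.
    logits: (N, K) array; labels: (N,) integer class indices."""
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))  # logsumexp
    log_q = logits - lse                     # log softmax, no overflow/underflow
    n, k = logits.shape
    targets = np.full((n, k), smoothing / k)             # uniform component...
    targets[np.arange(n), labels] += 1.0 - smoothing     # ...plus true-class mass
    return -(targets * log_q).sum(axis=1).mean()

logits = np.array([[5.0, 1.0, -2.0],
                   [0.1, 0.2, 0.3]])  # illustrative logits
labels = np.array([0, 2])
print(stable_cross_entropy(logits, labels))
print(stable_cross_entropy(logits, labels, smoothing=0.1))
```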


10) Common interview pitfalls & things to emphasize when explaining

  • Be explicit about assumptions: i.i.d. data, support overlap (no zeros in model where data exists), log base (nats vs bits).
  • Distinguish cross-entropy (expected negative log under $q$) vs entropy (intrinsic to data) vs KL (gap between them).
  • Explain why minimizing cross-entropy = minimizing KL which equals making model close to true conditional; link to MLE.
  • Mention directionality of KL and consequences (mode-seeking vs mode-covering).
  • Talk about numerical stability and how to implement softmax + CCE safely.
  • Discuss practical regularization (label smoothing, temperature scaling) and how they affect calibration and generalization.

11) Short list of important formulas (single page summary)

  • Entropy: $H(p) = -\sum_x p(x)\log p(x)$.
  • Cross-entropy: $H(p,q) = -\sum_x p(x)\log q(x)$.
  • KL divergence: $D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$.
  • Relation: $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$.
  • Softmax: $q(k)=\frac{e^{z_k}}{\sum_j e^{z_j}}$.
  • Categorical cross-entropy (one-hot): $\ell=-\log q(k^*)$.
  • Gradient: $\frac{\partial \ell}{\partial z_j} = q(j)-y_j$.

Final quick checklist (for interview answers)

  • Define entropy, cross-entropy, KL with formulas and interpretations.
  • Show identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ and explain implications.
  • Derive softmax + CCE gradient $q-y$ (short derivation is fine).
  • Relate cross-entropy minimization to MLE and expected KL minimization.
  • Mention numerical stability and label smoothing as practical fixes.
