Probability & Information Theory: Complete Study Guide for Interviews

1) Random variables & basic probability

🎯 Core Idea

Random variables (RVs) map outcomes of uncertain experiments to numbers. Probability measures how likely events (sets of outcomes) are. This is the language used to model uncertainty in ML.

🌱 Intuition & Real-World Analogy

  • Think of a random variable as a labeled spinner: the spinner (experiment) yields a number (RV value).
  • Probability is the β€œweight” you assign to each spinner wedge: larger wedge β†’ more likely value.

πŸ“ Mathematical Foundation

  • A discrete RV $X$ takes values $x\in\mathcal{X}$ with probability mass function (pmf) $P(X=x)=p(x)$, $p(x)\ge0$, $\sum_{x} p(x)=1$.

  • For continuous RVs, probability density function (pdf) $p(x)$ with $p(x)\ge0$ and $\int p(x)\,dx=1$. (Note: densities are not probabilities; integrals over sets are.)

  • Expectation:

    $$ \mathbb{E}[X] = \begin{cases} \sum_x x\,p(x) & (\text{discrete})\\[4pt] \int x\,p(x)\,dx & (\text{continuous}) \end{cases} $$
  • Variance:

    $$ \mathrm{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-\mathbb{E}[X]^2. $$

Assumptions: choice of discrete vs continuous model; existence of integrals/sums; measurability implicit.
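
A minimal NumPy sketch of these two formulas for a discrete RV (the support and pmf values below are illustrative):

```python
import numpy as np

# Illustrative discrete RV: a loaded six-sided die.
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
assert np.isclose(pmf.sum(), 1.0) and (pmf >= 0).all()  # valid pmf

mean = np.sum(values * pmf)              # E[X] = sum_x x p(x)
second_moment = np.sum(values**2 * pmf)  # E[X^2]
var = second_moment - mean**2            # Var(X) = E[X^2] - E[X]^2

print(f"E[X] = {mean:.3f}, Var(X) = {var:.3f}")
```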

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: compact, rigorous framework for uncertainty.
  • Limitation: real data may violate modeling choices (e.g., discrete approximation for continuous data).
  • Trade-off: model simplicity vs fidelity (e.g., Gaussian assumption is tractable but may misrepresent heavy tails).

πŸ” Variants & Extensions

  • Mixed discrete–continuous distributions, multivariate distributions, stochastic processes.

🚧 Common Challenges & Pitfalls

  • Confusing pdf values with probabilities.
  • Assuming independence incorrectly.
  • Not checking existence of moments (expectation/variance).

πŸ“š Reference Pointers


2) Joint, marginal, conditional probabilities & independence

🎯 Core Idea

Joint $p(x,y)$ describes two RVs together; marginal $p(x)$ is the distribution of one RV irrespective of others; conditional $p(y\mid x)$ describes one RV given another. Independence means $p(x,y)=p(x)p(y)$.

🌱 Intuition & Real-World Analogy

  • Joint = full seating chart for a party (who sits with whom).
  • Marginal = how many people are at the party irrespective of seating.
  • Conditional = probability someone sits at table A given they like jazz.

πŸ“ Mathematical Foundation

  • Joint: $p(x,y)$.

  • Marginalization:

    $$ p(x) = \sum_y p(x,y)\quad(\text{discrete}),\qquad p(x)=\int p(x,y)\,dy\quad(\text{continuous}). $$
  • Conditional:

    $$ p(y\mid x)=\frac{p(x,y)}{p(x)}\quad\text{if } p(x)>0. $$
  • Chain rule (factorization):

    $$ p(x_1,\dots,x_n)=p(x_1)\,p(x_2\mid x_1)\,p(x_3\mid x_1,x_2)\cdots p(x_n\mid x_1,\dots,x_{n-1}). $$
  • Independence: $X\perp Y \iff p(x,y)=p(x)p(y)$.

Assumptions: denominators nonzero for conditioning; factorization valid for any joint distribution.
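
A minimal NumPy sketch of marginalization, conditioning, and an independence check on a 2×2 joint table (the probabilities are illustrative):

```python
import numpy as np

# Joint table p(x, y) for two binary RVs; rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)             # marginalize out y: p(x) = sum_y p(x,y)
p_y = p_xy.sum(axis=0)             # marginalize out x
p_y_given_x = p_xy / p_x[:, None]  # conditional: p(y|x) = p(x,y) / p(x)

independent = np.allclose(p_xy, np.outer(p_x, p_y))  # X ⊥ Y iff p(x,y)=p(x)p(y)
print("p(x) =", p_x, " p(y) =", p_y)
print("p(y|x) =\n", p_y_given_x)
print("independent?", independent)
```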

βš–οΈ Strengths, Limitations & Trade-offs

  • Factorizations let you build models (PGMs).
  • But factorization choices encode assumptions β€” e.g., conditional independence assumptions can massively simplify models but risk misspecification.

πŸ” Variants & Extensions

  • Conditional independence, d-separation (in graphical models), Markov properties.

🚧 Common Challenges & Pitfalls

  • Mistaking uncorrelated for independent (the two coincide only in special cases, e.g., jointly Gaussian variables).
  • Marginalizing over the wrong variable, or in the wrong order for structured models.

πŸ“š Reference Pointers

  • Koller & Friedman (PGM) β€” conditional independence and factorization.
  • Wikipedia entries for marginalization and conditional probability.

3) Entropy

🎯 Core Idea

Entropy $H(p)$ measures the average uncertainty (surprise) of a distribution $p$. Higher entropy β†’ more uncertainty.

🌱 Intuition & Real-World Analogy

  • Entropy β‰ˆ average number of yes/no questions needed to identify an outcome.
  • Analogy: a perfectly fair coin (max entropy for binary) vs a loaded coin (lower entropy).

πŸ“ Mathematical Foundation

For a discrete distribution $p(x)$:

$$ H(p) = -\sum_{x} p(x)\log p(x). $$

For continuous RVs, differential entropy $h(p)= -\int p(x)\log p(x)\,dx$ (has different properties).

Breakdown:

  • $p(x)$: probability of outcome $x$.
  • $-\log p(x)$: the surprise of outcome $x$ (rarer outcomes are more surprising).
  • Taking the expectation of the surprise under $p$ gives the average surprise, i.e., the entropy.

Properties:

  • $H(p)\ge0$ (discrete).
  • Maximum entropy for fixed support is achieved by uniform $p$.
  • Additivity for independent RVs: $H(X,Y)=H(X)+H(Y)$ if independent.

Assumptions: base of log (bits: base 2; nats: natural log) β€” convert accordingly.
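
A minimal NumPy sketch of the discrete entropy formula, illustrating the fair-coin maximum and the choice of log base:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete pmf; 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([0.5, 0.5]))             # fair coin: 1 bit (max for binary)
print(entropy([0.9, 0.1]))             # loaded coin: ~0.469 bits
print(entropy([0.25] * 4))             # uniform over 4 outcomes: 2 bits
print(entropy([0.5, 0.5], base=np.e))  # same fair coin in nats: ~0.693
```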

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: fundamental scalar summary of uncertainty, used in coding and ML.
  • Limitation: single number β€” loses structure (doesn’t show which events cause uncertainty).
  • Trade-off: easy to compute for discrete small supports; in high-dimensional continuous settings, estimation is hard.

πŸ” Variants & Extensions

  • Conditional entropy $H(Y\mid X)$, joint entropy $H(X,Y)$, mutual information $I(X;Y)=H(X)-H(X\mid Y)$.

🚧 Common Challenges & Pitfalls

  • Confusing differential entropy with discrete entropy (differential can be negative).
  • Misinterpreting entropy as ‘complexity’ of data distribution without context.

πŸ“š Reference Pointers


4) Cross-entropy

🎯 Core Idea

Cross-entropy $H(p,q)$ measures the average number of bits needed to encode samples from true distribution $p$ using a code optimized for distribution $q$. It’s the expected negative log-likelihood under model $q$ when data comes from $p$.

🌱 Intuition & Real-World Analogy

  • Using a wrong spelling dictionary (model $q$) to compress a language produced by someone with a different dialect (true $p$) β€” you pay extra bits on average.

πŸ“ Mathematical Foundation

Discrete case:

$$ H(p,q) = -\sum_x p(x)\log q(x). $$

Relation to entropy and KL:

$$ H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q). $$

Breakdown:

  • $p(x)$: true distribution of data.
  • $q(x)$: model distribution (parameterized by network).
  • $-\log q(x)$: code length assigned by model $q$ to outcome $x$.
  • Expectation under $p$ gives average code length (cross-entropy).

Assumptions: $q(x)>0$ whenever $p(x)>0$ to avoid infinite cross-entropy.
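
A minimal NumPy sketch of discrete cross-entropy and the identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ (the distributions are illustrative):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p,q) = -sum_x p(x) log q(x), in nats; assumes q>0 wherever p>0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

H_p = cross_entropy(p, p)      # entropy H(p) = H(p,p)
H_pq = cross_entropy(p, q)
print(f"H(p) = {H_p:.4f} nats, H(p,q) = {H_pq:.4f} nats")
print(f"KL(p||q) = H(p,q) - H(p) = {H_pq - H_p:.4f} >= 0")
```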

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: directly ties to average negative log-likelihood (ML objective).
  • Limitation: depends on true entropy $H(p)$ which is unknown; training minimizes $H(p,q)$ via samples, equivalently minimizing KL.
  • Trade-off: optimizing cross-entropy doesn’t guarantee calibration (models can be overconfident).

πŸ” Variants & Extensions

  • Categorical cross-entropy (multiclass), binary cross-entropy (Bernoulli), negative log-likelihood for continuous distributions (e.g., Gaussian NLL).

🚧 Common Challenges & Pitfalls

  • Numerical instability when $q(x)$ is extremely small β†’ use log-sum-exp tricks.
  • Interpreting cross-entropy improvements without considering class imbalance or baseline entropy.

πŸ“š Reference Pointers


5) Kullback–Leibler (KL) divergence

🎯 Core Idea

KL divergence $D_{\mathrm{KL}}(p\|q)$ measures how a distribution $p$ differs from a reference distribution $q$. It is the expected extra log-loss incurred when using $q$ in place of the true $p$.

🌱 Intuition & Real-World Analogy

  • KL is the extra number of bits per message you pay when compressing samples from $p$ using code for $q$.
  • Analogy: driving route $q$ vs optimal route $p$ β€” KL measures extra travel time expected if you follow $q$.

πŸ“ Mathematical Foundation

Discrete:

$$ D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)} = \mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right]. $$

Properties:

  • $D_{\mathrm{KL}}(p\|q)\ge0$ (Gibbs’ inequality), with equality iff $p=q$ almost everywhere.
  • Asymmetric: $D_{\mathrm{KL}}(p\|q)\ne D_{\mathrm{KL}}(q\|p)$ in general.

Relation to entropy:

$$ D_{\mathrm{KL}}(p\|q) = H(p,q) - H(p). $$

Assumptions: $q(x)>0$ wherever $p(x)>0$; otherwise $D_{\mathrm{KL}}(p\|q)=\infty$.
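
A minimal NumPy sketch showing the asymmetry of KL and the support-mismatch infinity (the distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """D_KL(p||q) in nats; infinite if q=0 where p>0 (support mismatch)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return np.inf
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.8, 0.15, 0.05])
q = np.array([0.4, 0.4, 0.2])

print(f"KL(p||q) = {kl(p, q):.4f}")  # forward
print(f"KL(q||p) = {kl(q, p):.4f}")  # reverse: generally a different number
print("support mismatch:", kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf
```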

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: interpretable expected log-loss; tractable gradients for ML.
  • Limitation: not a metric (asymmetric), infinite when supports mismatch.
  • Trade-off: choosing direction matters (mode-covering vs mode-seeking behaviors in approximations).

πŸ” Variants & Extensions

  • Reverse KL $D_{\mathrm{KL}}(q\|p)$.
  • Jensen–Shannon divergence (symmetrized, finite).
  • RΓ©nyi divergence family.

🚧 Common Challenges & Pitfalls

  • Using $D_{\mathrm{KL}}(q\|p)$ vs $D_{\mathrm{KL}}(p\|q)$ leads to very different behavior in approximation tasks: the reverse KL $D_{\mathrm{KL}}(q\|p)$ minimized in variational inference is mode-seeking (zero-forcing), while the forward KL $D_{\mathrm{KL}}(p\|q)$ is mode-covering (mass-spreading); expectation propagation uses other divergences.
  • Ignoring support mismatch leads to infinities.

πŸ“š Reference Pointers

  • Cover & Thomas (information theory).
  • Wikipedia: Kullback–Leibler divergence.

6) Why cross-entropy / negative log-likelihood (NLL) is the natural loss for classification

🎯 Core Idea

Minimizing cross-entropy (or equivalently KL divergence from empirical data to model) equals maximizing the likelihood of observed labels under the model β€” which is the statistically natural objective if we assume data are i.i.d. samples from an unknown true distribution.

🌱 Intuition & Real-World Analogy

  • If your model $q$ assigns high probability to the labels that actually occur, the average negative log probability (cross-entropy) is low β€” meaning your model predicts what really happens.
  • Analogy: training is like teaching a weather forecaster (model) to assign high probability to events that actually occur; you punish them proportionally to how unlikely they made the event.

πŸ“ Mathematical Foundation

Let dataset

$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^N$$

be i.i.d. from true joint $p_{\text{data}}(x,y)$. We model $q_\theta(y\mid x)$ (classification). The (empirical) negative log-likelihood:

$$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log q_\theta(y_i\mid x_i). $$

As $N\to\infty$, empirical average approximates expectation under $p_{\text{data}}$:

$$ \mathcal{L}(\theta) \xrightarrow{N\to\infty} \mathbb{E}_{(x,y)\sim p_{\text{data}}}[-\log q_\theta(y\mid x)] = H\big(p_{\text{data}}(Y\mid X), q_\theta(Y\mid X)\big). $$

Using identity:

$$ H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q), $$

we get

$$ \mathbb{E}[-\log q_\theta] = H\big(p_{\text{data}}(Y\mid X)\big) + \mathbb{E}_{x}\big[D_{\mathrm{KL}}(p_{\text{data}}(Y\mid x)\,\|\,q_\theta(Y\mid x))\big]. $$

Since $H(p_{\text{data}})$ is fixed (independent of $\theta$), minimizing $\mathcal{L}(\theta)$ is equivalent to minimizing the expected KL divergence between true conditional distribution and model β€” i.e., making $q_\theta$ match the true conditional distribution. That is the statistical justification.

Derivation steps (compact):

  1. Empirical NLL:
$$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_i \log q_\theta(y_i\mid x_i). $$
  2. Take the expectation under the true distribution (population loss):
$$ \mathcal{L}(\theta) \approx \mathbb{E}_{p_{\text{data}}(x,y)}[-\log q_\theta(y\mid x)]. $$
  3. Recognize this as the conditional cross-entropy $H(p_{\text{data}}(Y\mid X), q_\theta(Y\mid X))$.

  4. Use the identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ applied per $x$, averaged over $x$.

Therefore minimizing $\mathcal{L}$ minimizes the expected KL divergence; maximum likelihood is therefore the right objective if you want $q_\theta$ close to the true conditional distribution.
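
A minimal NumPy sketch of the empirical NLL above for a toy classifier (the labels and predicted probabilities are illustrative, not from a trained model):

```python
import numpy as np

y = np.array([0, 2, 1, 0])             # true class indices, N=4, K=3
q = np.array([[0.7, 0.2, 0.1],         # q_theta(y|x_i): one row per example
              [0.1, 0.3, 0.6],
              [0.2, 0.6, 0.2],
              [0.9, 0.05, 0.05]])

# L(theta) = -(1/N) sum_i log q_theta(y_i | x_i)
nll = -np.mean(np.log(q[np.arange(len(y)), y]))
print(f"empirical NLL (mean cross-entropy) = {nll:.4f} nats")
```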

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: principled (MLE), convex in linear models with exponential family outputs (e.g., logistic regression), yields probabilistic outputs.
  • Limitation: MLE can overfit (hence regularization needed); cross-entropy does not directly optimize calibrated probabilities unless model is well-specified.
  • Trade-off: optimizing NLL focuses on probabilistic correctness (minimizing KL), while other objectives (e.g., accuracy, F1) optimize different downstream metrics β€” you may need to trade proper probabilistic loss for task-specific metrics.

πŸ” Variants & Extensions

  • For binary classification: binary cross-entropy = Bernoulli NLL.
  • For multiclass: categorical cross-entropy with softmax output.
  • Label smoothing: replace one-hot targets with smoothed targets to regularize (reduces overconfidence).
  • Focal loss: reweighting for class imbalance β€” not ML-consistent but practical.

🚧 Common Challenges & Pitfalls

  • Using cross-entropy but evaluating only accuracy can hide calibration issues.
  • Softmax + NLL encourages probability mass on the true class but can be overconfident; label smoothing helps.
  • Support mismatch: if model assigns zero prob to true label, loss is infinite (numerical issues).
  • Class imbalance: minimizing NLL may focus on dominant classes β†’ consider reweighting or alternative objectives.

πŸ“š Reference Pointers

  • Bishop, PRML: connection between likelihood and loss.
  • Goodfellow et al., Deep Learning: training deep models and loss functions. https://www.deeplearningbook.org

7) Softmax + Categorical cross-entropy: exact derivation used in classifiers

🎯 Core Idea

For $K$-class classification, model outputs logits $z_k$; softmax converts logits to probabilities $q(k\mid x)=\frac{e^{z_k}}{\sum_j e^{z_j}}$. The categorical cross-entropy (CCE) is the negative log-probability assigned to the true class.

🌱 Intuition & Real-World Analogy

  • Softmax is a smooth β€œargmax” that produces a probability distribution; CCE penalizes how unlikely the model made the actual class.

πŸ“ Mathematical Foundation

Softmax:

$$ q_\theta(y=k\mid x)=\mathrm{softmax}(z)_k=\frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}. $$

If true label is represented as one-hot vector $y$ (with 1 at $k^*$), categorical cross-entropy per example:

$$ \ell(\theta) = -\sum_{k=1}^K y_k \log q_\theta(k\mid x) = -\log q_\theta(k^*\mid x). $$

Gradient w.r.t. logits $z$ (useful to know intuitively):

$$ \frac{\partial \ell}{\partial z_j} = q_\theta(j\mid x) - y_j. $$

This clean gradient explains why softmax+CCE is convenient for backpropagation.

Assumptions: one-hot true labels (or one can use soft labels).
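
A minimal NumPy sketch that verifies the gradient formula $\partial\ell/\partial z_j = q_j - y_j$ against a central-difference numerical gradient (the logits are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stable: subtract max logit first
    e = np.exp(z)
    return e / e.sum()

def cce_loss(z, k_star):
    """-log softmax(z)[k_star], computed stably via log-sum-exp."""
    m = z.max()
    return -(z[k_star] - m - np.log(np.exp(z - m).sum()))

z = np.array([2.0, -1.0, 0.5])  # logits, K=3
k_star = 0                      # true class
y = np.eye(3)[k_star]           # one-hot target

analytic = softmax(z) - y       # gradient: dl/dz = q - y
numeric = np.array([(cce_loss(z + 1e-5 * np.eye(3)[j], k_star)
                     - cce_loss(z - 1e-5 * np.eye(3)[j], k_star)) / 2e-5
                    for j in range(3)])
print("analytic:", analytic)
print("numeric: ", numeric)     # should agree to high precision
```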

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: numerically/analytically convenient gradients; yields probabilistic outputs.
  • Limitation: requires a numerically stable implementation (use log-sum-exp); softmax outputs are dense probability vectors (no exact zeros) unless numerical underflow occurs.
  • Trade-off: softmax mixes logits β€” large margin alternatives (e.g., hinge loss) behave differently.

πŸ” Variants & Extensions

  • Temperature scaling on softmax: $\mathrm{softmax}(z/T)$ for calibration.
  • Sparsemax (produces sparse probabilities).
  • Margin softmax (ArcFace, CosFace) for face recognition tasks (add margin to logits).

🚧 Common Challenges & Pitfalls

  • Not using stable softmax implementations β†’ overflow/underflow.
  • Ignoring temperature or label smoothing effects on calibration.

πŸ“š Reference Pointers

  • Goodfellow et al., Deep Learning β€” softmax and cross-entropy section.

8) Mutual information & conditional variants (brief)

🎯 Core Idea

Mutual information $I(X;Y)$ quantifies the reduction in uncertainty of $X$ given knowledge of $Y$: $I(X;Y)=H(X)-H(X\mid Y)$.

🌱 Intuition & Real-World Analogy

  • How much knowing the weather forecast reduces your uncertainty about the chance of carrying an umbrella.

πŸ“ Mathematical Foundation

$$ I(X;Y)=\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = D_{\mathrm{KL}}(p(x,y)\,\|\,p(x)p(y)). $$
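
A minimal NumPy sketch computing $I(X;Y)$ from a joint table (the same illustrative 2×2 table as in section 2):

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x)p(y)) ] = KL(p(x,y) || p(x)p(y))
ratio = p_xy / np.outer(p_x, p_y)
mi = np.sum(p_xy * np.log(ratio))
print(f"I(X;Y) = {mi:.4f} nats")  # 0 iff X and Y are independent
```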

βš–οΈ Strengths, Limitations & Trade-offs

  • Strength: model-agnostic dependency measure; symmetric.
  • Limitation: hard to estimate reliably in high dimensions from finite data.

πŸ” Variants & Extensions

  • Conditional mutual information $I(X;Y\mid Z)$.
  • Use in representation learning (InfoMax, contrastive learning objectives approximate MI).

πŸ“š Reference Pointers

  • Cover & Thomas; InfoMax literature.

9) Practical considerations, numerical stability & ML engineering tips

  • Always compute losses in log-space where possible; use the log-sum-exp trick (a stable implementation sketch follows this list):

    $$ \log\sum_j e^{z_j} = m + \log\sum_j e^{z_j-m},\quad m=\max_j z_j. $$
  • Compute gradients: for softmax + CCE, $\nabla_z \ell = q - y$ β€” implement stable softmax then apply this gradient formula.

  • Use label smoothing to avoid extreme confidence; it is equivalent to mixing the one-hot target with a uniform distribution, which regularizes the model and reduces overconfidence.

  • For class imbalance, consider class weights or focal loss.

  • For calibration, consider temperature scaling (post-hoc) or regularization/meta-losses.
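
Putting several of these tips together, here is a minimal NumPy sketch of a numerically stable batched softmax cross-entropy with optional label smoothing (the smoothing convention used, mixing the one-hot target with the uniform distribution, is one common choice):

```python
import numpy as np

def stable_cross_entropy(logits, labels, smoothing=0.0):
    """Batch CCE from logits via log-sum-exp, with optional label smoothing.
    logits: (N, K) array; labels: (N,) integer class indices."""
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))  # logsumexp
    log_q = logits - lse                     # log softmax, no overflow/underflow
    n, k = logits.shape
    targets = np.full((n, k), smoothing / k)             # uniform component...
    targets[np.arange(n), labels] += 1.0 - smoothing     # ...plus true-class mass
    return -(targets * log_q).sum(axis=1).mean()

logits = np.array([[5.0, 1.0, -2.0],
                   [0.1, 0.2, 0.3]])  # illustrative logits
labels = np.array([0, 2])
print(stable_cross_entropy(logits, labels))
print(stable_cross_entropy(logits, labels, smoothing=0.1))
```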


10) Common interview pitfalls & things to emphasize when explaining

  • Be explicit about assumptions: i.i.d. data, support overlap (no zeros in model where data exists), log base (nats vs bits).
  • Distinguish cross-entropy (expected negative log under $q$) vs entropy (intrinsic to data) vs KL (gap between them).
  • Explain why minimizing cross-entropy = minimizing KL which equals making model close to true conditional; link to MLE.
  • Mention directionality of KL and consequences (mode-seeking vs mode-covering).
  • Talk about numerical stability and how to implement softmax + CCE safely.
  • Discuss practical regularization (label smoothing, temperature scaling) and how they affect calibration and generalization.

11) Short list of important formulas (single page summary)

  • Entropy: $H(p) = -\sum_x p(x)\log p(x)$.
  • Cross-entropy: $H(p,q) = -\sum_x p(x)\log q(x)$.
  • KL divergence: $D_{\mathrm{KL}}(p\|q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$.
  • Relation: $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$.
  • Softmax: $q(k)=\frac{e^{z_k}}{\sum_j e^{z_j}}$.
  • Categorical cross-entropy (one-hot): $\ell=-\log q(k^*)$.
  • Gradient: $\frac{\partial \ell}{\partial z_j} = q(j)-y_j$.

Final quick checklist (for interview answers)

  • Define entropy, cross-entropy, KL with formulas and interpretations.
  • Show identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$ and explain implications.
  • Derive softmax + CCE gradient $q-y$ (short derivation is fine).
  • Relate cross-entropy minimization to MLE and expected KL minimization.
  • Mention numerical stability and label smoothing as practical fixes.
