p-values and Confidence Intervals: Linear Regression
🎯 Core Idea
- P-values and confidence intervals (CIs) quantify uncertainty about regression coefficients.
- A p-value measures how surprising the observed coefficient (or a more extreme one) is under a null hypothesis (usually that the coefficient = 0).
- A confidence interval gives a range of coefficient values consistent with the data at a chosen confidence level (e.g., 95%).
- In regression, they help decide whether an estimated predictor effect is distinguishable from sampling noise and provide a range of plausible effect sizes.
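A minimal sketch of how this shows up in practice, assuming `numpy` and `statsmodels` are installed (the data below are simulated, so the particular numbers are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1 (true slope 2.0) but not on x2.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # design matrix with intercept
res = sm.OLS(y, X).fit()

print(res.params)          # point estimates (beta-hat)
print(res.pvalues)         # two-sided p-values for H0: beta_j = 0
print(res.conf_int(0.05))  # 95% confidence intervals for each coefficient
```

Typically the x1 coefficient comes out near 2 with a tiny p-value and a CI well away from zero, while the x2 coefficient hovers near 0 with a large p-value and a CI that straddles zero.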
🌱 Intuition & Real-World Analogy
- Why we use them: When we fit a regression, the point estimate (β̂) is noisy because our data are one sample from a larger population. P-values & CIs quantify that sampling noise so we can reason about whether the signal is real and how large it might be.
- Analogy 1 — Fishing with noisy sonar: Suppose β̂ is the blip you see on sonar. The p-value answers: “If there were no fish, how often would you see a blip at least this strong?” The CI answers: “Given the blip and noise, between what depths could that fish plausibly be?”
- Analogy 2 — Weather forecast: A point estimate is like “it will be 25°C”; a CI is like “it will be 23–27°C with 95% confidence.” The p-value is like asking “how surprising would 25°C be if the long-term average were 20°C?”
📐 Mathematical Foundation
Let the linear model be
$$ y = X\beta + \varepsilon,\qquad \varepsilon \sim (0,\,\sigma^2 I), $$
with $X\in\mathbb{R}^{n\times p}$ (full rank assumed).

OLS estimator:
$$ \hat\beta = (X^\top X)^{-1}X^\top y. $$

Variance of the estimator:
$$ \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. $$

Estimated residual variance:
$$ \hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2. $$

Standard error for coefficient $j$:
$$ \mathrm{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[ (X^\top X)^{-1} ]_{jj}}. $$

t-statistic (for testing $H_0: \beta_j = \beta_{j,0}$, commonly $\beta_{j,0}=0$):
$$ t_j = \frac{\hat\beta_j - \beta_{j,0}}{\mathrm{SE}(\hat\beta_j)}. $$
Under the classical assumptions (including Gaussian errors), $t_j \sim t_{n-p}$ under $H_0$; for large $n$, the $t$-distribution is approximately standard normal.

Two-sided p-value:
$$ \text{p-value} = 2\Pr\big( T_{n-p} \ge |t_j| \big), $$
where $T_{n-p}$ denotes a $t_{n-p}$-distributed random variable.

$(1-\alpha)$ confidence interval for $\beta_j$:
$$ \hat\beta_j \pm t_{n-p,\,1-\alpha/2}\,\mathrm{SE}(\hat\beta_j), $$
where $t_{n-p,\,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the $t$-distribution with $n-p$ degrees of freedom.

Derivation sketch (why the CI has that form): under the assumptions, $(\hat\beta_j - \beta_j)/\mathrm{SE}(\hat\beta_j)$ follows a $t_{n-p}$ distribution, so
$$ \Pr\Big( -t_{n-p,\,1-\alpha/2} \le \frac{\hat\beta_j-\beta_j}{\mathrm{SE}(\hat\beta_j)} \le t_{n-p,\,1-\alpha/2} \Big) = 1-\alpha, $$
which rearranges to the CI expression.
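These formulas can be verified directly. A minimal sketch, assuming `numpy`, `scipy`, and `statsmodels` are installed: it computes $\hat\beta$, $\mathrm{SE}(\hat\beta_j)$, $t_j$, the two-sided p-value, and the 95% CI from the expressions above, then cross-checks them against statsmodels' OLS output.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3                      # n observations, p columns incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# OLS estimate: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance with n - p degrees of freedom
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Standard errors, t-statistics, two-sided p-values, 95% CIs
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stat = beta_hat / se                           # H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p)
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# Cross-check against statsmodels
res = sm.OLS(y, X).fit()
assert np.allclose(beta_hat, res.params)
assert np.allclose(se, res.bse)
assert np.allclose(p_val, res.pvalues)
assert np.allclose(ci, res.conf_int(alpha=0.05))
print("manual and statsmodels inference agree")
```

The agreement check makes explicit that, with the default (nonrobust) covariance, these textbook formulas are exactly what `statsmodels` reports in its summary table.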
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Direct uncertainty quantification for parameter estimates — helps assess evidence and estimate ranges for effect sizes.
- Closed-form in OLS under classical assumptions — computationally cheap and interpretable.
- Well-understood sampling properties (e.g., coverage probability of CI under model assumptions).
Limitations & trade-offs
- Dependence on assumptions: Validity requires correct model specification (linearity), independence, homoskedasticity (constant variance), and often normality for small samples. Violations distort SEs and therefore p-values and CI coverage.
- Multiple testing: In models with many predictors, naïve p-values inflate false positives unless corrected (Bonferroni, FDR, etc.).
- Practical vs statistical significance: Very small effects can have tiny p-values in huge datasets but be irrelevant in practice. Conversely, meaningful effects may have non-significant p-values in small samples (see the sketch after this list).
- Model selection bias: Doing variable selection (stepwise, peeking, p-hacking) invalidates standard p-values and CIs — they no longer reflect the true post-selection uncertainty.
- Asymptotic reliance: For complex models or small samples, asymptotic approximations (normality) may be poor.
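To make the large-$n$ point concrete, here is a tiny simulation (illustrative, made-up numbers, assuming `statsmodels` is installed): a slope of 0.01 that is practically negligible still earns an extremely small p-value once $n$ is in the millions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
# True slope of 0.01: real, but practically negligible in most settings.
y = 0.01 * x + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
print(f"slope = {res.params[1]:.4f}, "
      f"p-value = {res.pvalues[1]:.2e}, R^2 = {res.rsquared:.5f}")
# Typically the p-value is astronomically small even though R^2 is ~0.0001.
```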
🔍 Variants & Extensions
- Heteroskedasticity-robust SEs (White SEs): Adjust standard errors so inference remains valid under heteroskedastic errors. Useful when $\operatorname{Var}(\varepsilon_i)$ is not constant (a short sketch after this list shows robust SEs and a bootstrap CI).
- Clustered SEs: Account for within-cluster correlation (e.g., panel data, grouped data) by replacing the i.i.d. variance estimator with a block-structured, cluster-robust one.
- Bootstrap CIs / p-values: Nonparametric bootstrap or percentile/t-bootstrap for complex estimators or small samples — relaxes distributional assumptions.
- Likelihood-based intervals (Wald, Score, Likelihood-ratio): In generalized linear models (GLMs) and maximum likelihood settings, alternatives to Wald-based CIs exist; LR and score tests often have better small-sample properties.
- Bayesian credible intervals: Provide posterior intervals for parameters — interpreted differently (probability parameter lies in interval conditional on data and prior).
- Multiple-testing corrections (Bonferroni, Benjamini–Hochberg and other FDR procedures): When testing many coefficients, control the family-wise error rate or the false discovery rate.
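A minimal sketch of two of these options, assuming `statsmodels` is installed and using simulated heteroskedastic data: heteroskedasticity-robust (HC3) standard errors via the `cov_type` argument, and a percentile bootstrap CI for the slope (a pairs bootstrap that resamples rows; other bootstrap schemes exist).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 3, size=n)
# Heteroskedastic noise: the error variance grows with x.
y = 1.0 + 0.5 * x + rng.normal(scale=0.5 + x, size=n)
X = sm.add_constant(x)

# Classical vs. heteroskedasticity-robust (HC3) standard errors
res_classic = sm.OLS(y, X).fit()
res_robust = sm.OLS(y, X).fit(cov_type="HC3")
print("classic SE:", res_classic.bse[1], " robust SE:", res_robust.bse[1])

# Percentile bootstrap CI for the slope (pairs bootstrap: resample rows)
B = 2000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = sm.OLS(y[idx], X[idx]).fit().params[1]
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% percentile-bootstrap CI for slope: ({lo:.3f}, {hi:.3f})")
```

With heteroskedastic noise like this, the robust SE is usually noticeably larger than the classical one, which is exactly the gap that makes naive CIs too narrow.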
🚧 Common Challenges & Pitfalls
- Misinterpreting p-values
  - Wrong: “p = 0.03 means there’s a 3% chance the null hypothesis is true.”
  - Right: “Given the null hypothesis is true, the probability of observing data at least as extreme as we did is 3%.”
  - Implication: p-values do not give the probability that the effect is real.
- Equating statistical significance with importance
  - With large $n$, even negligible effects become statistically significant. Always check effect sizes and CIs.
- Ignoring model mis-specification
  - Nonlinearity, omitted variables, or correlated errors bias $\hat\beta$ and make SEs meaningless. Diagnostic checks (residual plots, specification tests) are essential.
- Using standard SEs with heteroskedasticity or clustering
  - Leads to wrong p-values and too-narrow CIs. Use robust or clustered SEs when appropriate.
- Selection / “p-hacking” / post-selection inference
  - Running many models and reporting the significant ones inflates false positives. Post-selection inference or held-out validation is needed.
- Interpreting CIs incorrectly
  - Wrong: “There is a 95% probability that β is in this interval.”
  - Right (frequentist): Repeating the sampling process many times, 95% of similarly constructed intervals would contain the true β (the coverage simulation after this list makes this concrete).
- Overreliance on the p < 0.05 threshold
  - Thresholds are arbitrary. Prefer continuous interpretation (strength of evidence) and report exact p-values and CIs.
- Collinearity
  - Multicollinearity inflates SEs, producing wide CIs and unstable p-values despite possibly large joint predictive power.
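To make the frequentist reading of a CI concrete, here is a minimal coverage simulation (simulated data, arbitrary true slope): repeat the sampling process many times, build a 95% CI each time, and count how often the interval contains the true β. Empirical coverage should land near 0.95.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
beta_true = 2.0            # arbitrary "true" slope for the simulation
n, n_sims = 50, 5000
covered = 0

for _ in range(n_sims):
    x = rng.normal(size=n)
    y = 1.0 + beta_true * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])

    # OLS fit and standard error for the slope
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    se_slope = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])

    # 95% CI for the slope; check whether it captures beta_true
    t_crit = stats.t.ppf(0.975, df=n - 2)
    lo, hi = beta_hat[1] - t_crit * se_slope, beta_hat[1] + t_crit * se_slope
    covered += (lo <= beta_true <= hi)

print(f"empirical coverage: {covered / n_sims:.3f}  (nominal: 0.950)")
```

Changing the nominal level, or breaking an assumption (e.g., heteroskedastic noise) and rerunning the simulation, is an easy way to see when coverage stops matching the nominal rate.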
🧭 Practical Guidance & Interview-Level Talking Points (concise)
- Always report both p-values and confidence intervals: p-values summarize the strength of evidence, while CIs show the range of plausible effect sizes.
- For real-world model assessment, emphasize stability (bootstrap or cross-validation of coefficients), predictive performance (holdout metrics), and robust SEs rather than blind reliance on significance.
- If dataset is large, interpret significance in the context of effect size and cost/impact. If dataset is small, use bootstrap or likelihood-based methods and be cautious drawing strong conclusions.
- When presenting results, note whether inference is conditional on the model specification. Distinguish causal claims (need identification strategy) from associational inference (standard regression).