p-values and Confidence Intervals: Linear Regression
🎯 Core Idea
- P-values and confidence intervals (CIs) quantify uncertainty about regression coefficients.
- A p-value measures how surprising the observed coefficient (or a more extreme one) is under a null hypothesis (usually that the coefficient = 0).
- A confidence interval gives a range of coefficient values consistent with the data at a chosen confidence level (e.g., 95%).
- In regression, they help decide whether an estimated predictor effect is distinguishable from sampling noise and provide a range of plausible effect sizes.
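A minimal sketch of how this shows up in practice, assuming `numpy` and `statsmodels` are installed (the data below are simulated, so the particular numbers are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1 (true slope 2.0) but not on x2.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # design matrix with intercept
res = sm.OLS(y, X).fit()

print(res.params)          # point estimates (beta-hat)
print(res.pvalues)         # two-sided p-values for H0: beta_j = 0
print(res.conf_int(0.05))  # 95% confidence intervals for each coefficient
```

Typically the x1 coefficient comes out near 2 with a tiny p-value and a CI well away from zero, while the x2 coefficient hovers near 0 with a large p-value and a CI that straddles zero.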
🌱 Intuition & Real-World Analogy
- Why we use them: When we fit a regression, the point estimate (β̂) is noisy because our data are one sample from a larger population. P-values & CIs quantify that sampling noise so we can reason about whether the signal is real and how large it might be.
- Analogy 1 — Fishing with noisy sonar: Suppose β̂ is the blip you see on sonar. The p-value answers: “If there were no fish, how often would you see a blip at least this strong?” The CI answers: “Given the blip and noise, between what depths could that fish plausibly be?”
- Analogy 2 — Weather forecast: A point estimate is like “it will be 25°C”; a CI is like “it will be 23–27°C with 95% confidence.” The p-value is like asking “how surprising would 25°C be if the long-term average were 20°C?”
📐 Mathematical Foundation
Let the linear model be
$$ y = X\beta + \varepsilon,\qquad \varepsilon \sim (0,\,\sigma^2 I), $$
with $X\in\mathbb{R}^{n\times p}$ (full rank assumed).

OLS estimator:
$$ \hat\beta = (X^\top X)^{-1}X^\top y. $$

Variance of the estimator:
$$ \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. $$

Estimated residual variance:
$$ \hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2. $$

Standard error for coefficient $j$:
$$ \mathrm{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[ (X^\top X)^{-1} ]_{jj}}. $$

t-statistic (for testing $H_0: \beta_j = \beta_{j,0}$, commonly $\beta_{j,0}=0$):
$$ t_j = \frac{\hat\beta_j - \beta_{j,0}}{\mathrm{SE}(\hat\beta_j)}. $$
Under the classical assumptions (including Gaussian errors), $t_j \sim t_{n-p}$ under $H_0$; for large $n$, the $t$-distribution is approximately standard normal.

Two-sided p-value:
$$ \text{p-value} = 2\Pr\big( T_{n-p} \ge |t_j| \big), $$
where $T_{n-p}$ denotes a $t_{n-p}$-distributed random variable.

$(1-\alpha)$ confidence interval for $\beta_j$:
$$ \hat\beta_j \pm t_{n-p,\,1-\alpha/2}\,\mathrm{SE}(\hat\beta_j), $$
where $t_{n-p,\,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the $t$-distribution with $n-p$ degrees of freedom.

Derivation sketch (why the CI has that form): under the assumptions, $(\hat\beta_j - \beta_j)/\mathrm{SE}(\hat\beta_j)$ follows a $t_{n-p}$ distribution, so
$$ \Pr\Big( -t_{n-p,\,1-\alpha/2} \le \frac{\hat\beta_j-\beta_j}{\mathrm{SE}(\hat\beta_j)} \le t_{n-p,\,1-\alpha/2} \Big) = 1-\alpha, $$
which rearranges to the CI expression.
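These formulas can be verified directly. A minimal sketch, assuming `numpy`, `scipy`, and `statsmodels` are installed: it computes $\hat\beta$, $\mathrm{SE}(\hat\beta_j)$, $t_j$, the two-sided p-value, and the 95% CI from the expressions above, then cross-checks them against statsmodels' OLS output.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3                      # n observations, p columns incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# OLS estimate: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance with n - p degrees of freedom
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Standard errors, t-statistics, two-sided p-values, 95% CIs
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stat = beta_hat / se                           # H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p)
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# Cross-check against statsmodels
res = sm.OLS(y, X).fit()
assert np.allclose(beta_hat, res.params)
assert np.allclose(se, res.bse)
assert np.allclose(p_val, res.pvalues)
assert np.allclose(ci, res.conf_int(alpha=0.05))
print("manual and statsmodels inference agree")
```

The agreement check makes explicit that, with the default (nonrobust) covariance, these textbook formulas are exactly what `statsmodels` reports in its summary table.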
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Direct uncertainty quantification for parameter estimates — helps assess evidence and estimate ranges for effect sizes.
- Closed-form in OLS under classical assumptions — computationally cheap and interpretable.
- Well-understood sampling properties (e.g., coverage probability of CI under model assumptions).
Limitations & trade-offs
- Dependence on assumptions: Validity requires correct model specification (linearity), independence, homoskedasticity (constant variance), and often normality for small samples. Violations distort SEs and therefore p-values and CI coverage.
- Multiple testing: In models with many predictors, naïve p-values inflate false positives unless corrected (Bonferroni, FDR, etc.).
- Practical vs statistical significance: Very small effects can have tiny p-values in huge datasets but be irrelevant in practice. Conversely, meaningful effects may have non-significant p-values in small samples (see the sketch after this list).
- Model selection bias: Doing variable selection (stepwise, peeking, p-hacking) invalidates standard p-values and CIs — they no longer reflect the true post-selection uncertainty.
- Asymptotic reliance: For complex models or small samples, asymptotic approximations (normality) may be poor.
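To make the large-$n$ point concrete, here is a tiny simulation (illustrative, made-up numbers, assuming `statsmodels` is installed): a slope of 0.01 that is practically negligible still earns an extremely small p-value once $n$ is in the millions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
# True slope of 0.01: real, but practically negligible in most settings.
y = 0.01 * x + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
print(f"slope = {res.params[1]:.4f}, "
      f"p-value = {res.pvalues[1]:.2e}, R^2 = {res.rsquared:.5f}")
# Typically the p-value is astronomically small even though R^2 is ~0.0001.
```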
🔍 Variants & Extensions
- Heteroskedasticity-robust SEs (White SEs): Adjust standard errors so inference remains valid under heteroskedastic errors. Useful when $\operatorname{Var}(\varepsilon_i)$ is not constant (a short sketch after this list shows robust SEs and a bootstrap CI).
- Clustered SEs: Account for within-cluster correlation (e.g., panel data, grouped data) by replacing the i.i.d. variance estimator with a block-structured, cluster-robust one.
- Bootstrap CIs / p-values: Nonparametric bootstrap or percentile/t-bootstrap for complex estimators or small samples — relaxes distributional assumptions.
- Likelihood-based intervals (Wald, Score, Likelihood-ratio): In generalized linear models (GLMs) and maximum likelihood settings, alternatives to Wald-based CIs exist; LR and score tests often have better small-sample properties.
- Bayesian credible intervals: Provide posterior intervals for parameters — interpreted differently (probability parameter lies in interval conditional on data and prior).
- Multiple-testing corrections (Bonferroni, Benjamini–Hochberg and other FDR procedures): When testing many coefficients, control the family-wise error rate or the false discovery rate.
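A minimal sketch of two of these options, assuming `statsmodels` is installed and using simulated heteroskedastic data: heteroskedasticity-robust (HC3) standard errors via the `cov_type` argument, and a percentile bootstrap CI for the slope (a pairs bootstrap that resamples rows; other bootstrap schemes exist).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 3, size=n)
# Heteroskedastic noise: the error variance grows with x.
y = 1.0 + 0.5 * x + rng.normal(scale=0.5 + x, size=n)
X = sm.add_constant(x)

# Classical vs. heteroskedasticity-robust (HC3) standard errors
res_classic = sm.OLS(y, X).fit()
res_robust = sm.OLS(y, X).fit(cov_type="HC3")
print("classic SE:", res_classic.bse[1], " robust SE:", res_robust.bse[1])

# Percentile bootstrap CI for the slope (pairs bootstrap: resample rows)
B = 2000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = sm.OLS(y[idx], X[idx]).fit().params[1]
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% percentile-bootstrap CI for slope: ({lo:.3f}, {hi:.3f})")
```

With heteroskedastic noise like this, the robust SE is usually noticeably larger than the classical one, which is exactly the gap that makes naive CIs too narrow.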
🚧 Common Challenges & Pitfalls
- Misinterpreting p-values
  - Wrong: “p = 0.03 means there’s a 3% chance the null hypothesis is true.”
  - Right: “Given the null hypothesis is true, the probability of observing data at least as extreme as we did is 3%.”
  - Implication: p-values do not give the probability that the effect is real.
- Equating statistical significance with importance
  - With large $n$, even negligible effects become statistically significant. Always check effect sizes and CIs.
- Ignoring model mis-specification
  - Nonlinearity, omitted variables, or correlated errors bias $\hat\beta$ and make SEs meaningless. Diagnostic checks (residual plots, specification tests) are essential.
- Using standard SEs with heteroskedasticity or clustering
  - Leads to wrong p-values and too-narrow CIs. Use robust or clustered SEs when appropriate.
- Selection / “p-hacking” / post-selection inference
  - Running many models and reporting the significant ones inflates false positives. Post-selection inference or held-out validation is needed.
- Interpreting CIs incorrectly
  - Wrong: “There is a 95% probability that β is in this interval.”
  - Right (frequentist): Repeating the sampling process many times, 95% of similarly constructed intervals would contain the true β (the coverage simulation after this list makes this concrete).
- Overreliance on the p < 0.05 threshold
  - Thresholds are arbitrary. Prefer continuous interpretation (strength of evidence) and report exact p-values and CIs.
- Collinearity
  - Multicollinearity inflates SEs, producing wide CIs and unstable p-values despite possibly large joint predictive power.
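To make the frequentist reading of a CI concrete, here is a minimal coverage simulation (simulated data, arbitrary true slope): repeat the sampling process many times, build a 95% CI each time, and count how often the interval contains the true β. Empirical coverage should land near 0.95.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
beta_true = 2.0            # arbitrary "true" slope for the simulation
n, n_sims = 50, 5000
covered = 0

for _ in range(n_sims):
    x = rng.normal(size=n)
    y = 1.0 + beta_true * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])

    # OLS fit and standard error for the slope
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    se_slope = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])

    # 95% CI for the slope; check whether it captures beta_true
    t_crit = stats.t.ppf(0.975, df=n - 2)
    lo, hi = beta_hat[1] - t_crit * se_slope, beta_hat[1] + t_crit * se_slope
    covered += (lo <= beta_true <= hi)

print(f"empirical coverage: {covered / n_sims:.3f}  (nominal: 0.950)")
```

Changing the nominal level, or breaking an assumption (e.g., heteroskedastic noise) and rerunning the simulation, is an easy way to see when coverage stops matching the nominal rate.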
🧭 Practical Guidance & Interview-Level Talking Points (concise)
- Always report both p-values and confidence intervals: p-values summarize the strength of evidence, while CIs show the range of plausible effect sizes.
- For real-world model assessment, emphasize stability (bootstrap or cross-validation of coefficients), predictive performance (holdout metrics), and robust SEs rather than blind reliance on significance.
- If dataset is large, interpret significance in the context of effect size and cost/impact. If dataset is small, use bootstrap or likelihood-based methods and be cautious drawing strong conclusions.
- When presenting results, note whether inference is conditional on the model specification. Distinguish causal claims (need identification strategy) from associational inference (standard regression).