Generalized Linear Models (GLMs): Extending Linear Regression
🎯 Core Idea
- Generalized Linear Models (GLMs) extend linear regression to response variables with non-Gaussian distributions by (1) specifying a distribution from the exponential family for the response, (2) connecting the conditional mean to a linear predictor via a link function, and (3) estimating parameters by maximum likelihood (or quasi-likelihood).
- Purpose: allow one unified framework (logistic for binary, Poisson for counts, Gamma for positive continuous, etc.) that preserves interpretability of linear predictors while matching the data’s noise model.
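To make the "same linear predictor, different family" idea concrete, here is a minimal sketch using statsmodels; the simulated data and coefficient values are purely illustrative, not tied to any particular application:

```python
# One linear predictor X @ beta, two different exponential-family noise models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))        # design matrix with intercept

# Binary outcome -> Binomial (Bernoulli) family with logit link
p = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, -2.0]))))
y_bin = rng.binomial(1, p)
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# Count outcome -> Poisson family with log link
mu = np.exp(X @ np.array([0.2, 0.4, -0.3]))
y_cnt = rng.poisson(mu)
pois_fit = sm.GLM(y_cnt, X, family=sm.families.Poisson()).fit()

print(logit_fit.params)   # coefficients on the log-odds scale
print(pois_fit.params)    # coefficients on the log-rate scale
```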
🌱 Intuition & Real-World Analogy
- Why: Ordinary least squares assumes additive Gaussian noise. Many real-world targets violate that: counts, binary outcomes, positive skewed durations. GLMs let you keep a linear combination of features but change how that linear combination maps to the mean and how variability behaves.
- Analogy 1 — Toolbox adapter: Think of a linear predictor $\eta = X\beta$ as the plug; the link function is an adapter that lets that plug fit different sockets (binary, counts, positive reals). The exponential-family choice sets the voltage/current (variance structure).
- Analogy 2 — Thermostat control: $X\beta$ is the thermostat setting; the link function maps it into the actual temperature; the exponential-family model describes how noisy that temperature is around the set point.
📐 Mathematical Foundation
1) Exponential family form (one-parameter canonical form)
A random variable $Y$ is in the exponential family if its density/pmf can be written
$$ f(y;\theta,\phi)=\exp\!\Big(\frac{y\,\theta - b(\theta)}{a(\phi)} + c(y,\phi)\Big), $$
where:
- $\theta$ = canonical parameter,
- $b(\theta)$ = cumulant function,
- $a(\phi)$ = dispersion function (commonly $a(\phi)=\phi$ or $a(\phi)=1$),
- $c(y,\phi)$ = normalization term.
Key relations:
$$ \mathbb{E}[Y] = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\phi). $$
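As a concrete check, the Poisson pmf $f(y;\mu)=e^{-\mu}\mu^{y}/y!$ fits this form with
$$ \theta=\log\mu,\qquad b(\theta)=e^{\theta},\qquad a(\phi)=1,\qquad c(y,\phi)=-\log y!, $$
so $\mathbb{E}[Y]=b'(\theta)=e^{\theta}=\mu$ and $\mathrm{Var}(Y)=b''(\theta)\,a(\phi)=\mu$, recovering the Poisson mean-variance equality.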
2) GLM components
- Random component: $Y_i$ independent, each with a distribution in the exponential family.
- Systematic component: linear predictor $\eta_i = x_i^\top \beta$.
- Link function: $g(\mu_i) = \eta_i$, so $\mu_i = g^{-1}(\eta_i)$.
The canonical link is the $g$ for which $g(\mu) = \theta$ (i.e., $g = (b')^{-1}$), so that $\eta = \theta$.
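For example, writing the Bernoulli pmf in exponential-family form gives $\theta=\log\frac{\mu}{1-\mu}$, so the canonical link for binary data is the logit; for the Poisson, $\theta=\log\mu$, giving the canonical log link.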
3) Likelihood & Score (brief)
Log-likelihood (for independent observations):
$$ \ell(\beta)=\sum_i \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi), $$
with $\theta_i = (b')^{-1}(\mu_i)$ and $\mu_i = g^{-1}(x_i^\top\beta)$.
Score:
$$ U(\beta)=\frac{\partial \ell}{\partial \beta}=\sum_i \frac{(y_i-\mu_i)}{a(\phi)}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i} x_i. $$
When the canonical link is used, $\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i}=1$, which simplifies the score.
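Concretely, for Bernoulli data with the canonical logit link ($a(\phi)=1$), the score collapses to
$$ U(\beta)=\sum_i (y_i-\mu_i)\,x_i = X^\top(y-\mu), $$
the familiar logistic-regression gradient.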
4) Fisher information (observed / expected)
$$ \mathcal{I}(\beta) = X^\top W X,\qquad W_{ii} = \frac{1}{\mathrm{Var}(Y_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^2. $$
This $W$ is central to IRLS (iteratively reweighted least squares).
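Under the canonical link with $a(\phi)=1$, $\frac{d\mu_i}{d\eta_i} = b''(\theta_i) = \mathrm{Var}(Y_i)$, so $W_{ii}=\mathrm{Var}(Y_i)$; for logistic regression this gives the familiar weight $W_{ii}=\mu_i(1-\mu_i)$.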
5) Estimation: IRLS (sketch)
At each iteration, compute working response $z$ and weights $W$:
$$ z_i = \eta_i + (y_i-\mu_i)\left(\frac{d\eta_i}{d\mu_i}\right),\qquad W_{ii} = \left(\frac{d\mu_i}{d\eta_i}\right)^2/\mathrm{Var}(Y_i). $$
Then update $\beta$ via weighted least squares:
$$ \beta^{(t+1)} = (X^\top W X)^{-1} X^\top W z. $$
Derivation: why IRLS arises
Start from Newton–Raphson on the log-likelihood, $\beta \leftarrow \beta - H^{-1} U$, where $H$ is the Hessian. Replacing the observed Hessian by its expectation $-X^\top W X$ (Fisher scoring) and rearranging shows that each update is exactly a weighted least-squares fit of the working response $z$ with weights $W$; with the canonical link, Newton–Raphson and Fisher scoring coincide. See standard texts for the full derivation (e.g., Hastie & Tibshirani, 1990).
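As a sanity check on the algebra above, here is a minimal from-scratch IRLS sketch for Poisson regression with the canonical log link, assuming $a(\phi)=1$ (the simulated data, variable names, and convergence tolerance are illustrative):

```python
import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-8):
    """Fit Poisson regression with log link by IRLS, assuming a(phi) = 1."""
    n, p = X.shape
    beta = np.zeros(p)                       # start at eta = 0, i.e. mu = 1
    for _ in range(max_iter):
        eta = X @ beta                       # linear predictor
        mu = np.exp(eta)                     # inverse of the log link
        W = mu                               # (dmu/deta)^2 / Var(Y) = mu^2 / mu
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * W                        # X^T diag(W) via broadcasting
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
beta_true = np.array([0.5, -0.7])
y = rng.poisson(np.exp(X @ beta_true))
print(irls_poisson(X, y))                    # should be close to beta_true
```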
⚖️ Strengths, Limitations & Trade-offs
Strengths
- Unified framework for many outcome types (binary, counts, positive continuous).
- Interpretable linear predictors: coefficients retain meaning on transformed scale (log-odds, log-rate).
- Maximum-likelihood inference: standard errors, hypothesis tests, and deviance-based model comparison are available.
- Efficient and well-understood optimization (IRLS/Newton–Raphson).
Limitations
- Model misspecification: wrong link or wrong family leads to biased inference. GLMs are sensitive to the assumed variance function.
- Overdispersion: Poisson assumes $\mathrm{Var}(Y)=\mathbb{E}[Y]$; if variance > mean, standard errors are wrong.
- Separation (binary outcomes): complete separation causes infinite MLEs in logistic regression.
- Independence assumption: GLMs assume independent observations; clustered or correlated data require extensions (GEE, mixed models).
- Linear predictor restriction: nonlinearity in features must be handled by basis expansions or interactions—GLM itself is linear in parameters.
Trade-offs
- Flexibility vs interpretability: non-canonical links or richer mean structures (basis expansions, interactions) may improve fit at the cost of more complex interpretation.
- Simplicity vs fit: the canonical link often simplifies the math; a non-canonical link may model the mean better but complicates estimation.
🔍 Variants & Extensions
- Logistic regression: binary outcomes, Bernoulli family, logit link (the canonical link). Useful for classification and odds-ratio interpretation.
- Binomial / Proportions: counts of successes out of $n$, uses logit or probit link.
- Poisson regression: counts, canonical log link; use offsets for exposure/time (see the sketch after this list).
- Negative binomial / quasi-Poisson: handle overdispersion for counts.
- Multinomial / Softmax regression: categorical outcomes with $K>2$ classes; uses generalized logit (softmax); can be framed as multivariate GLM.
- Gamma regression: positive continuous, common link: inverse or log; models heteroskedastic positive data.
- Inverse Gaussian, Tweedie: for compound or heavy-tailed positive data (Tweedie useful for mass-at-zero + continuous positive).
- Generalized Estimating Equations (GEE): relax independence by specifying working correlation for clustered data—provides robust (sandwich) SEs.
- Generalized Linear Mixed Models (GLMMs): add random effects for hierarchical/clustered data; estimation via Laplace approximation or integration (more complex).
- Quasi-likelihood models: specify mean-variance relationship without full distributional form—useful for overdispersed data.
- Regularized GLMs: add L1/L2 penalties (lasso, ridge) to handle high-dimensional predictors.
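As referenced in the Poisson bullet above, here is a sketch of a rate model with an exposure offset using statsmodels (the variable names, such as person_years, are illustrative):

```python
# Rate model: log(E[events]) = log(person_years) + X @ beta, so offset = log(exposure).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
X = sm.add_constant(rng.normal(size=(n, 1)))
person_years = rng.uniform(0.5, 5.0, size=n)            # exposure per observation
rate = np.exp(X @ np.array([-1.0, 0.6]))                 # events per person-year
events = rng.poisson(rate * person_years)

fit = sm.GLM(events, X, family=sm.families.Poisson(),
             offset=np.log(person_years)).fit()
print(fit.params)                                        # coefficients on the log-rate scale
```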
🚧 Common Challenges & Pitfalls
- Separation in logistic regression: predictor(s) perfectly separate classes → infinite coefficients. Remedies: penalized likelihood (ridge, Firth bias reduction), remove problematic predictors, or use Bayesian priors.
- Overdispersion in Poisson: naive Poisson underestimates SEs. Diagnose via the Pearson chi-square or residual deviance relative to the residual degrees of freedom (a quick check is sketched after this list); fix with quasi-Poisson or negative binomial.
- Mis-specified link: using a wrong link leads to poor fit and biased coefficients. Check residuals (deviance, Pearson), compare links via AIC/deviance where applicable.
- Interpreting coefficients incorrectly: coefficients are on the link scale. Always translate to the response scale (e.g., exponentiate log-link coefficients to get multiplicative effects).
- Ignoring offsets: counts per exposure/time require an offset (e.g., log(exposure)) to model rates; omitting offsets biases coefficients.
- Confounding of dispersion and link: poor fit can stem from the mean structure (link and predictors), the variance function, or both; diagnose both before deciding on a fix.
- Using GLM when dependence exists: clustered data with GLM yields anti-conservative inference. Use GEE/GLMM.
- Numerical instability: small cell counts, rare events, or collinearity may produce unstable estimates. Use regularization or PCA.
- Over-reliance on asymptotics: for small samples, Wald tests can be misleading; prefer likelihood ratio tests, profile likelihood, or exact methods.
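A quick overdispersion check for a fitted Poisson model; the result object pois_fit is assumed to come from a statsmodels GLM fit like the sketches above, and the 1.5 threshold is a rough rule of thumb, not a formal test:

```python
import numpy as np

# Pearson chi-square / residual df should be near 1 if the Poisson variance holds.
pearson_chi2 = np.sum(pois_fit.resid_pearson ** 2)
dispersion = pearson_chi2 / pois_fit.df_resid
print(dispersion)
if dispersion > 1.5:
    print("Evidence of overdispersion: consider quasi-Poisson or negative binomial.")
```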
📚 Reference Pointers
- Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society A. — foundational paper. https://doi.org/10.2307/2344614
- Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall. (Good for understanding smooth extensions of the linear predictor.)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning — chapter on classification and GLMs. https://web.stanford.edu/~hastie/ElemStatLearn/
- Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models — practical guide and diagnostics.
- Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models — for probabilistic perspective and exponential family. https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models
- Wikipedia — Generalized linear model overview (useful quick reference): https://en.wikipedia.org/wiki/Generalized_linear_model