Generalized Linear Models (GLMs): Linear Regression

🎯 Core Idea

  • Generalized Linear Models (GLMs) extend linear regression to response variables with non-Gaussian distributions by (1) specifying a distribution from the exponential family for the response, (2) connecting the conditional mean to a linear predictor via a link function, and (3) estimating parameters by maximum likelihood (or quasi-likelihood).
  • Purpose: allow one unified framework (logistic for binary, Poisson for counts, Gamma for positive continuous, etc.) that preserves interpretability of linear predictors while matching the data’s noise model.
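
A minimal sketch of these three choices in code, assuming statsmodels is available (the simulated arrays and names below are illustrative only): pick the Bernoulli/Binomial family, use its canonical logit link, and fit by maximum likelihood.

```python
import numpy as np
import statsmodels.api as sm

# (1) Random component: binary y from a Bernoulli distribution
# (2) Systematic component: eta = X @ beta
# (3) Link: logit (the canonical link for the Binomial family, and statsmodels' default)
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))               # intercept + 2 features
true_beta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))  # simulate binary outcomes

result = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # maximum likelihood via IRLS
print(result.params)                                         # roughly recovers true_beta
```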

🌱 Intuition & Real-World Analogy

  • Why: Ordinary least squares assumes additive Gaussian noise. Many real-world targets violate that: counts, binary outcomes, positive skewed durations. GLMs let you keep a linear combination of features but change how that linear combination maps to the mean and how variability behaves.
  • Analogy 1 — Toolbox adapter: Think of a linear predictor $\eta = X\beta$ as the plug; the link function is an adapter that lets that plug fit different sockets (binary, counts, positive reals). The exponential-family choice sets the voltage/current (variance structure).
  • Analogy 2 — Thermostat control: $X\beta$ is the thermostat setting; the link function maps it into the actual temperature; the exponential-family model describes how noisy that temperature is around the set point.

📐 Mathematical Foundation

1) Exponential family form (one-parameter canonical form)

A random variable $Y$ is in the exponential family if its density/pmf can be written

$$ f(y;\theta,\phi)=\exp\!\Big(\frac{y\,\theta - b(\theta)}{a(\phi)} + c(y,\phi)\Big), $$

where:

  • $\theta$ = canonical parameter,
  • $b(\theta)$ = cumulant function,
  • $a(\phi)$ = dispersion function (typically $a(\phi)=\phi$, with $\phi$ the dispersion parameter; $\phi=1$ for Bernoulli and Poisson),
  • $c(y,\phi)$ = normalization term.

Key relations:

$$ \mathbb{E}[Y] = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\phi). $$
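
For example, the Poisson pmf $f(y;\mu)=e^{-\mu}\mu^{y}/y!$ can be rewritten in exactly this form:

$$ f(y;\mu)=\exp\big(y\log\mu-\mu-\log y!\big), $$

so $\theta=\log\mu$, $b(\theta)=e^{\theta}$, $a(\phi)=1$, and $c(y,\phi)=-\log y!$; the key relations then give $\mathbb{E}[Y]=b'(\theta)=e^{\theta}=\mu$ and $\mathrm{Var}(Y)=b''(\theta)\cdot 1=\mu$, as expected for a Poisson.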

2) GLM components

  • Random component: $Y_i$ independent with distribution in exponential family.
  • Systematic component: linear predictor $\eta_i = x_i^\top \beta$.
  • Link function: $g(\mu_i) = \eta_i$, so $\mu_i = g^{-1}(\eta_i)$.

The canonical link is $g(\mu) = \theta$ (i.e., $g(\mu) = (b')^{-1}(\mu)$) so that $\eta = \theta$.
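
Concretely, putting the Bernoulli pmf into the exponential-family form gives $\theta=\log\frac{\mu}{1-\mu}$, so the canonical link for binary responses is the logit; the Poisson example above gives $\theta=\log\mu$, i.e., the canonical link for counts is the log:

$$ \text{Bernoulli: } g(\mu)=\log\frac{\mu}{1-\mu},\qquad \text{Poisson: } g(\mu)=\log\mu. $$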

3) Likelihood & Score (brief)

Log-likelihood (for independent observations):

$$ \ell(\beta)=\sum_i \left[\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)\right], $$

with $\theta_i = (b')^{-1}(\mu_i)$ and $\mu_i = g^{-1}(x_i^\top\beta)$.

Score:

$$ U(\beta)=\frac{\partial \ell}{\partial \beta}=\sum_i \frac{(y_i-\mu_i)}{a(\phi)}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i} x_i. $$

When the canonical link is used, $\theta_i = \eta_i$, so $\frac{\partial \theta}{\partial \mu}\frac{\partial \mu}{\partial \eta}=1$ and the score simplifies to $U(\beta)=\frac{1}{a(\phi)}\sum_i (y_i-\mu_i)\,x_i$.

4) Fisher information (expected; it coincides with the observed information under the canonical link)

$$ \mathcal{I}(\beta) = X^\top W X,\qquad W_{ii} = \frac{1}{\mathrm{Var}(Y_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^2 $$

This $W$ is central to IRLS (iteratively reweighted least squares).

5) Estimation: IRLS (sketch)

At each iteration, compute working response $z$ and weights $W$:

$$ z_i = \eta_i + (y_i-\mu_i)\left(\frac{d\eta_i}{d\mu_i}\right),\qquad W_{ii} = \left(\frac{d\mu_i}{d\eta_i}\right)^2/\mathrm{Var}(Y_i). $$

Then update $\beta$ via weighted least squares, with $z$ and $W$ evaluated at the current iterate $\beta^{(t)}$:

$$ \beta^{(t+1)} = (X^\top W X)^{-1} X^\top W z. $$

Derivation: why IRLS arises

Start from Newton–Raphson on the log-likelihood, $\beta \leftarrow \beta - H^{-1} U$, where $H$ is the Hessian (or Fisher scoring, which replaces $H$ by $-\mathcal{I}(\beta)$; the two coincide under the canonical link). After some algebra using the exponential-family identities, each step is equivalent to weighted least squares on the working response $z$ with weights $W$. See standard texts for the full derivation (McCullagh & Nelder; Hastie & Tibshirani).
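
To make the sketch concrete, here is a minimal NumPy implementation of IRLS for Poisson regression with the canonical log link (an illustrative sketch, not any library's implementation; for the Poisson/log case, $\mathrm{Var}(Y_i)=\mu_i$ and $d\mu_i/d\eta_i=\mu_i$, so $W_{ii}=\mu_i$ and $z_i=\eta_i+(y_i-\mu_i)/\mu_i$):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """IRLS for Poisson regression with the canonical log link.
    X: (n, p) design matrix (include a column of ones for an intercept); y: (n,) counts."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = np.exp(eta)                     # inverse link: mu = g^{-1}(eta)
        W = mu                               # W_ii = (dmu/deta)^2 / Var(Y_i) = mu_i
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * W                        # X^T diag(W), via broadcasting
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data: the fitted coefficients should be close to (0.5, 0.8)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))
print(irls_poisson(X, y))
```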


⚖️ Strengths, Limitations & Trade-offs

Strengths

  • Unified framework for many outcome types (binary, counts, positive continuous).
  • Interpretable linear predictors: coefficients retain meaning on transformed scale (log-odds, log-rate).
  • Maximum-likelihood inference: standard errors, hypothesis tests, and deviance-based model comparison are available.
  • Efficient and well-understood optimization (IRLS/Newton–Raphson).

Limitations

  • Model misspecification: wrong link or wrong family leads to biased inference. GLMs are sensitive to the assumed variance function.
  • Overdispersion: Poisson assumes $\mathrm{Var}(Y)=\mathbb{E}[Y]$; if the variance exceeds the mean, standard errors are underestimated.
  • Separation (binary outcomes): complete separation causes infinite MLEs in logistic regression.
  • Independence assumption: GLMs assume independent observations; clustered or correlated data require extensions (GEE, mixed models).
  • Linear predictor restriction: nonlinearity in features must be handled by basis expansions or interactions—GLM itself is linear in parameters.

Trade-offs

  • Flexibility vs interpretability: non-canonical or more flexible links may improve fit at the cost of coefficients that are harder to interpret.
  • Simplicity vs fit: the canonical link simplifies the score and estimation; a non-canonical link may model the mean better but complicates estimation and inference.

🔍 Variants & Extensions

  • Logistic regression: binary outcomes, Bernoulli family, logit link (the canonical link). Useful for classification and odds-ratio interpretation.
  • Binomial / Proportions: counts of successes out of $n$, uses logit or probit link.
  • Poisson regression: counts, canonical log link; use offsets for exposure/time (see the sketch after this list).
  • Negative binomial / quasi-Poisson: handle overdispersion for counts.
  • Multinomial / Softmax regression: categorical outcomes with $K>2$ classes; uses generalized logit (softmax); can be framed as multivariate GLM.
  • Gamma regression: positive continuous, common link: inverse or log; models heteroskedastic positive data.
  • Inverse Gaussian, Tweedie: for compound or heavy-tailed positive data (Tweedie useful for mass-at-zero + continuous positive).
  • Generalized Estimating Equations (GEE): relax independence by specifying working correlation for clustered data—provides robust (sandwich) SEs.
  • Generalized Linear Mixed Models (GLMMs): add random effects for hierarchical/clustered data; estimation via Laplace approximation or integration (more complex).
  • Quasi-likelihood models: specify mean-variance relationship without full distributional form—useful for overdispersed data.
  • Regularized GLMs: add L1/L2 penalties (lasso, ridge) to handle high-dimensional predictors.
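
A usage sketch with statsmodels (assuming statsmodels and pandas are installed; the data frame and column names below are made up for illustration): a Poisson rate model with a log-exposure offset, plus a negative-binomial refit as one overdispersion remedy.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up claims-style data: counts observed over varying exposure times
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=1000),
    "exposure": rng.uniform(0.5, 2.0, size=1000),
})
df["claims"] = rng.poisson(df["exposure"] * np.exp(-2.0 + 0.03 * df["age"]))

# Poisson GLM for a rate: log(exposure) enters as an offset (coefficient fixed at 1)
poisson_fit = smf.glm("claims ~ age", data=df,
                      family=sm.families.Poisson(),
                      offset=np.log(df["exposure"])).fit()
print(poisson_fit.summary())

# If the counts are overdispersed, a negative binomial family is one drop-in alternative
nb_fit = smf.glm("claims ~ age", data=df,
                 family=sm.families.NegativeBinomial(),
                 offset=np.log(df["exposure"])).fit()
```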

🚧 Common Challenges & Pitfalls

  • Separation in logistic regression: predictor(s) perfectly separate classes → infinite coefficients. Remedies: penalized likelihood (ridge, Firth bias reduction), remove problematic predictors, or use Bayesian priors.
  • Overdispersion in Poisson: naive Poisson underestimates SEs. Diagnose via Pearson chi-square / residual deviance (a minimal check is sketched after this list); fix with quasi-Poisson or NB.
  • Mis-specified link: using a wrong link leads to poor fit and biased coefficients. Check residuals (deviance, Pearson), compare links via AIC/deviance where applicable.
  • Interpreting coefficients incorrectly: coefficients are on the link scale. Always translate to the response scale (e.g., exponentiate log-link coefficients to get multiplicative effects).
  • Ignoring offsets: counts per exposure/time require an offset (e.g., log(exposure)) to model rates; omitting offsets biases coefficients.
  • Confounding of dispersion and link: sometimes poor fit arises from both mean structure and variance; consider both before fixing.
  • Using GLM when dependence exists: clustered data with GLM yields anti-conservative inference. Use GEE/GLMM.
  • Numerical instability: small cell counts, rare events, or collinearity may produce unstable estimates. Use regularization or PCA.
  • Over-reliance on asymptotics: for small samples, Wald tests can be misleading; prefer likelihood ratio tests, profile likelihood, or exact methods.
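
A rough check for the overdispersion pitfall above, assuming a fitted statsmodels result like the hypothetical `poisson_fit` from the earlier sketch (statsmodels exposes the Pearson statistic and residual degrees of freedom on the results object):

```python
# Pearson chi-square / residual df should be near 1 under the Poisson assumption;
# values well above 1 suggest overdispersion (consider quasi-Poisson or negative binomial).
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"Pearson dispersion estimate: {dispersion:.2f}")
```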

📚 Reference Pointers

  • McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.
  • Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall.