Generalized Linear Models (GLMs): Linear Regression

🎯 Core Idea

  • Generalized Linear Models (GLMs) extend linear regression to response variables with non-Gaussian distributions by (1) specifying a distribution from the exponential family for the response, (2) connecting the conditional mean to a linear predictor via a link function, and (3) estimating parameters by maximum likelihood (or quasi-likelihood).
  • Purpose: allow one unified framework (logistic for binary, Poisson for counts, Gamma for positive continuous, etc.) that preserves interpretability of linear predictors while matching the data’s noise model.
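
A minimal sketch of these three choices in code, assuming statsmodels is available (the simulated arrays and names below are illustrative only): pick the Bernoulli/Binomial family, use its canonical logit link, and fit by maximum likelihood.

```python
import numpy as np
import statsmodels.api as sm

# (1) Random component: binary y from a Bernoulli distribution
# (2) Systematic component: eta = X @ beta
# (3) Link: logit (the canonical link for the Binomial family, and statsmodels' default)
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))               # intercept + 2 features
true_beta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))  # simulate binary outcomes

result = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # maximum likelihood via IRLS
print(result.params)                                         # roughly recovers true_beta
```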

🌱 Intuition & Real-World Analogy

  • Why: Ordinary least squares assumes additive Gaussian noise. Many real-world targets violate that: counts, binary outcomes, positive skewed durations. GLMs let you keep a linear combination of features but change how that linear combination maps to the mean and how variability behaves.
  • Analogy 1 — Toolbox adapter: Think of a linear predictor $\eta = X\beta$ as the plug; the link function is an adapter that lets that plug fit different sockets (binary, counts, positive reals). The exponential-family choice sets the voltage/current (variance structure).
  • Analogy 2 — Thermostat control: $X\beta$ is the thermostat setting; the link function maps it into the actual temperature; the exponential-family model describes how noisy that temperature is around the set point.

📐 Mathematical Foundation

1) Exponential family form (one-parameter canonical form)

A random variable $Y$ is in the exponential family if its density/pmf can be written

$$ f(y;\theta,\phi)=\exp\!\Big(\frac{y\,\theta - b(\theta)}{a(\phi)} + c(y,\phi)\Big), $$

where:

  • $\theta$ = canonical parameter,
  • $b(\theta)$ = cumulant function,
  • $a(\phi)$ = dispersion function (typically $a(\phi)=\phi$, with $\phi$ the dispersion parameter; $\phi=1$ for Bernoulli and Poisson),
  • $c(y,\phi)$ = normalization term.

Key relations:

$$ \mathbb{E}[Y] = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\phi). $$
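
For example, the Poisson pmf $f(y;\mu)=e^{-\mu}\mu^{y}/y!$ can be rewritten in exactly this form:

$$ f(y;\mu)=\exp\big(y\log\mu-\mu-\log y!\big), $$

so $\theta=\log\mu$, $b(\theta)=e^{\theta}$, $a(\phi)=1$, and $c(y,\phi)=-\log y!$; the key relations then give $\mathbb{E}[Y]=b'(\theta)=e^{\theta}=\mu$ and $\mathrm{Var}(Y)=b''(\theta)\cdot 1=\mu$, as expected for a Poisson.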

2) GLM components

  • Random component: $Y_i$ independent with distribution in exponential family.
  • Systematic component: linear predictor $\eta_i = x_i^\top \beta$.
  • Link function: $g(\mu_i) = \eta_i$, so $\mu_i = g^{-1}(\eta_i)$.

The canonical link is $g(\mu) = \theta$ (i.e., $g(\mu) = (b')^{-1}(\mu)$) so that $\eta = \theta$.
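
Concretely, putting the Bernoulli pmf into the exponential-family form gives $\theta=\log\frac{\mu}{1-\mu}$, so the canonical link for binary responses is the logit; the Poisson example above gives $\theta=\log\mu$, i.e., the canonical link for counts is the log:

$$ \text{Bernoulli: } g(\mu)=\log\frac{\mu}{1-\mu},\qquad \text{Poisson: } g(\mu)=\log\mu. $$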

3) Likelihood & Score (brief)

Log-likelihood (for independent observations):

$$ \ell(\beta)=\sum_i \left[\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)\right], $$

with $\theta_i = (b')^{-1}(\mu_i)$ and $\mu_i = g^{-1}(x_i^\top\beta)$.

Score:

$$ U(\beta)=\frac{\partial \ell}{\partial \beta}=\sum_i \frac{(y_i-\mu_i)}{a(\phi)}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i} x_i. $$

When the canonical link is used, $\theta_i = \eta_i$, so $\frac{\partial \theta}{\partial \mu}\frac{\partial \mu}{\partial \eta}=1$ and the score simplifies to $U(\beta)=\frac{1}{a(\phi)}\sum_i (y_i-\mu_i)\,x_i$.

4) Fisher information (expected; it coincides with the observed information under the canonical link)

$$ \mathcal{I}(\beta) = X^\top W X,\qquad W_{ii} = \frac{1}{\mathrm{Var}(Y_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^2 $$

This $W$ is central to IRLS (iteratively reweighted least squares).

5) Estimation: IRLS (sketch)

At each iteration, compute working response $z$ and weights $W$:

$$ z_i = \eta_i + (y_i-\mu_i)\left(\frac{d\eta_i}{d\mu_i}\right),\qquad W_{ii} = \left(\frac{d\mu_i}{d\eta_i}\right)^2/\mathrm{Var}(Y_i). $$

Then update $\beta$ via weighted least squares, with $z$ and $W$ evaluated at the current iterate $\beta^{(t)}$:

$$ \beta^{(t+1)} = (X^\top W X)^{-1} X^\top W z. $$

Derivation: why IRLS arises

Start from Newton–Raphson on the log-likelihood, $\beta \leftarrow \beta - H^{-1} U$, where $H$ is the Hessian (or Fisher scoring, which replaces $H$ by $-\mathcal{I}(\beta)$; the two coincide under the canonical link). After some algebra using the exponential-family identities, each step is equivalent to weighted least squares on the working response $z$ with weights $W$. See standard texts for the full derivation (McCullagh & Nelder; Hastie & Tibshirani).
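
To make the sketch concrete, here is a minimal NumPy implementation of IRLS for Poisson regression with the canonical log link (an illustrative sketch, not any library's implementation; for the Poisson/log case, $\mathrm{Var}(Y_i)=\mu_i$ and $d\mu_i/d\eta_i=\mu_i$, so $W_{ii}=\mu_i$ and $z_i=\eta_i+(y_i-\mu_i)/\mu_i$):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """IRLS for Poisson regression with the canonical log link.
    X: (n, p) design matrix (include a column of ones for an intercept); y: (n,) counts."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = np.exp(eta)                     # inverse link: mu = g^{-1}(eta)
        W = mu                               # W_ii = (dmu/deta)^2 / Var(Y_i) = mu_i
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * W                        # X^T diag(W), via broadcasting
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data: the fitted coefficients should be close to (0.5, 0.8)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))
print(irls_poisson(X, y))
```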


⚖️ Strengths, Limitations & Trade-offs

Strengths

  • Unified framework for many outcome types (binary, counts, positive continuous).
  • Interpretable linear predictors: coefficients retain meaning on transformed scale (log-odds, log-rate).
  • Maximum-likelihood inference: standard errors, hypothesis tests, and deviance-based model comparison are available.
  • Efficient and well-understood optimization (IRLS/Newton–Raphson).

Limitations

  • Model misspecification: wrong link or wrong family leads to biased inference. GLMs are sensitive to the assumed variance function.
  • Overdispersion: Poisson assumes $\mathrm{Var}(Y)=\mathbb{E}[Y]$; if the variance exceeds the mean, standard errors are underestimated.
  • Separation (binary outcomes): complete separation causes infinite MLEs in logistic regression.
  • Independence assumption: GLMs assume independent observations; clustered or correlated data require extensions (GEE, mixed models).
  • Linear predictor restriction: nonlinearity in features must be handled by basis expansions or interactions—GLM itself is linear in parameters.

Trade-offs

  • Flexibility vs interpretability: non-canonical or more flexible links may improve fit at the cost of coefficients that are harder to interpret.
  • Simplicity vs fit: the canonical link simplifies the score and estimation; a non-canonical link may model the mean better but complicates estimation and inference.

🔍 Variants & Extensions

  • Logistic regression: binary outcomes, Bernoulli family, logit link (the canonical link). Useful for classification and odds-ratio interpretation.
  • Binomial / Proportions: counts of successes out of $n$, uses logit or probit link.
  • Poisson regression: counts, canonical log link; use offsets for exposure/time (see the sketch after this list).
  • Negative binomial / quasi-Poisson: handle overdispersion for counts.
  • Multinomial / Softmax regression: categorical outcomes with $K>2$ classes; uses generalized logit (softmax); can be framed as multivariate GLM.
  • Gamma regression: positive continuous, common link: inverse or log; models heteroskedastic positive data.
  • Inverse Gaussian, Tweedie: for compound or heavy-tailed positive data (Tweedie useful for mass-at-zero + continuous positive).
  • Generalized Estimating Equations (GEE): relax independence by specifying working correlation for clustered data—provides robust (sandwich) SEs.
  • Generalized Linear Mixed Models (GLMMs): add random effects for hierarchical/clustered data; estimation via Laplace approximation or integration (more complex).
  • Quasi-likelihood models: specify mean-variance relationship without full distributional form—useful for overdispersed data.
  • Regularized GLMs: add L1/L2 penalties (lasso, ridge) to handle high-dimensional predictors.
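
A usage sketch with statsmodels (assuming statsmodels and pandas are installed; the data frame and column names below are made up for illustration): a Poisson rate model with a log-exposure offset, plus a negative-binomial refit as one overdispersion remedy.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up claims-style data: counts observed over varying exposure times
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=1000),
    "exposure": rng.uniform(0.5, 2.0, size=1000),
})
df["claims"] = rng.poisson(df["exposure"] * np.exp(-2.0 + 0.03 * df["age"]))

# Poisson GLM for a rate: log(exposure) enters as an offset (coefficient fixed at 1)
poisson_fit = smf.glm("claims ~ age", data=df,
                      family=sm.families.Poisson(),
                      offset=np.log(df["exposure"])).fit()
print(poisson_fit.summary())

# If the counts are overdispersed, a negative binomial family is one drop-in alternative
nb_fit = smf.glm("claims ~ age", data=df,
                 family=sm.families.NegativeBinomial(),
                 offset=np.log(df["exposure"])).fit()
```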

🚧 Common Challenges & Pitfalls

  • Separation in logistic regression: predictor(s) perfectly separate classes → infinite coefficients. Remedies: penalized likelihood (ridge, Firth bias reduction), remove problematic predictors, or use Bayesian priors.
  • Overdispersion in Poisson: naive Poisson underestimates SEs. Diagnose via Pearson chi-square / residual deviance (a minimal check is sketched after this list); fix with quasi-Poisson or NB.
  • Mis-specified link: using a wrong link leads to poor fit and biased coefficients. Check residuals (deviance, Pearson), compare links via AIC/deviance where applicable.
  • Interpreting coefficients incorrectly: coefficients are on the link scale. Always translate to the response scale (e.g., exponentiate log-link coefficients to get multiplicative effects).
  • Ignoring offsets: counts per exposure/time require an offset (e.g., log(exposure)) to model rates; omitting offsets biases coefficients.
  • Confounding of dispersion and link: sometimes poor fit arises from both mean structure and variance; consider both before fixing.
  • Using GLM when dependence exists: clustered data with GLM yields anti-conservative inference. Use GEE/GLMM.
  • Numerical instability: small cell counts, rare events, or collinearity may produce unstable estimates. Use regularization or PCA.
  • Over-reliance on asymptotics: for small samples, Wald tests can be misleading; prefer likelihood ratio tests, profile likelihood, or exact methods.
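
A rough check for the overdispersion pitfall above, assuming a fitted statsmodels result like the hypothetical `poisson_fit` from the earlier sketch (statsmodels exposes the Pearson statistic and residual degrees of freedom on the results object):

```python
# Pearson chi-square / residual df should be near 1 under the Poisson assumption;
# values well above 1 suggest overdispersion (consider quasi-Poisson or negative binomial).
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"Pearson dispersion estimate: {dispersion:.2f}")
```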

📚 Reference Pointers

  • McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.
  • Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall.