Linear Regression System Design: Interview Framework
1. Problem Framing & Scoping
- Understand the Goal: Be explicit about the business objective you are solving with linear regression. Example goals:
  - Predicting house prices to power a listing-price suggestion feature (minimize absolute price error).
  - Forecasting daily ad revenue for budgeting (minimize percent error / MAPE).
  - Estimating customer lifetime spend as a continuous target for ranking offers.
- Define the User Experience: Who consumes the predictions, and how fresh must they be?
  - UI: real-time price suggestion shown to sellers at listing time (latency < 200 ms).
  - Batch report: nightly revenue forecast used by finance (latency tolerance of hours).
- Establish Scope & Constraints: Drill down on the key constraints:
  - Scale: rows (10K → 100M), features (10 → 10K), and the cardinality of categorical features (e.g., city with 10K unique values).
  - Latency: real-time (< 200 ms) vs. batch (hourly/daily).
  - Data freshness: streaming clicks vs. monthly aggregated attributes.
  - Error tolerance: Is a ±5% error acceptable, or do we need calibrated uncertainty estimates?
2. Data & Feature Engineering
- Data Sources (linear-regression specific):
  - House price example: listing features (sqft, bedrooms), location (zip code), transaction history, days-on-market, recent comparable sales.
  - Revenue forecast example: historical daily revenue, holidays, marketing spend, active campaigns, seasonality signals.
- Label Generation: Set up the continuous target.
  - House price: use the final sale price (log-transform if skewed).
  - Revenue: daily total revenue (consider per-user normalization).
  - Document label-smoothing choices, e.g., a next-30-day average instead of a single day's value to reduce noise.
- Feature Brainstorming (concrete for linear models):
  - Numerical: size, age, price per sqft, historical averages.
  - Categorical: neighborhood (one-hot or target-encode), property type.
  - Interaction / polynomial: sqft × bedrooms, age² to capture non-linearity.
  - Temporal: day-of-week, month, days-since-listing.
- Preprocessing & Practicalities (a pipeline sketch follows this list):
  - Missing values: impute the median for numeric features; add a "missing" category for categoricals.
  - Scaling: standardize numeric features (critical for gradient descent and regularization).
  - Encoding: prefer target/mean encoding for high-cardinality categories; beware leakage and compute encodings on training folds only.
  - Skew: log-transform positively skewed targets and features (e.g., price).
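A minimal sketch of such a pipeline, assuming scikit-learn (version ≥ 1.3 for TargetEncoder) and hypothetical column names:

```python
# Reusable preprocessing + model pipeline; the same fitted object is
# serialized and reused at inference time for train/serve consistency.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, TargetEncoder

numeric = ["sqft", "age"]   # hypothetical numeric columns
categorical = ["zip"]       # hypothetical high-cardinality column

preprocess = ColumnTransformer([
    # Median imputation + standardization for numeric features.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Smoothed target encoding, cross-fitted internally to avoid the
    # leakage warned about above.
    ("cat", TargetEncoder(smooth="auto"), categorical),
])

model = Pipeline([("prep", preprocess), ("reg", Ridge(alpha=1.0))])
# model.fit(X_train, y_train)
```

Fitting and serializing the whole pipeline as one artifact is what makes the train/serve consistency discussed later cheap to enforce.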
- Feature Store / Reproducibility: Define the exact pipelines (vectorizers, encoders, scalers) so they can be reused at inference time.
Note: For linear models, engineered features matter more than model complexity. Show concrete reasoning: e.g., instead of reaching for deep nets, adding a sqft × age interaction can capture depreciation.
3. Model Selection: Baseline & Iterations
- Start with a Simple Baseline (sketched below):
  - Zero model: predict the training mean or median. For house prices, the median is robust to outliers.
  - Rule-based baseline: e.g., average price-per-sqft in the neighborhood × sqft.
  - Why: it provides a floor that any ML model must beat.
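A minimal sketch of both baselines, assuming pandas and hypothetical column names ("price", "sqft", "neighborhood"):

```python
import pandas as pd

def baseline_predictions(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    out = test.copy()
    # Zero model: predict the training median (robust to outlier prices).
    out["pred_median"] = train["price"].median()
    # Rule-based model: neighborhood average price-per-sqft × the listing's sqft.
    pps = (train["price"] / train["sqft"]).groupby(train["neighborhood"]).mean()
    global_pps = (train["price"] / train["sqft"]).mean()  # fallback for unseen neighborhoods
    out["pred_rule"] = out["neighborhood"].map(pps).fillna(global_pps) * out["sqft"]
    return out
```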
- Initial Model Proposal:
  - Closed-form OLS (analytic solution) for small-to-medium data: fast, interpretable, and gives coefficient variance estimates.
  - Iterative solver (batch/SGD) for large data where matrix inversion is infeasible.
  - Regularized linear models: Ridge (L2) for multicollinearity; Lasso (L1) if you expect sparsity in the features.
- Concrete Trade-offs for Linear Regression (contrasted in the sketch below):
  - OLS gives the exact solution, but inversion is O(d³), which is infeasible when d is large (e.g., many one-hot columns).
  - SGD / mini-batch: scales to large n and supports streaming retraining, but needs learning-rate tuning.
  - Lasso can zero out coefficients, which is useful for feature selection when you have engineered thousands of interaction terms.
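A sketch contrasting the two solver families on synthetic data (numpy and scikit-learn assumed; in practice X would be the standardized design matrix):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Closed-form OLS via least squares: exact, and lstsq avoids explicitly
# inverting X'X, but the cost still blows up as d reaches many thousands.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# SGD: linear in n per epoch and supports streaming retraining via
# partial_fit, at the price of learning-rate tuning.
sgd = SGDRegressor(penalty="l2", alpha=1e-4)
sgd.partial_fit(X, y)  # call repeatedly as new mini-batches arrive
```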
- Advanced Iterations (when to move beyond linear):
  - Add polynomial features and strong regularization before moving to nonlinear models.
  - If residual analysis shows structured nonlinearity, consider tree-based models or neural nets.
- Cold Start / New Categories:
  - Use global averages or hierarchical priors (e.g., global → city → neighborhood) and gradually backfill with target encodings computed with smoothing (see the sketch below).
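A minimal sketch of smoothed target encoding with a hierarchical fallback, assuming pandas and hypothetical column names; m is the smoothing pseudo-count:

```python
import pandas as pd

def smoothed_mean(stats: pd.DataFrame, prior, m: float = 50.0) -> pd.Series:
    # stats has "mean" and "count" columns; categories with few
    # observations shrink toward the prior.
    return (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

global_mean = train["price"].mean()

city_stats = train.groupby("city")["price"].agg(["mean", "count"])
city_enc = smoothed_mean(city_stats, prior=global_mean)

nbhd_stats = train.groupby(["city", "neighborhood"])["price"].agg(["mean", "count"])
# Each neighborhood shrinks toward its city's encoding, not the global mean.
city_prior = city_enc.reindex(nbhd_stats.index.get_level_values("city")).to_numpy()
nbhd_enc = smoothed_mean(nbhd_stats, prior=city_prior)
```

A brand-new neighborhood then falls back to its city's encoding, and a brand-new city falls back to the global mean.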
4. System Architecture & Serving
- High-Level Design (tailored to linear regression):
  - Offline training pipeline → model artifact (weights + preprocessing metadata) → feature store → serving layer (batch precompute or online API).
  - For price suggestions: a real-time API that loads the preprocessing pipeline and weight vector and applies ŷ = Xβ.
- Batch vs. Real-time Trade-off:
  - Batch: precompute predictions for all listings nightly. Good when features update slowly (e.g., tax data).
  - Real-time: compute the prediction at request time for user edits (e.g., changed sqft). Ensure preprocessors are low-latency (no heavy joins).
- Feature Store & Consistency: Store canonical feature transforms and version them. At serve time, use the same scaler/encoders used in training.
- Serving API Contract (example; a minimal handler sketch follows):
  - Request: POST /predict_price with body { "sqft": ..., "bedrooms": ..., "zip": ..., "listing_date": ... }
  - Response: { "predicted_price": ..., "std_error": ..., "model_version": "v2025-09-22" }
- Scalability Considerations Specific to Regression:
  - The model itself is tiny (a weight vector) and cheap to distribute; the main cost is feature assembly (joins, encodings).
  - For heavy categorical encodings, precompute embedding/bucketed maps in the feature store to avoid runtime DB hits.
5. Training & Evaluation Strategy
- Training Pipeline:
  - Daily/weekly batch job: extract the last N months → featurize (with the same pipeline) → train with cross-validation → evaluate → push to the model registry if it passes the quality gates.
  - For huge datasets: sample stratified by a key variable (e.g., city) or use online learning (SGD) to continuously update the weights.
- Offline Evaluation Metrics (aligned to the goal; see the sketch below):
  - Regression metrics: RMSE, MAE, R²; for skewed targets, prefer MAE or median absolute error.
  - Business metrics: percent of predictions within ±10% of the true price; MAPE for revenue.
  - Calibration: predicted-vs-actual plots, residual histograms, and coverage of prediction intervals.
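A minimal sketch of the business-aligned metrics, assuming numpy:

```python
import numpy as np

def within_pct(y_true: np.ndarray, y_pred: np.ndarray, pct: float = 0.10) -> float:
    # Fraction of predictions within ±10% of the true price.
    return float(np.mean(np.abs(y_pred - y_true) <= pct * np.abs(y_true)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute percentage error; assumes y_true is strictly positive.
    return float(np.mean(np.abs((y_pred - y_true) / y_true)))
```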
- Data Splitting Strategy:
  - Time-based split for forecasting tasks (train on t0..tT, validate on T+1..T+k).
  - Grouped split when the data has grouped correlations (e.g., house transactions grouped by neighborhood) to avoid leakage.
- Model Selection & Hyperparameter Tuning (sketched below):
  - Grid/random search for the regularization strength (α in Ridge/Lasso).
  - Use cross-validation folds that respect time order or groups.
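A minimal sketch of time-respecting tuning for Ridge's α, assuming scikit-learn and that X, y are sorted by date:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)  # each fold validates strictly after its training window
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},
    scoring="neg_mean_absolute_error",  # MAE matches the skewed-target guidance above
    cv=tscv,
)
# search.fit(X, y); search.best_params_["alpha"] gives the chosen strength.
# For grouped data, swap in GroupKFold and pass groups= to fit().
```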
- Diagnostic Checks (specific to linear regression; see the sketch below):
  - Residual plots vs. predictions and vs. each feature: look for heteroscedasticity or nonlinearity.
  - VIF (Variance Inflation Factor) to detect multicollinearity.
  - Outlier influence: leverage and Cook's distance to spot high-impact points.
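A minimal diagnostics sketch with statsmodels (X_df is a pandas DataFrame of features, y the target):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_df)
fit = sm.OLS(y, X_const).fit()

# VIF per column: values above roughly 5-10 suggest problematic multicollinearity.
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns)}

# Cook's distance per observation: large values flag high-influence points.
cooks_d, _ = fit.get_influence().cooks_distance
```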
Note: Report both statistical significance (p-values) and practical significance (effect size). In production, stable predictive power often wins over marginal p-value improvements.
6. Monitoring & Failsafes
- Performance Monitoring (what to track for linear regression):
  - Model metrics: daily RMSE/MAE on a held-out streaming validation sample; the distribution of residuals.
  - Input drift: monitor feature distributions (mean, std) and the cardinality of categorical features (new zip codes).
  - Prediction health: the fraction of predictions with very large residuals; sudden shifts in the mean prediction.
  - System metrics: latency of feature assembly and prediction; API error rate.
- Drift & Alerting Strategy (a PSI sketch follows this list):
  - Statistical tests: KL divergence or the Population Stability Index (PSI) for feature drift.
  - Residual drift: track the mean residual over time; a steady bias indicates concept drift.
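A minimal PSI sketch for one numeric feature, assuming numpy; bins are fixed from the training (expected) distribution, and the feature is assumed continuous enough that the quantile edges are distinct:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 watch, > 0.25 alert.
```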
- Failsafes & Fallbacks:
  - If the pipeline fails or predictions spike: fall back to a simple rule (neighborhood average) or the last known-good model.
  - Canary deployments: route a small percentage of traffic to the new model and compare production residuals before a full rollout.
- Retraining Triggers: Scheduled retrain (e.g., weekly), plus an event-based retrain if a drift metric exceeds its threshold.
- Explainability & Auditing: Log feature vectors and prediction reasons (top contributing features) for samples flagged by stakeholders (see the sketch below).
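For a linear model, prediction reasons fall directly out of the weights: each feature contributes weight × value to the prediction. A minimal sketch, assuming numpy and hypothetical feature names:

```python
import numpy as np

def top_contributions(weights: np.ndarray, x: np.ndarray,
                      names: list[str], k: int = 3) -> list[tuple[str, float]]:
    contrib = weights * x  # per-feature contribution to the prediction
    order = np.argsort(-np.abs(contrib))[:k]
    return [(names[i], float(contrib[i])) for i in order]

# Example audit-log entry (hypothetical values):
# [("sqft", 112500.0), ("zip_target_enc", 48200.0), ("age", -15300.0)]
```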
7. Iteration & A/B Testing
- A/B Testing Design for Regression Outputs (an analysis sketch follows this list):
  - Randomize users or listings into control (current system) vs. treatment (new regression model).
  - Primary evaluation: the business metric (e.g., conversion rate after showing the predicted price, revenue uplift). Secondary: prediction accuracy (MAE) and user-facing KPIs.
  - Choose appropriate statistical tests for continuous outcomes (compare means with a t-test or bootstrap; account for metric variance when computing sample sizes).
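A minimal sketch of the analysis step, assuming numpy/scipy: a Welch t-test for the difference in means plus a bootstrap confidence interval for the uplift:

```python
import numpy as np
from scipy import stats

def compare_groups(control: np.ndarray, treatment: np.ndarray, n_boot: int = 10_000):
    # Welch's t-test: does not assume equal variance across groups.
    _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    rng = np.random.default_rng(0)
    diffs = [rng.choice(treatment, treatment.size).mean()
             - rng.choice(control, control.size).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% bootstrap CI for the uplift
    return p_value, (lo, hi)
```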
- Ramp & Safety: Start with a small percentage of traffic → monitor key metrics (residuals, business metrics) → ramp progressively.
- Feedback Loop:
  - Capture interactions (accept/reject of the suggested price, final sale price) and append them to the training store with timestamps and metadata.
  - Use logged production data for periodic retraining or online updates.
- Iterative Improvements (linear-specific):
  - Add targeted interaction/polynomial terms that fix observed residual patterns.
  - If residuals show non-linear structure, try piecewise-linear models (segmented regressions) before jumping to complex models.
- Postmortem & Learning: After each A/B test, analyze feature importance, residual slices (by zip, price band), and limitations to prioritize the next round of engineering work.