Linear Regression System Design: Interview Framework

1. Problem Framing & Scoping

  • Understand the Goal: Be explicit about the business objective you are solving with linear regression. Example goals:

    • Predicting house prices to power a listing price suggestion feature (minimize absolute price error).
    • Forecasting daily ad revenue for budgeting (minimize percent error / MAPE).
    • Estimating customer lifetime spend as a continuous target for ranking offers.
  • Define the User Experience: Who consumes the predictions and how fresh must they be?

    • UI: real-time price suggestion shown to sellers at listing time (latency <200ms).
    • Batch report: nightly revenue forecast used by finance (latency tolerance hours).
  • Establish Scope & Constraints:

    • Scale: rows (10K → 100M), features (10 → 10K), and cardinality of categorical features (e.g., city has 10k unique values).
    • Latency: real-time (<200ms) vs batch (hourly/daily).
    • Data freshness: streaming clicks vs monthly aggregated attributes.
    • Error tolerance: Is a ±5% error acceptable, or do we need calibrated uncertainty estimates?
ℹ️
I want to hear you convert vague product goals into measurable ML objectives. Candidates who skip clarifying the objective (loss to minimize, acceptable latency, scale) lose points quickly. Phrase requirements as concrete metrics we can optimize.

2. Data & Feature Engineering

  • Data Sources (Linear Regression specific):

    • For house price example: listing features (sqft, bedrooms), location (zip code), transaction history, days-on-market, recent comparable sales.
    • For revenue forecast: historical daily revenue, holidays, marketing spend, active campaigns, seasonality signals.
  • Label Generation: Continuous target setup.

    • House price: use final sale price (log-transform if skewed).
    • Revenue: daily total revenue (consider per-user normalization).
    • Document label-definition choices: e.g., averaging over the next 30 days vs. using a single day's value to reduce noise.
  • Feature Brainstorming (concrete for linear models):

    • Numerical: size, age, price per sqft, historical averages.
    • Categorical: neighborhood (one-hot or target-encode), property type.
    • Interaction / polynomial: sqft × bedrooms, (age)^2 to capture non-linearity.
    • Temporal features: day-of-week, month, days-since-listing.
  • Preprocessing & Practicalities:

    • Missing values: impute median for numeric, “missing” category for categorical.
    • Scaling: Standardize numeric features (critical for gradient descent & regularization).
    • Encoding: prefer target/mean encoding for high-cardinality categories; beware leakage—compute encodings on training folds only.
    • Transform skewed targets/features: log-target for positive-skewed price data.
  • Feature Store / Reproducibility: Define exact pipelines (vectorizers, encoders, scalers) to re-use at inference time.
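A minimal sketch of such a pipeline using scikit-learn, assuming a pandas DataFrame with illustrative column names (sqft, age, zip, property_type); serializing the fitted object gives one artifact that guarantees identical preprocessing at inference:

```python
# Sketch: one pipeline for preprocessing + model, serialized as a single artifact.
# Column names below are illustrative, not from a real schema.
import joblib
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["sqft", "age"]
categorical = ["zip", "property_type"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# Log-transform the skewed price target; predictions are mapped back automatically.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))]),
    func=np.log1p, inverse_func=np.expm1,
)

# model.fit(X_train, y_train)
# joblib.dump(model, "price_model_v1.joblib")  # load this same artifact at serve time
```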

Note: For linear models, engineered features matter more than model complexity. Show concrete reasoning: e.g., instead of deep nets, adding a sqft × age interaction can capture depreciation.

3. Model Selection: Baseline & Iterations

  • Start with a Simple Baseline:

    • Zero model: predict training mean or median. For house prices, median is robust to outliers.
    • Rule-based baseline: e.g., average price-per-sqft in the neighborhood × sqft.
    • Why: provides a floor; any ML model must beat this.
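As a rough sketch (assuming X_train/X_valid are pandas DataFrames with sqft and zip columns, and y_train/y_valid are aligned Series), both baselines take a few lines:

```python
# Sketch: baseline floors that any trained model must beat on held-out data.
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

median_baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
print("median MAE:", mean_absolute_error(y_valid, median_baseline.predict(X_valid)))

# Rule-based baseline: neighborhood average price-per-sqft × sqft.
ppsf = (y_train / X_train["sqft"]).groupby(X_train["zip"]).mean()
global_ppsf = (y_train / X_train["sqft"]).mean()           # fallback for unseen zips
rule_pred = X_valid["zip"].map(ppsf).fillna(global_ppsf) * X_valid["sqft"]
print("rule-based MAE:", mean_absolute_error(y_valid, rule_pred))
```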
  • Initial Model Proposal:

    • Closed-form OLS (analytic solution) for small-medium data: fast, interpretable, gives coefficient variance estimates.
    • Iterative solver (Batch/SGD) for large data where matrix inversion is infeasible.
    • Regularized linear models: Ridge (L2) for multicollinearity, Lasso (L1) if you expect sparsity in the features.
  • Concrete trade-offs for Linear Regression:

    • OLS gives an exact solution but matrix inversion costs O(d³), which is infeasible when d is large (e.g., many one-hot columns).
    • SGD/Mini-batch: scales to large n, supports streaming retraining, needs learning rate tuning.
    • Lasso can drive coefficients to exactly zero, which is useful for feature selection when you have engineered thousands of interaction terms (sketched below).
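A small synthetic demonstration of the Lasso point (the data and hyperparameters here are made up for illustration):

```python
# Sketch: Lasso prunes a blown-up interaction space back to the useful terms.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=5000)  # only 2 features matter

# Degree-2 expansion turns 20 features into 230 candidate terms.
pipe = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     StandardScaler(),
                     Lasso(alpha=0.05, max_iter=5000)).fit(X, y)
coef = pipe.named_steps["lasso"].coef_
print(f"Lasso kept {np.sum(coef != 0)} of {coef.size} terms")
```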
  • Advanced Iterations (when to move beyond linear):

    • Add polynomial features and strong regularization before moving to nonlinear models.
    • If residual analysis shows structured nonlinearity, consider tree-based models or neural nets.
  • Cold Start / New Categories:

    • Use global averages or hierarchical priors (global → city → neighborhood) and gradually back-fill with smoothed target encodings as observations accumulate (see the sketch below).
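A compact sketch of smoothed target encoding with a hierarchical fallback (train, price, city, and neighborhood are illustrative names; the pseudo-count m controls shrinkage toward the prior):

```python
# Sketch: shrink per-category means toward a parent prior; fall back up the
# hierarchy (neighborhood -> city -> global) for brand-new categories.
import pandas as pd

def smoothed_means(df: pd.DataFrame, key: str, target: str,
                   prior: float, m: float = 50.0) -> pd.Series:
    g = df.groupby(key)[target].agg(["mean", "count"])
    return (g["count"] * g["mean"] + m * prior) / (g["count"] + m)

global_mean = train["price"].mean()          # compute on training folds only
city_enc = smoothed_means(train, "city", "price", prior=global_mean)
hood_enc = smoothed_means(train, "neighborhood", "price", prior=global_mean)

def encode(row: pd.Series) -> float:
    # Unseen neighborhood -> city mean; unseen city -> global mean.
    return hood_enc.get(row["neighborhood"],
                        city_enc.get(row["city"], global_mean))
```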
ℹ️
When you pick a model, justify it with data size, feature dimensionality, and interpretability needs. Saying “use X because it’s powerful” without these constraints is weak. Also explicitly mention how you will get coefficient uncertainty (for OLS) if stakeholders need confidence intervals.

4. System Architecture & Serving

  • High-Level Design (tailored to linear regression):

    • Offline training pipeline → model artifact (weights, preprocessing metadata) → feature store → serving layer (batch precompute or online API).
    • For price suggestions: realtime API that loads preprocessing + weight vector and applies y = Xβ.
  • Batch vs Real-time trade-off:

    • Batch: Precompute predictions for all listings nightly. Good when features update slowly (e.g., tax data).
    • Real-time: Compute prediction at request time for user edits (changed sqft). Ensure preprocessors are low-latency (no heavy joins).
  • Feature Store & Consistency: Store canonical feature transforms; version them. At serve time, use the same scaler/encoders used in training.

  • Serving API Contract (example):

    • POST /predict_price body: { "sqft":..., "bedrooms":..., "zip":..., "listing_date":... }
    • Response: { "predicted_price":..., "std_error":..., "model_version": "v2025-09-22" }
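One hypothetical way to implement that contract (Flask and the artifact filename are assumptions; the key point is loading the same serialized pipeline used in training):

```python
# Sketch: realtime endpoint that applies the training pipeline verbatim.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_VERSION = "v2025-09-22"
model = joblib.load("price_model_v1.joblib")   # preprocessing + weights together

@app.route("/predict_price", methods=["POST"])
def predict_price():
    payload = request.get_json()
    X = pd.DataFrame([payload])                # one-row frame, training column names
    pred = float(model.predict(X)[0])
    return jsonify({"predicted_price": pred,
                    "std_error": None,         # populate from OLS covariance if needed
                    "model_version": MODEL_VERSION})
```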
  • Scalability considerations specific to regression:

    • Model size is tiny (weights vector) — cheap to distribute. Main cost is feature assembly (joins, encodings).
    • For heavy categorical encodings, precompute embedding / bucketed maps in the feature store to avoid runtime DB hits.
ℹ️
Talk me through the boxes: data lake → ETL → feature store → trainer → model artifact registry → online features + serving model. Explain how you keep preprocessing identical between train and serve. Diagrams in words are fine.

5. Training & Evaluation Strategy

  • Training Pipeline:

    • Daily/weekly batch job: extract last N months → featurize (with same pipeline) → train with cross-validation → evaluate → push to model registry if passes gates.
    • For huge datasets: sample stratified by key variable (city) or use online learning (SGD) to continuously update weights.
  • Offline Evaluation Metrics (aligned to goal):

    • Regression metrics: RMSE, MAE, R²; for skewed targets prefer MAE or median absolute error.
    • Business metrics: percent of predictions within ±10% of true price; MAPE for revenue.
    • Calibration: predicted vs actual residual histograms, prediction intervals coverage.
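A small sketch of an evaluation helper that reports both the statistical and business-facing metrics (assumes strictly positive targets, so the percentage metrics are well defined):

```python
# Sketch: regression metrics plus the business metric "within ±10% of truth".
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, tol: float = 0.10) -> dict:
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)   # assumes y_true != 0
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
        "mape": float(np.mean(rel_err)),
        "pct_within_10": float(np.mean(rel_err <= tol)),
    }
```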
  • Data Splitting Strategy:

    • Time-based split for forecasting tasks (train on days 1..T, validate on days T+1..T+k).
    • Grouped split if data has grouped correlations (e.g., house transactions grouped by neighborhood) to avoid leakage.
  • Model Selection & Hyperparameter Tuning:

    • Grid/Random search for regularization strength (α in Ridge/Lasso).
    • Use cross-validation folds that respect time/order or groups.
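A minimal sketch of that search with order-respecting folds (the alpha grid is illustrative):

```python
# Sketch: tune regularization strength without leaking future data into the past.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=TimeSeriesSplit(n_splits=5),   # for grouped data, use GroupKFold and pass groups= to fit()
    scoring="neg_mean_absolute_error",
)
# search.fit(X_train, y_train); search.best_params_["alpha"] is the winner
```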
  • Diagnostic Checks (specific to linear regression):

    • Residual plots vs predicted and vs each feature: look for heteroscedasticity or nonlinearity.
    • VIF (Variance Inflation Factor) to detect multicollinearity.
    • Outlier influence: leverage and Cook’s distance to spot high-impact points.
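These checks are a few lines with statsmodels (X is assumed to be the feature DataFrame and y the target):

```python
# Sketch: OLS fit with diagnostics for multicollinearity and influential points.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc = sm.add_constant(X)                     # add the intercept column
ols = sm.OLS(y, Xc).fit()
print(ols.summary())                        # coefficients, std errors, p-values

# VIF > ~10 is a common multicollinearity red flag (skip the constant column).
vif = {col: variance_inflation_factor(Xc.values, i)
       for i, col in enumerate(Xc.columns) if col != "const"}

cooks_d = ols.get_influence().cooks_distance[0]   # large values = high-impact rows
```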

Note: Report both statistical significance (p-values) and practical significance (effect size). In production, stable predictive power often wins over marginal p-value improvements.

6. Monitoring & Failsafes

  • Performance Monitoring (what to track for linear regression):

    • Model metrics: daily RMSE/MAE on a held-out streaming validation sample; distribution of residuals.
    • Input drift: monitor feature distributions (mean, std) and cardinality of categorical features (new zip codes).
    • Prediction health: fraction of predictions with very large residuals; sudden shifts in mean prediction.
    • System metrics: latency of feature assembly and prediction, error rate of API.
  • Drift & Alerting Strategy:

    • Statistical tests: KL divergence or population stability index (PSI) for feature drift.
    • Residual drift: track mean residual over time — steady bias indicates concept drift.
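A self-contained PSI sketch for a single numeric feature (the 0.1/0.25 thresholds below are common rules of thumb, not universal constants):

```python
# Sketch: Population Stability Index between a training baseline and live data.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                  # catch out-of-range values
    p = np.histogram(baseline, bins=edges)[0] / len(baseline)
    q = np.histogram(current, bins=edges)[0] / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))

# Common reading: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
```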
  • Failsafes & Fallbacks:

    • If pipeline fails or predictions spike: fallback to simple rule (neighborhood average) or last known good model.
    • Canary deployments: route small % of traffic to new model to compare production residuals before full rollout.
  • Retraining Triggers:

    • Scheduled retrain (e.g., weekly) + event-based retrain if drift metric exceeds threshold.
  • Explainability & Auditing:

    • Log feature vectors and prediction reasons (top contributing features) for samples flagged by stakeholders.
⚠️
Always design a human-readable fallback. A linear model's small size helps, but feature pipelines break often; a broken pipeline silently feeding garbage inputs is worse than returning a safe default.

7. Iteration & A/B Testing

  • A/B Testing Design for Regression Outputs:

    • Randomize users or listings into control (current system) vs treatment (new regression model).
    • Primary evaluation: business metric (e.g., conversion rate after showing predicted price, revenue uplift). Secondary: prediction accuracy (MAE) and user-facing KPIs.
    • Choose appropriate statistical tests for continuous outcomes (compare means with t-test / bootstrap; consider metric variance when computing sample sizes).
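A sketch of both tests, assuming treatment and control are NumPy arrays of the per-unit metric:

```python
# Sketch: Welch's t-test plus a bootstrap CI for the difference in means.
import numpy as np
from scipy import stats

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

rng = np.random.default_rng(0)
diffs = np.array([
    rng.choice(treatment, size=len(treatment)).mean()
    - rng.choice(control, size=len(control)).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])   # 95% CI for the uplift
```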
  • Ramp & Safety:

    • Start with small percent traffic → monitor key metrics (residuals, business metrics) → ramp progressively.
  • Feedback Loop:

    • Capture interactions (accept/reject suggested price, final sale price) and append to training store with timestamps and metadata.
    • Use logged production data for periodic re-training or online updates.
  • Iterative Improvements (linear-specific):

    • Add targeted interaction/polynomial terms that fix observed residual patterns.
    • If residuals show non-linear structure, try piecewise linear models (segmented regressions) before jumping to complex models.
  • Postmortem & Learning:

    • After each A/B test, analyze feature importance, residual slices (by zip, price band), and limitations to prioritize next engineering work.