Feature Scaling: Linear Regression


🎯 Core Idea

Feature scaling transforms features so they lie on comparable ranges. The closed-form Ordinary Least Squares (OLS) solution is unaffected in the sense that its predictions do not change under scaling (only the coefficients rescale), but scaling is critical for the two cases below (illustrated in the sketch after the list):

  • Gradient Descent convergence (step size depends on feature magnitude).
  • Regularization penalties (L1/L2 shrinkage depends on absolute feature values).
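
To make both points concrete, here is a minimal sketch assuming scikit-learn and NumPy; the synthetic salary/age data and the Ridge alpha are illustrative assumptions, not values from this article. Closed-form OLS gives identical predictions with or without scaling, while the L2-penalized fit does not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales: "salary" (tens of thousands)
# and "age" (tens), as in the currency analogy below.
n = 200
salary = rng.normal(50_000, 15_000, n)
age = rng.normal(40, 10, n)
X = np.column_stack([salary, age])
y = 0.0002 * salary + 0.5 * age + rng.normal(0, 1, n)

X_scaled = StandardScaler().fit_transform(X)

# Closed-form OLS: predictions are identical with or without scaling
# (only the coefficients rescale).
ols_raw = LinearRegression().fit(X, y)
ols_std = LinearRegression().fit(X_scaled, y)
print(np.allclose(ols_raw.predict(X), ols_std.predict(X_scaled)))      # True

# Ridge (L2 penalty): the fit changes, because the penalty treats a
# coefficient measured per dollar and one measured per year the same way.
ridge_raw = Ridge(alpha=1.0).fit(X, y)
ridge_std = Ridge(alpha=1.0).fit(X_scaled, y)
print(np.allclose(ridge_raw.predict(X), ridge_std.predict(X_scaled)))  # typically False
```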

🌱 Intuition & Real-World Analogy

  • Why scaling matters: Imagine walking downhill on a tilted landscape. If one axis stretches much longer than another, your steps zig-zag instead of moving smoothly toward the valley (minimum). That's gradient descent with unscaled features.
  • Analogy 1 (shoes): If one shoe is a sneaker and the other a ski boot, youโ€™ll struggle to walk straight. Scaling makes both shoes the same size.
  • Analogy 2 (currency): Suppose one feature is “salary in dollars” (10,000s) and another is “age in years” (tens). Without scaling, salary dominates simply because of its units, not its predictive power. Scaling puts them in fair competition.

๐Ÿ“ Mathematical Foundation

  1. Gradient Descent Dependence on Scale

    • Gradient step:
    $$ \theta_j^{(t+1)} = \theta_j^{(t)} - \eta \cdot \frac{\partial J}{\partial \theta_j} $$

    If the features $x_j$ have very different magnitudes, then $\frac{\partial J}{\partial \theta_j}$ differs wildly across dimensions, producing slow, zig-zag convergence (see the gradient-descent sketch after this list).

  2. Standardization (most common):

    $$ x'_i = \frac{x_i - \mu}{\sigma} $$
    • $x_i$: original feature value
    • $\mu$: mean of the feature
    • $\sigma$: standard deviation of the feature

    Result: mean = 0, variance = 1.

  3. Normalization (Min-Max scaling):

    $$ x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} $$

    Maps values into $[0,1]$. Sensitive to outliers.

  4. Impact on Regularization (L1/L2):

    • Ridge: $\lambda \sum_j \theta_j^2$
    • Lasso: $\lambda \sum_j |\theta_j|$

    Without scaling, the penalty is applied unevenly: a feature measured on a large scale needs only a small coefficient, so it is barely penalized, while a feature on a small scale needs a large coefficient and is shrunk heavily. Scaling puts all coefficients on the same footing, ensuring fair shrinkage.

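As a sketch of the formulas above, the NumPy snippet below implements standardization and min-max scaling directly from the definitions and then runs plain batch gradient descent on raw versus standardized features; the synthetic data, learning rates, and iteration count are illustrative assumptions:

```python
import numpy as np

def standardize(X):
    """x' = (x - mean) / std, computed per feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    """x' = (x - min) / (max - min), mapping each feature into [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def gradient_descent(X, y, lr, n_iter=5000):
    """Plain batch gradient descent on the mean-squared-error cost."""
    Xb = np.column_stack([np.ones(len(X)), X])          # prepend intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)     # dJ/dtheta for MSE
        theta -= lr * grad
    return theta, np.mean((Xb @ theta - y) ** 2)

rng = np.random.default_rng(0)
salary = rng.normal(50_000, 15_000, 200)    # large-magnitude feature
age = rng.normal(40, 10, 200)               # small-magnitude feature
X = np.column_stack([salary, age])
y = 0.0002 * salary + 0.5 * age + rng.normal(0, 1, 200)

print(min_max(X).min(axis=0), min_max(X).max(axis=0))   # each feature now spans [0, 1]

# On raw features the largest stable learning rate is set by the salary scale,
# so the age and intercept directions crawl; on standardized features one
# moderate step size works for every direction.
_, mse_raw = gradient_descent(X, y, lr=1e-10)
_, mse_std = gradient_descent(standardize(X), y, lr=0.1)
print(f"MSE after 5000 steps - raw: {mse_raw:.2f}, standardized: {mse_std:.2f}")
```

The contrast in the final print is the whole point: with unscaled features the step size must be kept tiny to stay stable in the dominant direction, so the loss is still far from its minimum after thousands of iterations, while the standardized run converges quickly.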

โš–๏ธ Strengths, Limitations & Trade-offs

Strengths

  • Faster, more stable gradient descent.
  • Prevents dominance of high-range features.
  • Ensures fair effect of regularization.

Limitations

  • Not needed for OLS closed-form (normal equation).
  • Choice of scaling method (standardization vs. normalization) can affect model behavior.
  • Outlier sensitivity (min-max scaling heavily distorted by extremes).

Trade-offs

  • Standardization vs Normalization: standardization handles outliers better, but normalization is bounded and interpretable.

๐Ÿ” Variants & Extensions

  • Robust Scaling: Uses the median and interquartile range (IQR), making it more resistant to outliers (see the sketch after this list):

    $$ x'_i = \frac{x_i - \text{median}}{\text{IQR}} $$
  • Unit Vector Scaling (Normalization to length 1): Useful in text embeddings or cosine similarity contexts:

    $$ x' = \frac{x}{\|x\|_2} $$
  • Adaptive Scaling (e.g., Batch Normalization in deep learning): normalizes intermediate activations using per-mini-batch statistics during training.

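A sketch of the first two variants, assuming scikit-learn is available; RobustScaler and Normalizer are its built-in counterparts to the formulas above, and the toy data with an injected outlier is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, Normalizer

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
X[0, 0] = 1_000.0                           # inject a single extreme outlier

# Robust scaling: x' = (x - median) / IQR, per feature (column).
median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
X_robust = (X - median) / iqr
print(np.allclose(X_robust, RobustScaler().fit_transform(X)))        # True

# Unit-vector scaling: each sample (row) is divided by its L2 norm, so only
# its direction is kept -- useful before cosine-similarity computations.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(X_unit, Normalizer(norm="l2").fit_transform(X)))   # True
```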

🚧 Common Challenges & Pitfalls

  • Mistaking necessity: Candidates often think scaling changes the OLS fit; the predictions stay the same, but the coefficients rescale, which changes how they are interpreted.
  • Forgetting regularization effects: Without scaling, one feature may be unfairly penalized more than another.
  • Improper pipeline handling: Scaling must be fit only on training data (leakage risk if test data is scaled using its own statistics); see the pipeline sketch after this list.
  • Over-scaling categorical features: One-hot encoded features are already 0/1 โ€” scaling them often hurts interpretability.
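
As a leakage-safe pattern for the pipeline pitfall above, here is a minimal sketch assuming scikit-learn; the scaler sits inside a Pipeline so its statistics are learned from the training split only (the data and alpha are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50_000, 15_000, 300), rng.normal(40, 10, 300)])
y = 0.0002 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler lives inside the pipeline, so its mean and std are learned from
# the training split only; the test split is transformed with those same
# statistics, which avoids leaking test-set information into preprocessing.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"test R^2: {model.score(X_test, y_test):.3f}")
```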

📚 Reference Pointers

  • The Elements of Statistical Learning (Hastie, Tibshirani & Friedman): scaling in regression
  • Deep Learning (Goodfellow, Bengio & Courville): normalization effects in optimization (Chapter 8)
  • Batch Normalization paper (Ioffe & Szegedy, 2015): arXiv:1502.03167
  • Scikit-learn preprocessing documentation for practical variants