Core Machine Learning — Foundational Theory


1️⃣ Bias–Variance Tradeoff

Note

The Top Tech Interview Angle:
This is the most fundamental concept in model generalization. Interviewers use it to check whether you understand the mathematical roots of overfitting and how to balance complexity with performance.
Your ability to explain this tradeoff with both intuition and equations is a core differentiator in top tech interviews.

Learning Steps

  1. Understand the Error Decomposition:
    The total prediction error can be decomposed as:

    \[ E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

    where:

    • Bias: Error from overly simple assumptions in the model.
    • Variance: Error from the model’s sensitivity to small changes in the training data.
    • Irreducible Error: Noise inherent in the data that no model can remove.
  2. Develop Intuition:

    • High bias → Model is too simple (e.g., fitting a straight line to nonlinear data).
    • High variance → Model is too flexible (e.g., a high-degree polynomial that fits the noise).
    • Ideal tradeoff → Model captures the underlying signal, not the noise.
  3. Practice Visualization (see the sketch after this list):

    • Plot train vs. validation error for models of increasing complexity.
    • Observe how validation error first decreases (bias ↓) and then increases (variance ↑).
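
A minimal sketch of the visualization step, assuming synthetic sine-wave data and a scikit-learn PolynomialFeatures pipeline (both are illustrative choices, not part of the original material):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data: y = sin(2*pi*x) + noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep model complexity (polynomial degree) and record train vs. validation error
for degree in range(1, 13):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    # Low degrees: both errors high (bias). High degrees: validation error rises (variance).
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

Plotting the two errors against degree shows the characteristic U-shape of validation error as complexity grows.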

Deeper Insights & Probing Questions
“What happens when your model has high training accuracy but low validation accuracy?”
→ That’s high variance. The model has memorized the training data rather than learning the underlying pattern.
“How do you reduce variance?”
→ Collect more data, apply regularization, or simplify the model.


2️⃣ Overfitting vs. Underfitting

Note

The Top Tech Interview Angle:
This concept tests whether you can interpret model diagnostics (learning curves, validation scores) and reason about how to fix them.
Top interviewers expect you to connect these terms directly to the bias–variance tradeoff and regularization.

Learning Steps

  1. Define & Identify:

    • Overfitting: Model captures noise → excellent on train, poor on test.
    • Underfitting: Model fails to capture the signal → poor on both train & test.
  2. Use Learning Curves (see the sketch after this list):

    • Plot training & validation loss vs. epochs or complexity.
    • Overfitting → training loss ↓, validation loss ↑.
    • Underfitting → both losses high.
  3. Mitigation Strategies:

    • Overfitting → Add regularization, collect more data, apply dropout (in DL), reduce complexity.
    • Underfitting → Add more features, use a more flexible model, or train longer.
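
A minimal sketch of the learning-curve diagnostic, assuming scikit-learn's learning_curve helper and an illustrative synthetic dataset and model (neither comes from the original text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Illustrative dataset; substitute your own
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# learning_curve fits the model on growing subsets of the data and cross-validates each fit
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy",
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Large train/validation gap -> overfitting; both scores low -> underfitting
    print(f"n={n:4d}  train acc={tr:.3f}  val acc={va:.3f}")
```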

Deeper Insights & Probing Questions
“How would you identify overfitting without a validation set?”
→ Use cross-validation or regularization penalties as proxies.
“If a model overfits, does that mean it’s useless?”
→ Not always. In production, regularization and calibration can recover generalization.


3️⃣ L1 vs L2 Regularization

Note

The Top Tech Interview Angle:
Regularization questions probe your mathematical maturity and understanding of optimization.
You’ll be asked to explain why L1 induces sparsity, how L2 stabilizes weights, and how both affect gradient descent convergence.

Learning Steps

  1. Formulate the Objective Function:
    Regularization adds a penalty term to discourage large coefficients:

    \[ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda R(\theta) \]

    where \(R(\theta) = \|\theta\|_1\) for L1 (Lasso) and \(R(\theta) = \|\theta\|_2^2\) for L2 (Ridge).

  2. Understand the Geometric Intuition:

    • L1: Diamond-shaped constraint → promotes zero weights → sparsity.
    • L2: Circular constraint → shrinks all weights smoothly → stability.
  3. Code From Scratch (NumPy):
    Implement the gradient updates for both (a sketch follows this list):

    • L2 gradient: \( \nabla J = \frac{1}{m} X^T(X\theta - y) + 2\lambda\theta \)
    • L1 subgradient: \( \nabla J = \frac{1}{m} X^T(X\theta - y) + \lambda \, \text{sign}(\theta) \) (the L1 term is not differentiable at zero, so \(\text{sign}(\theta)\) is a subgradient)
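
A from-scratch NumPy sketch of the updates above; the learning rate, iteration count, and toy data are illustrative assumptions:

```python
import numpy as np

def regularized_gradient_descent(X, y, lam=0.1, penalty="l2", lr=0.01, n_iters=1000):
    """Linear regression with an L1 or L2 penalty, trained by (sub)gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m          # gradient of the squared-error term
        if penalty == "l2":
            grad += 2 * lam * theta               # derivative of lam * ||theta||_2^2
        else:  # "l1"
            grad += lam * np.sign(theta)          # subgradient of lam * ||theta||_1
        theta -= lr * grad
    return theta

# Toy data: y depends only on the first two features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

print("L2 weights:", np.round(regularized_gradient_descent(X, y, penalty="l2"), 3))
print("L1 weights:", np.round(regularized_gradient_descent(X, y, penalty="l1"), 3))
```

Note that plain subgradient descent drives irrelevant weights toward zero but rarely exactly to zero; practical Lasso solvers use coordinate descent or proximal (soft-thresholding) updates to obtain exact sparsity.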

Deeper Insights & Probing Questions
“When would you prefer L1 over L2?”
→ When feature selection or sparsity is desired.
“Can we combine both?”
→ Yes. Elastic Net combines both penalties to balance shrinkage (L2) and sparsity (L1); see the snippet below.
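
A minimal Elastic Net sketch with scikit-learn; the alpha and l1_ratio values and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data: only the first two features carry signal
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha scales the total penalty; l1_ratio splits it between L1 (sparsity) and L2 (shrinkage)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 3))  # irrelevant coefficients are shrunk toward (often exactly to) zero
```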


4️⃣ Cross-Validation

Note

The Top Tech Interview Angle:
Cross-validation tests whether you truly understand data leakage prevention and robust performance estimation.
You’re expected to discuss the bias–variance of validation estimates, how to choose k, and computational trade-offs.

Learning Steps

  1. Understand the Purpose:
    CV estimates model generalization on unseen data using multiple train–test splits.

    • Common types: K-Fold, Stratified K-Fold, Leave-One-Out.
  2. K-Fold Process:

    • Split dataset into K parts.
    • Train on (K−1) folds, validate on the remaining fold.
    • Repeat for all folds, then average results.
  3. Implementation (see the sketch after this list):

    • Use sklearn.model_selection.KFold or StratifiedKFold.
    • Practice nested CV for hyperparameter tuning.
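
A minimal sketch of steps 2 and 3, assuming scikit-learn's breast-cancer toy dataset and a logistic-regression pipeline (both are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Plain stratified 5-fold CV: every fold keeps the original class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("fold scores:", scores.round(3), "mean:", scores.mean().round(3))

# Nested CV: the inner loop tunes C, the outer loop estimates generalization of the tuned model
inner = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=3)
nested_scores = cross_val_score(inner, X, y, cv=cv)
print("nested CV mean accuracy:", nested_scores.mean().round(3))
```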

Deeper Insights & Probing Questions
“Why is StratifiedKFold preferred for classification?”
→ It preserves class balance across folds.
“Why might validation scores vary widely across folds?”
→ Data heterogeneity, small datasets, or model instability.


5️⃣ Evaluation Metrics

Note

The Top Tech Interview Angle:
This section tests your model interpretation skills — not just computing metrics, but knowing which metric fits which problem.
Expect scenarios like: “Your model has 95% accuracy but poor recall — why?”

Learning Steps

  1. Classification Metrics Overview (computed in the sketch after this list):

    | Metric | Formula | Interpretation |
    | --- | --- | --- |
    | Accuracy | \( \frac{TP + TN}{TP + FP + TN + FN} \) | Overall correctness |
    | Precision | \( \frac{TP}{TP + FP} \) | Fraction of positive predictions that are correct |
    | Recall (Sensitivity) | \( \frac{TP}{TP + FN} \) | Fraction of actual positives detected |
    | F1-Score | \( 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \) | Balance between precision & recall |
    | ROC-AUC | Area under the ROC curve | Ranking ability of the classifier |
  2. Trade-offs:

    • High precision → few false positives.
    • High recall → few false negatives.
    • Choose based on business cost (e.g., spam filter vs. cancer detection).
  3. Regression Metrics:

    • MSE, RMSE, MAE, MAPE, R².
    • Know when each is appropriate (e.g., MAPE is undefined when y contains zeros).
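
A quick sketch computing the classification metrics from the table above with scikit-learn; the toy label and score vectors are illustrative assumptions:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Toy imbalanced example: 1 = positive class
y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.4, 0.9, 0.45, 0.8]  # predicted P(y = 1)

print("confusion matrix (rows = true, cols = predicted):\n", confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))  # needs scores/probabilities, not hard labels
```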

Deeper Insights & Probing Questions
“When do you prefer F1 over Accuracy?”
→ For imbalanced data.
“Why might ROC-AUC be misleading on skewed datasets?”
→ Because with heavy class imbalance the false-positive rate stays small even when many negatives are misclassified, so ROC-AUC can look strong while precision (and PR-AUC) remains poor; a short example follows.
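
A quick sketch of this effect, assuming an illustrative synthetic dataset with a 1% positive class; ROC-AUC tends to look strong while average precision (PR-AUC) stays much lower:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced data: ~1% positives
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC-AUC:", round(roc_auc_score(y_te, probs), 3))              # usually high
print("PR-AUC :", round(average_precision_score(y_te, probs), 3))    # usually much lower
```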
