1. Linear Regression


πŸ“ Flashcards

⚑ Short Theories

Regression predicts continuous outputs; classification predicts discrete labels.

Overfitting = high variance, underfitting = high bias.

Gradient descent iteratively adjusts weights to minimize the cost function.

Batch GD uses all data per step; SGD uses one sample; mini-batch balances both.

Scikit-learn’s LinearRegression uses OLS; SGDRegressor uses iterative GD.

🎀 Interview Q&A

Q1: What is Linear Regression and when is it used?

🎯 TL;DR: Linear Regression models the relationship between input features and a continuous target by fitting a linear function.


🌱 Conceptual Explanation

It assumes that the dependent variable can be expressed as a weighted sum of independent variables plus an error term. Think of it like drawing the best straight line through a scatter plot of data.

πŸ“ Technical / Math Details

Univariate case:

$$ \hat{y} = w_0 + w_1x $$


Multivariate case:

$$ \hat{y} = w^T x $$


where $w_0$ is the bias, $w_1, w_2, \dots, w_n$ are the weights, and $x$ is the feature vector.
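
As a minimal sketch (toy numbers, not from any real dataset), the multivariate form reduces to a dot product once the bias is folded into the weight vector:

```python
import numpy as np

# Hypothetical fitted parameters: bias w0 followed by weights w1, w2
w = np.array([2.0, 0.5, -1.3])

# Two samples; the leading 1.0 in each row lets w0 act as the intercept
X = np.array([[1.0, 3.0, 0.5],
              [1.0, 1.2, 2.0]])

# y_hat = w^T x, computed for both rows at once
y_hat = X @ w
print(y_hat)  # [2.85 0.  ]
```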

βš–οΈ Trade-offs & Production Notes

  • Fast, interpretable, baseline model.
  • Struggles with non-linear relationships.
  • Sensitive to outliers.

🚨 Common Pitfalls

  • Forgetting to scale features before applying GD.
  • Ignoring multicollinearity.

πŸ—£ Interview-ready Answer

“Linear Regression predicts continuous outcomes by fitting a linear equation between features and target; it’s simple, interpretable, but limited for non-linear data.”

Q2: Explain the cost function in Linear Regression.

🎯 TL;DR: The cost function (MSE) measures average squared error between predicted and true values; we minimize it.


🌱 Conceptual Explanation

The cost function quantifies model error. By minimizing it, the model finds the line of best fit.

πŸ“ Technical / Math Details

$$ J(w) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 $$
  • $m$: number of samples
  • $\hat{y}^{(i)}$: predicted value
  • $y^{(i)}$: true value
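
A short NumPy sketch of this cost on made-up values (purely illustrative):

```python
import numpy as np

def mse_cost(y_hat, y):
    """J(w) = (1 / 2m) * sum of squared residuals."""
    m = len(y)
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Toy values purely for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print(mse_cost(y_pred, y_true))  # (0.25 + 0.25 + 1.0) / (2 * 3) = 0.25
```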

βš–οΈ Trade-offs & Production Notes

  • Convex β†’ guarantees global minimum.
  • Sensitive to outliers since errors are squared.

🚨 Common Pitfalls

  • Confusing cost with evaluation metrics.
  • Using non-scaled features β†’ slow GD convergence.

πŸ—£ Interview-ready Answer

“The cost function in Linear Regression is mean squared error, which penalizes large deviations by squaring residuals.”

Q3: How does Gradient Descent work in Linear Regression?

🎯 TL;DR: Gradient descent iteratively updates weights by moving in the opposite direction of the gradient of the cost function.


🌱 Conceptual Explanation

It’s like walking downhill blindfoldedβ€”each step follows the slope until you reach the valley (minimum cost).

πŸ“ Technical / Math Details

Update rule:

$$ w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j} $$
  • $\alpha$: learning rate
  • $\frac{\partial J}{\partial w_j}$: gradient
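
A compact batch gradient-descent loop for this update rule on toy data (the learning rate and iteration count are illustrative, not tuned):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch GD on the (1/2m) MSE cost; X must include a bias column of 1s."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        y_hat = X @ w
        grad = (X.T @ (y_hat - y)) / m  # dJ/dw
        w -= alpha * grad               # step opposite the gradient
    return w

# Toy data from y = 1 + 2x, so the loop should recover roughly [1, 2]
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
print(gradient_descent(X, y))  # approximately [1. 2.]
```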

βš–οΈ Trade-offs & Production Notes

  • Batch GD: stable, but slow for large data.
  • SGD: faster, but noisier.
  • Mini-batch: balance between both.
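
To make the mini-batch variant concrete, a small sketch that reuses the same gradient step but computes it on a random subset of rows each iteration (batch size and seed are arbitrary choices):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, n_iters=2000, batch_size=16, seed=0):
    """Mini-batch GD: each update uses a random subset of the data."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        idx = rng.choice(m, size=min(batch_size, m), replace=False)
        Xb, yb = X[idx], y[idx]
        grad = (Xb.T @ (Xb @ w - yb)) / len(idx)  # gradient on the mini-batch only
        w -= alpha * grad
    return w
```

With `batch_size=1` this reduces to SGD; with `batch_size=m` it is plain batch GD.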

🚨 Common Pitfalls

  • Learning rate too small β†’ slow.
  • Too large β†’ divergence.

πŸ—£ Interview-ready Answer

“Gradient Descent reduces error by updating weights opposite the gradient of the cost until convergence.”

Q4: What’s the difference between Univariate and Multivariate Linear Regression?

🎯 TL;DR: Univariate uses one feature; multivariate uses multiple features to predict the target.


🌱 Conceptual Explanation

Univariate draws a line in 2D space; multivariate draws a hyperplane in higher dimensions.

πŸ“ Technical / Math Details

  • Univariate:
    $$ \hat{y} = w_0 + w_1x $$
  • Multivariate:
    $$ \hat{y} = w^T x $$

βš–οΈ Trade-offs & Production Notes

  • Multivariate captures richer relationships.
  • Risk of multicollinearity with many features.

🚨 Common Pitfalls

  • Using too many irrelevant features β†’ overfitting.

πŸ—£ Interview-ready Answer

“Univariate Linear Regression predicts with one feature, while multivariate uses multiple features, forming a hyperplane.”

Q5: Explain Overfitting vs Underfitting in Regression.

🎯 TL;DR: Overfitting β†’ memorizes training data; underfitting β†’ too simple to learn patterns.


🌱 Conceptual Explanation

Overfit = model too complex, performs poorly on new data. Underfit = model too simple, fails even on training data.

πŸ“ Technical / Math Details

  • Overfit: high variance.
  • Underfit: high bias.

βš–οΈ Trade-offs & Production Notes

  • Use regularization (L1/L2) to combat overfitting.
  • Add features or complexity to reduce underfitting.

🚨 Common Pitfalls

  • Evaluating only on training set.

πŸ—£ Interview-ready Answer

“Overfitting means the model is too complex and generalizes poorly; underfitting means it’s too simple to capture patterns.”

Q6: How does Scikit-learn implement Linear Regression?

🎯 TL;DR: LinearRegression uses closed-form OLS; SGDRegressor uses iterative optimization.


🌱 Conceptual Explanation

Scikit-learn provides both analytical and iterative implementations for Linear Regression, making it easy to use in practice.

πŸ“ Technical / Math Details

  • LinearRegression: solves $$ w = (X^TX)^{-1}X^Ty $$
  • SGDRegressor: applies gradient descent updates.

βš–οΈ Trade-offs & Production Notes

  • OLS is fast for small/medium data.
  • SGD scales better for huge datasets.

🚨 Common Pitfalls

  • Not normalizing input data before SGD.

πŸ—£ Interview-ready Answer

“Scikit-learn’s LinearRegression uses closed-form OLS, while SGDRegressor applies gradient descent for scalability.”

πŸ“ Key Formulas

Univariate Regression Equation
$$ \hat{y} = w_0 + w_1 x $$
  • $w_0$: bias/intercept
  • $w_1$: slope/weight
  • $x$: input feature
    Interpretation: Models output as a straight line function of input.
Multivariate Regression Equation
$$ \hat{y} = w^T x $$
  • $w$: weight vector
  • $x$: feature vector (with bias term $x_0=1$)
    Interpretation: Generalizes line to a hyperplane in higher dimensions.
Mean Squared Error (Cost Function)
$$ J(w) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 $$
  • $m$: number of samples
  • $y^{(i)}$: true output
  • $\hat{y}^{(i)}$: predicted output
    Interpretation: Average squared deviation between predictions and truth; convex, easy to optimize.
Gradient Descent Update Rule
$$ w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j} $$
  • $\alpha$: learning rate
  • $\frac{\partial J}{\partial w_j}$: gradient for parameter $w_j$
    Interpretation: Iteratively adjusts weights to minimize cost function.

βœ… Cheatsheet

  • Regression predicts continuous outputs; classification β†’ discrete.
  • Overfitting = high variance, underfitting = high bias.
  • Univariate: one feature β†’ straight line; multivariate: multiple features β†’ hyperplane.
  • Cost function: MSE, convex, optimized via gradient descent.
  • Batch vs SGD vs Mini-batch: trade-off between stability and speed.
  • Scikit-learn: LinearRegression (OLS), SGDRegressor (gradient descent).