Logistic Regression
🤖 Core ML Fundamentals
Note
The Top Tech Company Angle (Logistic Regression):
This topic is a litmus test for understanding probabilistic modeling, optimization, and decision boundaries. Interviewers use it to assess whether you can move beyond linear regression to handle classification problems while reasoning about log-odds, sigmoid transformations, and maximum likelihood estimation (MLE).
Candidates who deeply understand logistic regression demonstrate mastery over the bridge between traditional statistics and modern machine learning.
1.1: Master the Intuition and Core Theory
- Start with the problem setup: Unlike Linear Regression, where outputs are continuous, Logistic Regression predicts probabilities — bounded between 0 and 1.
Formula:
\[ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_nx_n)}} \]
- Understand the log-odds transformation:
\[ \log\left(\frac{p}{1-p}\right) = X\beta \]
This maps probabilities onto the entire real line, so the log-odds can be modeled as a linear function of the features.
- Build intuition for the sigmoid curve: small input changes near 0 produce large probability swings (high sensitivity), while the curve saturates far from 0 (see the sketch after this list).
- Recognize that logistic regression is a discriminative model, directly modeling \( P(y|x) \) — not \( P(x, y) \).
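A minimal NumPy sketch of the sigmoid and its inverse (the logit); the sample scores are arbitrary and only meant to show the squashing and recovery behavior.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return np.log(p / (1.0 - p))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # arbitrary linear scores
p = sigmoid(z)
print(p)          # probabilities squashed into (0, 1); steepest change near z = 0
print(logit(p))   # recovers the original scores, i.e. the log-odds
```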
Deeper Insight:
In interviews, you may be asked: “Why can’t we just use Linear Regression for classification?”
The key lies in the bounded output and the non-linear relationship between features and probabilities. Logistic Regression ensures predicted probabilities stay between 0 and 1, avoiding nonsensical predictions (like -0.3 probability).
1.2: Dive into the Cost Function — The Log-Likelihood
- Linear Regression minimizes MSE; Logistic Regression maximizes the likelihood of observing the given labels.
- The cost (to minimize) is the negative log-likelihood (cross-entropy):
\[ J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \big[y_i\log(\hat{y_i}) + (1 - y_i)\log(1 - \hat{y_i})\big] \]
- Understand why we use this:
- MSE would produce non-convex optimization for logistic regression.
- The log-likelihood gives a convex function, ensuring a unique global minimum.
- Grasp the gradient derivation:
\[ \frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_i (\hat{y_i} - y_i) x_{ij} \]
Notice the similarity to linear regression’s gradient — it’s just the error term adjusted for probabilities.
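A short NumPy sketch of this cost and its gradient; the toy data, the `log_loss_and_grad` name, and the clipping constant are illustrative choices, not a canonical implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_and_grad(beta, X, y, eps=1e-12):
    """Negative log-likelihood (cross-entropy) and its gradient for logistic regression."""
    m = X.shape[0]
    y_hat = sigmoid(X @ beta)
    y_hat = np.clip(y_hat, eps, 1 - eps)          # guard against log(0)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    grad = X.T @ (y_hat - y) / m                  # same form as the linear-regression gradient
    return loss, grad

# Toy check: at beta = 0 every prediction is 0.5, so the loss equals log(2) ~ 0.693
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
loss, grad = log_loss_and_grad(np.zeros(3), X, y)
print(round(loss, 4), grad)
```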
Probing Question:
“Why is the log-likelihood convex for logistic regression but not for other models like neural networks?”
Hint: The log-loss is convex in the linear score \( X\beta \), and the model is linear in its parameters, so the composed objective stays convex; a neural network's hidden layers make the loss non-convex in the weights.
1.3: Train Using Gradient Descent (From Scratch)
- Implement the training loop using NumPy: initialize weights, compute the sigmoid, calculate the loss, and update weights.
- Apply batch, stochastic, and mini-batch gradient descent. Be ready to explain trade-offs:
- Batch: stable but slower.
- Stochastic: noisy but fast.
- Mini-batch: sweet spot.
- Derive the vectorized update step:
\[ \beta := \beta - \alpha \cdot \frac{1}{m} X^T (\hat{y} - y) \]
- Validate by comparing against `sklearn.linear_model.LogisticRegression` outputs (see the sketch below).
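A from-scratch batch gradient descent sketch with a rough comparison against `sklearn`; the learning rate, iteration count, and synthetic data are illustrative, and the large `C` merely approximates an unregularized fit, so the coefficients should be comparable rather than identical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on the negative log-likelihood."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])          # prepend a bias column
    beta = np.zeros(n + 1)
    for _ in range(n_iters):
        y_hat = sigmoid(Xb @ beta)
        beta -= lr * (Xb.T @ (y_hat - y)) / m     # vectorized update from above
    return beta

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
true_scores = 0.3 + 1.5 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(500) < sigmoid(true_scores)).astype(float)   # labels drawn from a logistic model

beta_gd = fit_logistic_gd(X, y)
clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)   # huge C ~ effectively unregularized
print("from scratch:", beta_gd)
print("sklearn     :", np.r_[clf.intercept_, clf.coef_.ravel()])
```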
Deeper Insight:
A frequent follow-up: “What if your model converges too slowly?”
Discuss learning rate tuning, feature scaling, or using adaptive optimizers (though they’re more common in deep learning).
Showing awareness of computational trade-offs is a key signal of depth.
1.4: Interpretability and Coefficients
- Understand how to interpret coefficients in terms of log-odds:
Each coefficient \( \beta_j \) represents the change in the log-odds for a one-unit increase in \( x_j \), holding the other features fixed.
For interpretability:
\[ e^{\beta_j} = \text{odds ratio} \]
- Discuss scaling: coefficient interpretability depends on feature scaling; unscaled features distort relative importance (see the sketch after this list).
- Mention feature correlation pitfalls (multicollinearity) — it inflates variances and destabilizes coefficients.
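A small sketch of turning fitted coefficients into odds ratios on standardized features; the synthetic dataset and feature labels are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)   # scale first so coefficients are comparable

clf = LogisticRegression().fit(X, y)
for j, b in enumerate(clf.coef_.ravel()):
    # exp(beta_j): multiplicative change in the odds per one-unit (here, one-std) increase in x_j
    print(f"x{j}: beta = {b:+.3f}, odds ratio = {np.exp(b):.3f}")
```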
Probing Question:
“Your model gives a negative coefficient for a feature you expected to be positive. Why might that happen?”
Possible answers: multicollinearity, confounding variables, or unstable estimates driven by limited data or low variance in that feature.
⚙️ Regularization & Model Generalization
Note
The Top Tech Company Angle (Regularization in Logistic Regression):
This is used to evaluate how candidates handle bias–variance trade-offs and prevent overfitting. Knowing how and when to use L1 (Lasso) vs L2 (Ridge) shows you can think critically about model robustness.
2.1: Understand Regularized Logistic Regression
- Add the penalty term to the cost:
- L1 Regularization (Lasso):
\[ J(\beta) = J_{\text{original}} + \lambda \sum_j |\beta_j| \]
Encourages sparsity (feature selection).
- L2 Regularization (Ridge):
\[ J(\beta) = J_{\text{original}} + \lambda \sum_j \beta_j^2 \]
Shrinks coefficients smoothly, improving stability.
- Explain why regularization helps — by reducing model complexity and variance.
Deeper Insight:
Interviewers often test your intuition here: “How would you decide between L1 and L2?”
- L1 → When you expect only a few features are important.
- L2 → When you believe all features contribute, but to varying extents.
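A small sketch contrasting the two penalties in `sklearn` (only some solvers, e.g. liblinear and saga, support L1); the dataset and the `C` value are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# L1 (Lasso-style): liblinear supports the 'l1' penalty and tends to zero out weak features
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
# L2 (Ridge-style): the default penalty; shrinks coefficients but keeps them all non-zero
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("non-zero L1 coefficients:", np.sum(l1.coef_ != 0), "of", X.shape[1])
print("non-zero L2 coefficients:", np.sum(l2.coef_ != 0), "of", X.shape[1])
```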
2.2: Tune Hyperparameters and Evaluate
- Use `GridSearchCV` to tune the regularization parameter \( C = 1/\lambda \) (see the sketch after this list).
- Validate performance using stratified cross-validation (important for imbalanced datasets).
- Compare results visually using ROC curves, AUC, and confusion matrices.
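A sketch of tuning \( C \) with `GridSearchCV` and stratified folds; the grid, scoring metric, and synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

param_grid = {"C": np.logspace(-3, 3, 7)}          # C = 1/lambda
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X, y)
print("best C:", search.best_params_["C"], "cross-validated AUC:", search.best_score_)
```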
Probing Question:
“What happens if you set \( \lambda = 0 \)? What if \( \lambda \) is very large?”
Be prepared to discuss overfitting vs. underfitting extremes.
🧠 Advanced Understanding & Practical Considerations
Note
The Top Tech Company Angle (Advanced Logistic Regression):
Here, you’re tested on scalability, decision thresholds, and probabilistic reasoning — crucial for production ML systems that must operate under uncertainty.
3.1: Decision Thresholds and Metrics
- Default threshold = 0.5, but that's not always optimal. Learn to adjust thresholds for business goals (e.g., higher recall vs. precision); see the sketch after this list.
- Explore metrics: precision, recall, F1, ROC-AUC, PR curves.
- Use calibration plots to evaluate whether predicted probabilities align with actual outcomes.
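A sketch of moving the decision threshold off 0.5 and reading the precision/recall trade-off; the 0.3 cutoff and the imbalanced toy data are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]        # P(y=1 | x)

for threshold in (0.5, 0.3):                 # lowering the threshold trades precision for recall
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: precision={precision_score(y_te, pred):.3f}, "
          f"recall={recall_score(y_te, pred):.3f}")
```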
Probing Question:
“How would you adjust the threshold if false negatives are more costly?”
This tests your practical judgment and business-awareness.
3.2: Handling Imbalanced Data
- Explore class weighting, resampling, and SMOTE; a class-weighting sketch follows this list.
- Be able to justify when and why to use each method.
- Show awareness that metrics like accuracy can be misleading for imbalanced datasets.
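A minimal sketch of class weighting as an alternative to resampling; `class_weight='balanced'` reweights each class inversely to its frequency, and the synthetic imbalance here is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Compare minority-class recall; overall accuracy alone hides the difference
print(classification_report(y_te, plain.predict(X_te), digits=3))
print(classification_report(y_te, weighted.predict(X_te), digits=3))
```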
Deeper Insight:
A common trick question: “Why not just oversample the minority class indefinitely?”
The answer: it increases overfitting and does not generate new information — hence the preference for synthetic approaches like SMOTE.
3.3: Logistic Regression at Scale
- Discuss parallelized gradient descent and distributed implementations (e.g., using Spark MLlib).
- Mention convergence acceleration techniques like Newton-Raphson or L-BFGS.
- Be aware of memory bottlenecks — for large feature sets, sparse matrix representations are crucial.
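A sketch of streaming mini-batch training on sparse features with `SGDClassifier` (recent scikit-learn names the logistic loss `log_loss`); the random batches and placeholder labels stand in for a real out-of-core data pipeline.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Logistic regression trained incrementally by SGD; scales to data that never fits in memory
clf = SGDClassifier(loss="log_loss", penalty="l2")
classes = np.array([0, 1])

for i in range(50):
    # Each sparse CSR batch stands in for a chunk streamed from disk or a feature-hashing pipeline
    X_batch = sparse.random(1000, 100_000, density=0.0005, format="csr", random_state=i)
    y_batch = np.random.default_rng(i).integers(0, 2, size=1000)   # placeholder labels
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.coef_.shape)   # (1, 100000) weights learned without materializing the full dataset
```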
Probing Question:
“You’re training logistic regression on 10M samples with 1M features — what’s your approach?”
Top candidates mention feature hashing, online learning, or dimensionality reduction before training.
3.4: Multiclass Logistic Regression
- Learn One-vs-Rest (OvR) and Softmax (Multinomial) strategies.
- Understand how softmax generalizes sigmoid:
\[ P(y=k|x) = \frac{e^{x^T\beta_k}}{\sum_{j} e^{x^T\beta_j}} \]
- Know the computational trade-offs between OvR and multinomial implementations (see the sketch below).
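A compact sketch showing that multinomial probabilities are just a softmax over per-class linear scores; it assumes a recent scikit-learn where the default lbfgs solver fits the multinomial formulation for multiclass targets, and the toy dataset is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def softmax(scores):
    """Row-wise softmax over class scores of shape (m, K)."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
# With the default lbfgs solver, recent scikit-learn fits the multinomial (softmax) model
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = X @ clf.coef_.T + clf.intercept_   # one column of x^T beta_k per class
np.testing.assert_allclose(softmax(scores), clf.predict_proba(X), atol=1e-8)
print("softmax over the linear scores reproduces predict_proba")
```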
Probing Question:
“How does multiclass logistic regression differ from a neural network’s final layer?”
Both use softmax, but neural nets learn nonlinear feature transformations before applying it.