7.3. Embedded Methods
🪄 Step 1: Intuition & Motivation
Core Idea: Embedded methods take the best of both worlds — the efficiency of filter methods and the model-awareness of wrapper methods — by selecting features during model training itself.
These methods “bake in” feature selection into the learning process. As the model learns weights or splits, it automatically penalizes or ignores less useful features, keeping only the most informative ones.
Think of it like a chef who trims unnecessary ingredients while cooking, not before or after.
Why It’s Powerful: Because feature selection becomes part of optimization — no separate search or testing loops needed. The model itself decides which features matter most.
🌱 Step 2: Core Concept
Embedded methods select features as a side-effect of model training. Two of the most widely used techniques are:
- Tree-based models — naturally assign importance based on how often and how effectively features split data.
- Regularized linear models — penalize large or unnecessary coefficients to enforce sparsity.
Tree-Based Feature Importance — The Split-Driven Approach
Idea: In tree-based models (like Decision Trees, Random Forests, and Gradient Boosting), features that lead to larger reductions in impurity (e.g., Gini, Entropy, MSE) are deemed more important.
How It Works:
- At each split, the model chooses the feature that best reduces impurity.
- The total reduction in impurity (across all trees and splits) is summed for each feature.
- These sums are normalized to produce feature importance scores.
Mathematical Form (for Gini Importance):
$$ I(f_j) = \sum_{t \in T_j} p(t) \cdot \Delta i(t) $$

where:
- $T_j$ → set of all nodes where feature $f_j$ is used
- $p(t)$ → proportion of samples reaching node $t$
- $\Delta i(t)$ → impurity decrease due to feature $f_j$ at node $t$
Example: If “Age” reduces impurity 5 times more than “Income” across all splits, it gets a proportionally higher importance score.
Intuition:
The more a feature helps the model make confident splits, the more “influential” it is.
Implementation Tip:
Use .feature_importances_ in models like RandomForestClassifier, XGBoost, or LightGBM.
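A minimal sketch of this in scikit-learn (the dataset and hyperparameters here are illustrative, not prescriptive):

```python
# Minimal sketch: impurity-based importances from a random forest.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# feature_importances_ holds the normalized total impurity decrease per feature
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```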
Limitations:
- Biased toward continuous features or features with many unique values.
- Importance ≠ causation — high importance just means the feature was useful for splitting, not necessarily causal.
Lasso (L1) Regularization — Sparsity by Design
Idea: In linear models, regularization can be used to penalize large coefficients, forcing some of them to shrink exactly to zero. This means the corresponding features are effectively excluded — a built-in form of feature selection.
Mathematical Form (Lasso Regression):
$$ \text{Loss} = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}|\beta_j| $$

where:
- $\lambda$ controls the strength of the penalty.
- Larger $\lambda$ → stronger shrinkage → more coefficients become zero.
Why It Works: The $L1$ norm ($|\beta_j|$) creates a “sharp corner” in the optimization landscape, causing some coefficients to hit zero exactly — unlike Ridge ($L2$), which only shrinks them continuously.
Effect:
- Unimportant or redundant features → coefficients become 0.
- Important features → coefficients survive.
Example: If you have 100 features but only 10 truly matter, Lasso can automatically zero out the other 90 — leaving a sparse, interpretable model.
Key Insight (from the probing question):
When two correlated features exist, Lasso often keeps one and drops the other arbitrarily. This improves sparsity but can hurt interpretability — you lose clarity about which specific variable was responsible.
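You can see this behavior directly with a small synthetic sketch (the data, alpha, and noise levels below are arbitrary choices for illustration):

```python
# Sketch: Lasso with two nearly identical (highly correlated) features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # typically one coefficient carries nearly all the weight,
                    # while the other is driven to (or very near) zero
```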
Implementation Tip: In scikit-learn:
```python
from sklearn.linear_model import Lasso

# Assumes X is a pandas DataFrame (so column names are available) and y its target
model = Lasso(alpha=0.1)
model.fit(X, y)

# Features whose coefficients survived the L1 penalty (i.e., are non-zero)
selected_features = X.columns[model.coef_ != 0]
```

How It Fits in ML Thinking
Embedded methods represent model-conscious feature selection — features are evaluated by how much they contribute to the model’s learning objective.
- In tree models, importance emerges from how often a feature helps make confident splits.
- In regularized models, importance emerges from how much the model “needs” that coefficient to minimize error.
They bridge the gap between statistical relevance (filter) and empirical validation (wrapper) — providing the best of both.
📐 Step 3: Mathematical Foundation
1️⃣ Lasso Regularization Objective
- The penalty term $\lambda \sum |\beta_j|$ encourages sparsity.
- If $\lambda$ is large → more coefficients shrink to 0.
Geometric Intuition: L1 penalty constrains $\beta$ inside a diamond-shaped boundary. The corners of this diamond cause solutions to “stick” to the axes (i.e., $\beta_j = 0$).
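The practical consequence is easy to verify: sweeping $\lambda$ (called `alpha` in scikit-learn) over a synthetic dataset where only a handful of features are informative shows the coefficient vector becoming progressively sparser. A sketch, with arbitrary sizes and alphas:

```python
# Sketch: larger alpha (lambda) -> more coefficients driven exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Only 5 of the 50 features actually drive the target
X, y = make_regression(n_samples=300, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale

for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    print(f"alpha={alpha:<5}: {np.count_nonzero(coef)} non-zero coefficients")
```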
2️⃣ Tree-Based Feature Importance (Gini or Entropy Decrease)
Each time feature $f_j$ is used in a split, it reduces the impurity of the dataset. Summing across all trees gives a total importance score.
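A quick worked example with made-up numbers: suppose feature $f_j$ is used at a single node $t$ that receives 40% of the training samples ($p(t) = 0.4$), and the split there lowers the Gini impurity from 0.48 to a weighted child impurity of 0.30, so $\Delta i(t) = 0.18$. That node contributes

$$ p(t) \cdot \Delta i(t) = 0.4 \times 0.18 = 0.072 $$

Repeating this for every node (and every tree) where $f_j$ is used, then normalizing so that all features' scores sum to 1, gives the value reported by `.feature_importances_`.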
🧠 Step 4: Assumptions or Key Ideas
- Lasso assumes linear relationships between features and target.
- Tree importance assumes greedy splitting accurately reflects contribution.
- Lasso assumes features are on comparable scales (standardize before fitting); tree-based models are largely insensitive to feature scaling.
- Correlated features may distort importance — Lasso drops one, tree models distribute importance arbitrarily among them.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Built-in feature selection — no separate step needed.
- Efficient — performed during model training.
- Works well with high-dimensional data.
- Regularization improves generalization and prevents overfitting.

Limitations:
- Tree-based importance can be biased toward numeric or high-cardinality features.
- Lasso struggles with correlated features (drops one arbitrarily).
- Coefficient shrinkage may underrepresent weak but meaningful predictors.

Trade-offs:
- Use Lasso when you need sparse, interpretable linear models.
- Use tree-based models when you expect nonlinear interactions.
- Combine them for complementary insights — e.g., use Lasso to prune features, then validate with tree importances (see the sketch below).
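One possible shape of that combination, sketched with illustrative data and settings (SelectFromModel keeps the features whose Lasso coefficients are non-zero):

```python
# Sketch: prune features with Lasso, then sanity-check the survivors
# with a tree-based importance ranking.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=40, n_informative=8,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: Lasso zeroes out most coefficients; keep only the non-zero ones
selector = SelectFromModel(Lasso(alpha=0.5, max_iter=10_000)).fit(X, y)
X_pruned = selector.transform(X)

# Step 2: validate the surviving features with a random forest
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_pruned, y)
print("Kept feature indices:", selector.get_support(indices=True))
print("Tree importances:", forest.feature_importances_.round(3))
```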
🚧 Step 6: Common Misunderstandings
“Lasso and Ridge are the same.” Ridge (L2) shrinks coefficients but never sets them to zero — Lasso (L1) enforces sparsity.
“Feature importance = causal importance.” High importance doesn’t mean causation — just contribution to prediction.
“Tree importances are always trustworthy.” They can be biased; use permutation importance for validation.
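One way to run that validation: scikit-learn's permutation_importance measures how much a held-out score drops when each feature's values are shuffled. A minimal sketch (dataset and settings are illustrative):

```python
# Sketch: permutation importance as a cross-check on impurity-based importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print(X.columns[top])
print(result.importances_mean[top].round(3))
```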
🧩 Step 7: Mini Summary
🧠 What You Learned: Embedded methods perform feature selection during training — either by penalizing unhelpful features (Lasso) or prioritizing splitting power (trees).
⚙️ How It Works: Tree-based models rank features by impurity reduction; Lasso forces unimportant coefficients to zero.
🎯 Why It Matters: Because efficient models aren’t just accurate — they’re focused — learning only from features that truly move the needle.