7.1. Filter Methods
🪄 Step 1: Intuition & Motivation
Core Idea: In machine learning, not all features deserve equal attention — some are informative, others redundant, and a few are pure noise. The more irrelevant features you feed your model, the harder it becomes to pick out meaningful patterns.
Filter methods are the first line of defense — quick, statistical tools that rank or remove irrelevant features before modeling.
Think of it as cleaning your lens before taking a photo — if the input view is fuzzy, no model can see clearly.
Simple Analogy: Imagine you’re judging a singing contest. Before you even listen to the full song, you can eliminate contestants who can’t hit a single note. Filter methods work similarly — they remove obviously unhelpful features early, so the model can focus on the promising ones.
🌱 Step 2: Core Concept
Filter methods assess the relationship between each feature and the target variable independently — without involving any specific ML model. They rely on statistical tests or correlation measures to score and rank features.
They’re fast, interpretable, and ideal for initial dimensionality reduction, especially with large datasets.
Correlation-Based Feature Selection
Goal: Measure how strongly each feature correlates with the target variable (and among themselves).
- For continuous features and a continuous target: use the Pearson correlation coefficient ($r$).
- For categorical features: use Cramér’s V; when one variable is binary and the other continuous, use the point-biserial correlation.
Formula (Pearson’s r):
$$ r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}} $$
Interpretation:
- $r = 1$ → perfectly positive relationship
- $r = -1$ → perfectly negative relationship
- $r = 0$ → no linear relationship
Practical Step:
- Keep features with |r| above a certain threshold (e.g., 0.3).
- Remove redundant features that are highly correlated with each other (multicollinearity).
Limitation: Captures only linear relationships — may miss nonlinear patterns.
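A minimal sketch of this two-step procedure with pandas: keep features whose $|r|$ with the target clears a threshold, then drop one of each highly correlated pair. The synthetic data and the 0.3 / 0.9 cutoffs are illustrative assumptions, not fixed rules:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Toy data so the sketch runs end to end; feature names are illustrative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
df["target"] = y

# 1) Rank features by absolute Pearson correlation with the target.
corr_with_target = df.corr()["target"].drop("target").abs()
relevant = corr_with_target[corr_with_target > 0.3].index.tolist()

# 2) Among the remaining features, drop one of each highly correlated pair
#    (multicollinearity check at |r| > 0.9 on the upper triangle).
corr_matrix = df[relevant].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
selected = [f for f in relevant if f not in redundant]

print("Kept features:", selected)
```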
Chi-Square (χ²) Test for Categorical Features
Goal: Check if two categorical variables (feature and target) are independent or related.
If they’re independent → feature provides no information about the target. If dependent → feature is informative.
Formula:
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
where:
- $O$ = observed frequency
- $E$ = expected frequency (if variables were independent)
High χ² means large deviation from independence → strong relationship.
Steps:
- Create a contingency table of feature vs target.
- Compute expected frequencies.
- Calculate χ² statistic and p-value.
- Select features with significant association (p < 0.05).
Example: If the feature “Marital Status” has a high χ² value with the target “Loan Default,” it’s likely relevant.
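A minimal sketch of these steps using `scipy.stats.chi2_contingency`; the tiny marital-status dataset below is made up purely for illustration, so its counts are far too small for a real analysis:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data mirroring the "Marital Status" vs "Loan Default" example.
df = pd.DataFrame({
    "marital_status": ["single", "married", "single", "divorced", "married",
                       "single", "married", "divorced", "single", "married"],
    "loan_default":   ["yes", "no", "yes", "yes", "no",
                       "no", "no", "yes", "yes", "no"],
})

# 1) Contingency table of feature vs. target.
table = pd.crosstab(df["marital_status"], df["loan_default"])

# 2)-3) Expected frequencies, chi-square statistic, and p-value.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")

# 4) Keep the feature only if the association is significant.
if p_value < 0.05:
    print("Keep 'marital_status': significant association with the target.")
else:
    print("Drop 'marital_status': no significant association detected.")
```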
Limitation:
- Works only with categorical data.
- Doesn’t indicate direction or strength — only significance.
Mutual Information — The Information-Theoretic Lens
Goal: Quantify how much information a feature provides about the target — applicable to both categorical and continuous data.
Mutual Information (MI) measures the reduction in uncertainty about the target when we know the feature.
Formula:
$$ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log{\frac{p(x, y)}{p(x)p(y)}} $$
Interpretation:
- $I(X;Y) = 0$ → variables are independent (no information shared).
- Higher $I(X;Y)$ → stronger dependency (feature is informative).
Advantages:
- Captures both linear and nonlinear relationships.
- Works for both categorical and continuous variables (continuous features are handled via discretization or nearest-neighbor estimation).
Implementation Tip:
Use mutual_info_classif() or mutual_info_regression() from scikit-learn.
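For instance, a minimal sketch on synthetic classification data (the feature names and the top-5 cutoff are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy classification data; feature names are illustrative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Estimate MI between each feature and the (categorical) target.
mi_scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(mi_scores, index=feature_names).sort_values(ascending=False)
print(ranking)

# Keep, say, the top 5 features; the cutoff is a modeling choice.
top_features = ranking.head(5).index.tolist()
print("Selected:", top_features)
```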
Limitation: Computationally heavier than correlation and χ²; harder to interpret intuitively.
How It Fits in ML Thinking
Filter methods act as model-agnostic gatekeepers — ensuring only relevant signals reach the model training stage.
They’re especially valuable when:
- You have many features but limited samples.
- You need to remove noise quickly before applying model-based selection.
However, they’re blind to model context — a feature might score high statistically but still perform poorly in a nonlinear model. Hence, they’re best used as a first filter, not the final selection.
📐 Step 3: Mathematical Foundation
1️⃣ Correlation Coefficient (Pearson’s r)
- Measures linear association.
- Range: [-1, 1].
- $r^2$ gives the proportion of variance explained (e.g., $r = 0.6$ means a linear fit on that feature explains 36% of the target’s variance).
2️⃣ Chi-Square Statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
where:
- $O$ = observed count in each cell
- $E$ = expected count under the independence assumption
Higher χ² means stronger dependence between variables.
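As a quick worked example with made-up counts: take a 2×2 table with observed counts 30 and 10 in the first row and 20 and 40 in the second (row totals 40 and 60, column totals 50 and 50, grand total 100). Under independence the expected counts are $E_{11} = \frac{40 \times 50}{100} = 20$, $E_{12} = 20$, $E_{21} = 30$, $E_{22} = 30$, so
$$ \chi^2 = \frac{(30-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(20-30)^2}{30} + \frac{(40-30)^2}{30} \approx 16.7 $$
With 1 degree of freedom, this far exceeds the 5% critical value of about 3.84, so the feature and target are clearly dependent.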
3️⃣ Mutual Information
$$ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log{\frac{p(x, y)}{p(x)p(y)}} $$
It measures shared information between $X$ and $Y$ — how much knowing one reduces uncertainty about the other.
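A quick worked example with made-up probabilities: let $X$ and $Y$ be binary with $p(0,0) = p(1,1) = 0.4$ and $p(0,1) = p(1,0) = 0.1$, so every marginal equals 0.5. Using base-2 logarithms,
$$ I(X; Y) = 2 \times 0.4 \log_2{\frac{0.4}{0.25}} + 2 \times 0.1 \log_2{\frac{0.1}{0.25}} \approx 0.54 - 0.26 = 0.28 \text{ bits} $$
so knowing $X$ removes about 0.28 of the 1 bit of uncertainty in $Y$.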
🧠 Step 4: Assumptions or Key Ideas
- Features are independently tested — interactions are ignored.
- No underlying model is used (purely statistical).
- Each metric has its own data requirements: Pearson’s $r$ needs numeric features (and is unaffected by linear rescaling), χ² needs non-negative categorical counts, and continuous features may need discretization or special estimators for MI.
- Suitable as a pre-filter before wrapper or embedded methods.
⚖️ Step 5: Strengths, Limitations & Trade-offs
Strengths:
- Fast and computationally light.
- Model-agnostic and easy to interpret.
- Good for quick feature pruning on large datasets.
- Works as a foundation for pipeline-based feature engineering.
Limitations:
- Ignores feature interactions.
- Cannot capture model-specific nuances.
- Sensitive to data type (requires the correct method for categorical vs. continuous features).
Trade-offs:
- Use filter methods early for rapid dimensionality reduction.
- Combine them later with Wrapper (RFE) or Embedded (Lasso) methods for deeper optimization (see the pipeline sketch below).
- Don’t rely solely on them for final model tuning — they focus on statistical, not predictive, relationships.
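As one way to wire this up, here is a sketch of a scikit-learn pipeline that runs a mutual-information filter (`SelectKBest`) before a model; the synthetic data, `k=8`, and the logistic-regression stand-in are illustrative choices, and an RFE or Lasso step could be added after the filter in practice:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter step first (fast, model-agnostic), then the model.
pipe = Pipeline([
    ("filter", SelectKBest(score_func=mutual_info_classif, k=8)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy with filter pre-selection: {scores.mean():.3f}")
```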
🚧 Step 6: Common Misunderstandings
“High correlation always means good feature.” Not necessarily — correlated features may cause multicollinearity or redundancy.
“Filter methods understand the model.” No — they ignore model behavior entirely.
“Mutual information gives direction.” It doesn’t; MI is symmetric — it tells you there’s a relationship, not which variable depends on the other.
🧩 Step 7: Mini Summary
🧠 What You Learned: Filter methods rank features statistically using correlation, Chi-Square, or Mutual Information.
⚙️ How It Works: They evaluate how each feature individually relates to the target — no model needed.
🎯 Why It Matters: Because great ML begins with great inputs — and filter methods help you separate the signal from the static early in the process.