Skip to content

Filter Methods

Feeding perfectly predictive features to an algorithm generates high accuracy. Feeding useless features to an algorithm generates disastrous noise. We must filter.

What You Will Learn

  • Differentiate Filter logic from Wrapper/Embedded logic
  • Use Variance Thresholding to automatically eliminate constant values
  • Execute SelectKBest mathematically targeting ANOVA distributions

Prerequisites

  • Basic understanding of correlations
  • Completed engineering numerical subsets

Step 1: The Concept of Filtering

A "Filter" method objectively evaluates every feature singularly against the Target Variable using raw statistical mathematics (like Correlation, Chi-Square, or ANOVA).

It explicitly does not train a Machine Learning model. Because no models are trained, Filter methods are blindingly computationally fast. You execute them first to brutally slash 100,000 text columns down to a manageable 500 coordinates.

Step 2: Variance Thresholding (The Minimum Baseline)

If a column is utterly structurally constant (e.g. is_Earth=True for every human recorded), it provides 0% predictive lift. VarianceThreshold mathematically drops any column where the numerical variance falls beneath a defined threshold.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic Data: Feature 2 is identical for all users
df = pd.DataFrame({
    'Height': [170, 180, 160, 190],
    'Planet': [1, 1, 1, 1],
    'Weight': [70, 85, 60, 95]
})

# Drop any column that has literally zero variance (all identical)
selector = VarianceThreshold(threshold=0.0)
df_filtered = pd.DataFrame(selector.fit_transform(df))

# Which columns survived? 
surviving_cols = df.columns[selector.get_support()]
print(f"Survived Columns: {list(surviving_cols)}")
Expected Output
Survived Columns: ['Height', 'Weight']

Step 3: SelectKBest (ANOVA)

SelectKBest scores every explicit feature mathematically and strictly retains the optimal "K" (top N) features producing the largest signal bounds.

For continuous features mapping against a categorical target (e.g., predicting Penguin Species utilizing float values like flipper_length), calculating the ANOVA F-Value score is industry compliant.

import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif

df = sns.load_dataset('penguins').dropna()

# Extract exactly the numeric variables to examine
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

# We request SelectKBest to evaluate ALL columns just to see the mathematical scores
selector_f = SelectKBest(score_func=f_classif, k='all')
selector_f.fit(X, y)

# Construct a clean DataFrame to output rankings
scores = pd.DataFrame({
    'Feature': X.columns, 
    'ANOVA F-Score': selector_f.scores_
}).sort_values(by='ANOVA F-Score', ascending=False)

print(scores.round(2))
Expected Output
             Feature  ANOVA F-Score
2  flipper_length_mm         593.59
0     bill_length_mm         410.60
1      bill_depth_mm         359.85
3        body_mass_g         343.63

In a live production environment, instead of k='all', you would write k=2 and the transformer would implicitly cleanly drop body_mass and bill_depth physically from the dataset matrix returning specifically the optimal top 2 tensors.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.barplot(data=scores, x='ANOVA F-Score', y='Feature', palette='viridis')
plt.title('ANOVA F-Value Feature Importance')
plt.tight_layout()
plt.show()
Expected Plot

ANOVA Feature Importance

Assessment Connection

You are required by section 3 of the EPA grading guidelines to mathematically explicitly justify your dimensional reduction strategy. Showing the examiner your SelectKBest F-Score distribution plots prevents accusations of "arbitrary" deletion natively.

Summary

  • Filter Methods evaluate variables totally independently of ML algorithms using pure statistics.
  • Use VarianceThreshold to programmatically delete constant boolean arrays.
  • Use SelectKBest parameterized with f_classif or chi2 to keep precisely the highest quantitative analytical arrays.

Next Steps

Wrapper Methods — why filter methods fail to consider multicollinearity bounds, and why algorithms must iteratively test dimensional structures.

Stretch & Challenge

For Advanced Learners

Mutual Information for Non-Linear Distributions

ANOVA F-Scores (f_classif) strictly measure linear boundaries. If your target maps against a feature on a parabolic or circular curve, ANOVA will grade the feature as 0 completely erroneously.

Instead, you must compute the entropy using mutual_info_classif.

from sklearn.feature_selection import mutual_info_classif

selector_mi = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi = selector_mi.fit_transform(X, y)

mutual_info_classif is computationally much heavier than ANOVA but natively discovers explosive nonlinear intersections perfectly!

KSB Mapping

KSB Description How This Addresses It
K4.2 Advanced analytics and ML techniques Feature selection algorithms and dimensionality reduction
K5.2 Data formats and structures Encoding categorical variables, handling mixed feature types
S2 Data engineering Creating and transforming features from raw data
S4 Feature selection and ML Applying feature selection methods and PCA
B1 Inquisitive approach Exploring creative feature engineering strategies