Feature Leakage¶
Target Leakage invalidates a model by revealing the answer during training.
The Concept¶
Target Leakage (a form of Data Leakage) occurs when you engineer a feature that contains information that will not be available at prediction time in the real world.
Your model achieves near-perfect accuracy during training and validation, but collapses entirely when deployed to production — because the leaked signal no longer exists.
Example: The Churn Prediction Disaster¶
Imagine building a model to predict whether a user will cancel their subscription (churn) next month.
During feature engineering, you create a column: has_called_cancellation_hotline_last_30_days.
The Leakage Flaw:
Your algorithm discovers that has_called_cancellation_hotline_last_30_days has a 99.9% correlation with the target variable churned. It assigns almost all predictive weight to this single feature, achieving 99% accuracy in cross-validation.
The Production Failure:
When you deploy the model to predict churn for next month, that column does not yet exist — you cannot know today whether a customer will call the cancellation hotline over the coming 30 days. The model's star feature is empty, and predictions collapse to random guessing.
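This failure mode is easy to reproduce on synthetic data. The sketch below (hypothetical data; assumes scikit-learn is installed) contrasts cross-validation accuracy with and without a leaky hotline flag:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
churned = rng.integers(0, 2, n)

# Legitimate feature: weakly related to churn
tenure_months = rng.normal(24, 6, n) - 4 * churned

# Leaky feature: recorded AFTER the outcome it is supposed to predict
called_hotline = ((churned == 1) & (rng.random(n) > 0.001)).astype(float)

X_leaky = np.column_stack([tenure_months, called_hotline])
X_clean = tenure_months.reshape(-1, 1)

model = LogisticRegression(max_iter=1000)
leaky_acc = cross_val_score(model, X_leaky, churned, cv=5).mean()
clean_acc = cross_val_score(model, X_clean, churned, cv=5).mean()
# leaky_acc is near-perfect; clean_acc reflects the real, modest signal
```

Note that cross-validation cannot catch this on its own: the leaky column is present in every fold, so the inflated score looks completely trustworthy until deployment.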
How to Detect Leakage¶
- Suspiciously high accuracy. If your model achieves > 95% accuracy with minimal tuning, investigate which features are driving it.
- Single-feature dominance. If one feature has overwhelming importance, check whether it is temporally valid.
- Timeline audit. For every feature, ask: "Would I physically have this value before the event I am predicting?"
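The single-feature-dominance check can be automated. A minimal sketch (hypothetical column names; assumes scikit-learn) flags any feature that absorbs an outsized share of a tree model's importance:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def dominant_features(X: pd.DataFrame, y, threshold: float = 0.8):
    """Flag features whose importance exceeds `threshold` of the total."""
    model = RandomForestClassifier(
        n_estimators=100, max_features=None, random_state=0
    ).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances[importances > threshold].index.tolist()

# Hypothetical data: one column perfectly mirrors the target
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
X = pd.DataFrame({
    "tenure": rng.normal(24, 6, 500),
    "leaky_flag": y,  # suspiciously identical to the target
})
suspects = dominant_features(X, y)  # flags "leaky_flag"
```

Any feature this check flags still needs the manual timeline audit: dominance alone does not prove leakage, but it tells you where to look first.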
The Solution¶
You must enforce a strict temporal cutoff: only include features derived from data available before the prediction point.
```python
# Correct: count each user's calls in the 90 days BEFORE each record
# (assumes call_date is a datetime column)
df = df.sort_values("call_date").set_index("call_date")
df["calls_previous_quarter"] = (
    df.groupby("user_id")["user_id"]
      .transform(lambda s: s.rolling("90D", closed="left").count())
)
```
Common Pitfall
Fitting a scaler or encoder on the full dataset (including test data) before splitting is also a form of leakage. Always fit transformers on the training set only, then apply .transform() to the test set.
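As a sketch of the right order of operations (assumes scikit-learn; the data here is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(2).normal(size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on training data ONLY
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the training statistics
```

Wrapping the scaler and model in a sklearn.pipeline.Pipeline makes this ordering automatic, including inside cross-validation, where a pre-fitted scaler would otherwise leak across folds.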
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced analytics and ML techniques | Feature selection algorithms and dimensionality reduction |
| K5.2 | Data formats and structures | Encoding categorical variables, handling mixed feature types |
| S2 | Data engineering | Creating and transforming features from raw data |
| S4 | Feature selection and ML | Applying feature selection methods and PCA |
| B1 | Inquisitive approach | Exploring creative feature engineering strategies |