Feature Leakage¶
Target Leakage invalidates a model by revealing the answer during training.
The Concept¶
Target Leakage (a form of Data Leakage) occurs when you engineer a feature that contains information that will not be available at prediction time in the real world.
Your model achieves near-perfect accuracy during training and validation, but collapses entirely when deployed to production — because the leaked signal no longer exists.
Example: The Churn Prediction Disaster¶
Imagine building a model to predict whether a user will cancel their subscription (churn) next month.
During feature engineering, you create a column: has_called_cancellation_hotline_last_30_days.
The Leakage Flaw:
Your algorithm discovers that has_called_cancellation_hotline_last_30_days has a 99.9% correlation with the target variable churned. It assigns almost all predictive weight to this single feature, achieving 99% accuracy in cross-validation.
The Production Failure:
When you deploy the model to predict churn for next month, that column does not yet exist — you cannot know today whether a customer will call the cancellation hotline over the coming 30 days. The model's star feature is empty, and predictions collapse to random guessing.
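This failure mode is easy to reproduce on synthetic data. The sketch below (hypothetical data; assumes scikit-learn is installed) contrasts cross-validation accuracy with and without a leaky hotline flag:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
churned = rng.integers(0, 2, n)

# Legitimate feature: weakly related to churn
tenure_months = rng.normal(24, 6, n) - 4 * churned

# Leaky feature: recorded AFTER the outcome it is supposed to predict
called_hotline = ((churned == 1) & (rng.random(n) > 0.001)).astype(float)

X_leaky = np.column_stack([tenure_months, called_hotline])
X_clean = tenure_months.reshape(-1, 1)

model = LogisticRegression(max_iter=1000)
leaky_acc = cross_val_score(model, X_leaky, churned, cv=5).mean()
clean_acc = cross_val_score(model, X_clean, churned, cv=5).mean()
# leaky_acc is near-perfect; clean_acc reflects the real, modest signal
```

Note that cross-validation cannot catch this on its own: the leaky column is present in every fold, so the inflated score looks completely trustworthy until deployment.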
How to Detect Leakage¶
- Suspiciously high accuracy. If your model achieves > 95% accuracy with minimal tuning, investigate which features are driving it.
- Single-feature dominance. If one feature has overwhelming importance, check whether it is temporally valid.
- Timeline audit. For every feature, ask: "Would I physically have this value before the event I am predicting?"
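The single-feature-dominance check can be automated. A minimal sketch (hypothetical column names; assumes scikit-learn) flags any feature that absorbs an outsized share of a tree model's importance:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def dominant_features(X: pd.DataFrame, y, threshold: float = 0.8):
    """Flag features whose importance exceeds `threshold` of the total."""
    model = RandomForestClassifier(
        n_estimators=100, max_features=None, random_state=0
    ).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances[importances > threshold].index.tolist()

# Hypothetical data: one column perfectly mirrors the target
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
X = pd.DataFrame({
    "tenure": rng.normal(24, 6, 500),
    "leaky_flag": y,  # suspiciously identical to the target
})
suspects = dominant_features(X, y)  # flags "leaky_flag"
```

Any feature this check flags still needs the manual timeline audit: dominance alone does not prove leakage, but it tells you where to look first.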
The Solution¶
You must enforce a strict temporal cutoff: only include features derived from data available before the prediction point.
```python
# Correct: count each user's calls in the 90 days BEFORE each record
# (assumes call_date is a datetime column)
df = df.sort_values("call_date").set_index("call_date")
df["calls_previous_quarter"] = (
    df.groupby("user_id")["user_id"]
      .transform(lambda s: s.rolling("90D", closed="left").count())
)
```
Common Pitfall
Fitting a scaler or encoder on the full dataset (including test data) before splitting is also a form of leakage. Always fit transformers on the training set only, then apply .transform() to the test set.
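As a sketch of the right order of operations (assumes scikit-learn; the data here is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(2).normal(size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on training data ONLY
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the training statistics
```

Wrapping the scaler and model in a sklearn.pipeline.Pipeline makes this ordering automatic, including inside cross-validation, where a pre-fitted scaler would otherwise leak across folds.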
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced analytics and ML techniques | Feature selection algorithms and dimensionality reduction |
| K5.2 | Data formats and structures | Encoding categorical variables, handling mixed feature types |
| S2 | Data engineering | Creating and transforming features from raw data |
| S4 | Feature selection and ML | Applying feature selection methods and PCA |
| B1 | Inquisitive approach | Exploring creative feature engineering strategies |