# How to Prevent Overfitting (Validation Strategy)
Overfitting occurs when your model learns training noise instead of general patterns. Proper validation is your primary defence.
## Signs of Overfitting
| Metric | Training Set | Test Set | Diagnosis |
|---|---|---|---|
| Accuracy | 0.99 | 0.72 | Overfitting — large gap between train and test |
| Accuracy | 0.85 | 0.83 | Good generalisation — small gap |
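The train/test gap in the table can be measured directly. A minimal sketch (the dataset is synthetic, chosen for illustration) using a deliberately unconstrained decision tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorise the training set perfectly
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
train_acc = dt.score(X_train, y_train)
test_acc = dt.score(X_test, y_test)
print(f"Train: {train_acc:.2f}, Test: {test_acc:.2f}, Gap: {train_acc - test_acc:.2f}")
```

A large positive gap here is the diagnostic signal; the strategies below aim to shrink it.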
## Prevention Strategies

### 1. Cross-Validation
Never evaluate on a single train/test split. Use k-Fold CV to get a robust estimate:
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
### 2. Regularisation
Add penalties to model complexity (see Regularisation):
```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularisation strength — smaller C = more regularisation
lr = LogisticRegression(C=0.1, penalty="l2", max_iter=1000)
```
### 3. Reduce Model Complexity
Constrain hyperparameters to prevent the model from memorising data:
```python
from sklearn.tree import DecisionTreeClassifier

# Limit tree growth
dt = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
```
### 4. Early Stopping
For iterative algorithms, stop training when validation error starts increasing:
```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on boosting stages
    validation_fraction=0.2,  # held out internally to monitor validation score
    n_iter_no_change=10,      # stop if no improvement for 10 consecutive iterations
    tol=0.001,
)
```
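After fitting, the `n_estimators_` attribute reports how many boosting stages were actually trained before early stopping kicked in. A sketch on synthetic data (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=1000,
    validation_fraction=0.2,
    n_iter_no_change=10,
    tol=0.001,
    random_state=42,
).fit(X, y)

# Far fewer than the 1000-stage budget should be used on an easy problem
print(f"Stopped after {gb.n_estimators_} of 1000 estimators")
```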
### 5. More Data
Sometimes the simplest fix is more training data. Overfitting is fundamentally a problem of having too many parameters relative to the number of observations.
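This effect can be seen with `sklearn.model_selection.learning_curve`, which refits the model on increasing fractions of the data; as the training set grows, the train/validation gap typically narrows (a sketch, with an illustrative synthetic dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Refit at 10%, 50% and 100% of the available training data
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    train_sizes=[0.1, 0.5, 1.0], cv=5,
)
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, gap in zip(sizes, gaps):
    print(f"n={n}: train-val gap = {gap:.3f}")
```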
> **Common Pitfall**
>
> Tuning hyperparameters on the test set causes information leakage. Always use a separate validation set (or nested CV) for tuning, and reserve the test set for the final evaluation only.
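Nested CV can be sketched by placing a `GridSearchCV` (the inner, tuning loop) inside `cross_val_score` (the outer, evaluation loop), so the generalisation estimate comes from folds the tuner never saw. The parameter grid below is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Inner loop: tune max_depth on each outer-training fold
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10]},
    cv=3,
)

# Outer loop: evaluate the whole tuning procedure on held-out folds
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```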
## KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |