# How to Prevent Overfitting (Validation Strategy)
Overfitting occurs when your model learns training noise instead of general patterns. Proper validation is your primary defence.
## Signs of Overfitting
| Metric | Training Set | Test Set | Diagnosis |
|---|---|---|---|
| Accuracy | 0.99 | 0.72 | Overfitting — large gap between train and test |
| Accuracy | 0.85 | 0.83 | Good generalisation — small gap |
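The train/test gap in the table can be measured directly. A minimal sketch (the dataset is synthetic, chosen for illustration) using a deliberately unconstrained decision tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorise the training set perfectly
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
train_acc = dt.score(X_train, y_train)
test_acc = dt.score(X_test, y_test)
print(f"Train: {train_acc:.2f}, Test: {test_acc:.2f}, Gap: {train_acc - test_acc:.2f}")
```

A large positive gap here is the diagnostic signal; the strategies below aim to shrink it.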
## Prevention Strategies

### 1. Cross-Validation
Never evaluate on a single train/test split. Use k-Fold CV to get a robust estimate:
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
### 2. Regularisation
Add penalties to model complexity (see Regularisation):
```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularisation strength — smaller C = more regularisation
lr = LogisticRegression(C=0.1, penalty="l2", max_iter=1000)
```
### 3. Reduce Model Complexity
Constrain hyperparameters to prevent the model from memorising data:
```python
from sklearn.tree import DecisionTreeClassifier

# Limit tree growth
dt = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
```
### 4. Early Stopping
For iterative algorithms, stop training when validation error starts increasing:
```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on boosting stages
    validation_fraction=0.2,  # held out internally to monitor validation score
    n_iter_no_change=10,      # stop if no improvement for 10 consecutive iterations
    tol=0.001,
)
```
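After fitting, the `n_estimators_` attribute reports how many boosting stages were actually trained before early stopping kicked in. A sketch on synthetic data (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=1000,
    validation_fraction=0.2,
    n_iter_no_change=10,
    tol=0.001,
    random_state=42,
).fit(X, y)

# Far fewer than the 1000-stage budget should be used on an easy problem
print(f"Stopped after {gb.n_estimators_} of 1000 estimators")
```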
### 5. More Data
Sometimes the simplest fix is more training data. Overfitting is fundamentally a problem of having too many parameters relative to the number of observations.
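This effect can be seen with `sklearn.model_selection.learning_curve`, which refits the model on increasing fractions of the data; as the training set grows, the train/validation gap typically narrows (a sketch, with an illustrative synthetic dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Refit at 10%, 50% and 100% of the available training data
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    train_sizes=[0.1, 0.5, 1.0], cv=5,
)
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, gap in zip(sizes, gaps):
    print(f"n={n}: train-val gap = {gap:.3f}")
```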
> **Common Pitfall**
>
> Tuning hyperparameters on the test set causes information leakage. Always use a separate validation set (or nested CV) for tuning, and reserve the test set for the final evaluation only.
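Nested CV can be sketched by placing a `GridSearchCV` (the inner, tuning loop) inside `cross_val_score` (the outer, evaluation loop), so the generalisation estimate comes from folds the tuner never saw. The parameter grid below is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Inner loop: tune max_depth on each outer-training fold
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10]},
    cv=3,
)

# Outer loop: evaluate the whole tuning procedure on held-out folds
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```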
## KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |