# How to Prevent Overfitting
Overfitting occurs when a model memorises the noise and anomalies in its training set and then performs poorly on unseen data.
## Method 1: Hyperparameter Constraints (Pruning)
Left unconstrained, flexible models (decision trees, neural networks) will keep fitting the training data until training error approaches zero. You must restrict their capacity.
```python
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the Titanic dataset and keep rows with complete features
df = sns.load_dataset("titanic").dropna(subset=["age", "fare", "survived"])
X = df[["age", "fare"]]
y = df["survived"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Unbounded tree: grows until it has memorised the training data
bad_tree = DecisionTreeClassifier(random_state=42)
bad_tree.fit(X_tr, y_tr)
print(f"Unbound Train Accuracy: {bad_tree.score(X_tr, y_tr):.2f}")
print(f"Unbound Test Accuracy: {bad_tree.score(X_te, y_te):.2f}")
```
The unbound tree memorised the training noise: 99% train accuracy versus 61% test accuracy, a huge gap.
Now constrain it using max_depth (limits tree depth) and min_samples_split (requires a minimum number of samples to split a node):
```python
good_tree = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=42)
good_tree.fit(X_tr, y_tr)
print(f"Constrained Train Accuracy: {good_tree.score(X_tr, y_tr):.2f}")
print(f"Constrained Test Accuracy: {good_tree.score(X_te, y_te):.2f}")
```
The scores are now closely aligned, showing that the model generalises rather than memorises.
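Rather than guessing values like `max_depth=4`, you can select constraints by cross-validation. A minimal sketch using `GridSearchCV` on synthetic data (the parameter grid and dataset here are illustrative choices, not from this page):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification task, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Try each combination of constraints with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 10, 20]},
    cv=5,
)
search.fit(X, y)
print("Best constraints:", search.best_params_)
```

The combination with the highest mean validation score wins, which keeps the choice of constraints honest rather than tuned to the test set.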
## Method 2: Early Stopping (Iterative Models)
For Gradient Boosting or Neural Networks, you can monitor validation error during training and halt automatically when performance stops improving.
```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on boosting rounds
    validation_fraction=0.2,  # held-out fraction used to monitor progress
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    tol=0.01,                 # minimum improvement that counts
    random_state=42,
)
gb.fit(X_tr, y_tr)
print(f"Boosting stopped at iteration: {gb.n_estimators_}")
```
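Neural networks support the same idea. A minimal sketch using scikit-learn's `MLPClassifier` on synthetic data (the dataset and network settings are illustrative assumptions, not from this page):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic classification task, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# early_stopping=True holds out 20% of the training data and halts
# when the validation score stops improving for 5 consecutive epochs
mlp = MLPClassifier(
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=42,
)
mlp.fit(X, y)
print("Training stopped after", mlp.n_iter_, "epochs")
```

Because the halting decision uses held-out validation data, the network stops at the point where generalisation (not training fit) is best.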
## Method 3: Regularisation
For linear models, apply L1 (Lasso) or L2 (Ridge) penalties to shrink or eliminate coefficients. See the Regularisation Explained page for details.
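To see the difference between the two penalties, here is a minimal sketch on synthetic regression data (the dataset and `alpha` value are illustrative assumptions): L1 drives some coefficients to exactly zero, while L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=42)

lasso = Lasso(alpha=5.0).fit(X, y)  # L1: eliminates irrelevant coefficients
ridge = Ridge(alpha=5.0).fit(X, y)  # L2: shrinks all coefficients towards zero

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

With 15 uninformative features, Lasso zeroes most of them out, which is why L1 is often used for feature selection as well as overfitting control.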
> **Workplace Tip:** Regularisation is the preferred method for preventing overfitting in linear models; hyperparameter constraints (pruning) are the equivalent for tree-based models.
## KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |