How to Prevent Overfitting

Overfitting occurs when a model memorises the noise and anomalies in its training set instead of learning the underlying pattern, so it scores well on training data but poorly on unseen data.

Method 1: Hyperparameter Constraints (Pruning)

Unconstrained models (decision trees, neural networks) will keep growing in complexity until training error approaches zero, memorising the data in the process. You must restrict their growth.

import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = sns.load_dataset("titanic").dropna(subset=["age", "fare", "survived"])
X = df[["age", "fare"]]
y = df["survived"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Unbound Tree (Overfitting)
bad_tree = DecisionTreeClassifier(random_state=42)
bad_tree.fit(X_tr, y_tr)
print(f"Unbound Train Accuracy: {bad_tree.score(X_tr, y_tr):.2f}")
print(f"Unbound Test Accuracy:  {bad_tree.score(X_te, y_te):.2f}")
Expected Output
Unbound Train Accuracy: 0.99
Unbound Test Accuracy:  0.61

The unbound tree memorised the training noise: 99% train accuracy against 61% test accuracy, a 38-point gap.

Now constrain it using max_depth (limits tree depth) and min_samples_split (requires a minimum number of samples to split a node):

good_tree = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=42)
good_tree.fit(X_tr, y_tr)
print(f"Constrained Train Accuracy: {good_tree.score(X_tr, y_tr):.2f}")
print(f"Constrained Test Accuracy:  {good_tree.score(X_te, y_te):.2f}")
Expected Output
Constrained Train Accuracy: 0.72
Constrained Test Accuracy:  0.69

The train and test scores are now closely aligned, and test accuracy improved from 0.61 to 0.69, indicating the model generalises rather than memorises.
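Rather than guessing constraint values like max_depth=4, you can search for them with cross-validation. Here is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid is illustrative, and a built-in scikit-learn dataset is used so the example is self-contained:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any classification data works the same way
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Try each combination of constraints with 5-fold cross-validation
# and keep the one with the best average validation score
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_split": [2, 10, 20],
    },
    cv=5,
)
grid.fit(X_tr, y_tr)
print("Best constraints:", grid.best_params_)
print(f"Test accuracy: {grid.score(X_te, y_te):.2f}")
```

Because the constraints are chosen on validation folds rather than the test set, the final test score remains an honest estimate of generalisation.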

Method 2: Early Stopping (Iterative Models)

For Gradient Boosting or Neural Networks, you can monitor validation error during training and halt automatically when performance stops improving.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,
    validation_fraction=0.2,
    n_iter_no_change=5,
    tol=0.01,
    random_state=42
)
gb.fit(X_tr, y_tr)
print(f"Algorithm stopped at iteration: {gb.n_estimators_}")
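The same idea applies to neural networks. As a sketch, scikit-learn's MLPClassifier supports it via early_stopping=True; the synthetic dataset and the specific patience settings below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, random_state=42)

# Hold out 20% of the training data internally and stop when the
# validation score fails to improve for 10 consecutive epochs
mlp = MLPClassifier(
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42,
)
mlp.fit(X, y)
print(f"Stopped after {mlp.n_iter_} of a possible 1000 epochs")
```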

Method 3: Regularisation

For linear models, apply L1 (Lasso) or L2 (Ridge) penalties to shrink or eliminate coefficients. See the Regularisation Explained page for details.
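As a quick illustration (the linked page has the details), here is how an L2 penalty is applied in scikit-learn's LogisticRegression, where a smaller C means a stronger penalty; the dataset and C value below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Scale features so the penalty shrinks all coefficients fairly
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# L2 (Ridge) penalty; smaller C = stronger regularisation
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
model.fit(X_tr, y_tr)
print(f"Train accuracy: {model.score(X_tr, y_tr):.2f}")
print(f"Test accuracy:  {model.score(X_te, y_te):.2f}")
```

Swapping penalty="l1" (with solver="liblinear" or solver="saga") gives Lasso-style regularisation, which can drive coefficients exactly to zero.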

Workplace Tip

Regularisation is the preferred method for preventing overfitting in linear models; hyperparameter constraints (pruning) are the equivalent for tree-based models.

KSB Mapping

| KSB | Description | How This Addresses It |
|-----|-------------|-----------------------|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |