How to Prevent Overfitting

Overfitting occurs when a model memorises the noise and anomalies in its training set instead of learning the underlying pattern, so it scores well on training data but poorly on unseen data.

Method 1: Hyperparameter Constraints (Pruning)

Unconstrained models (decision trees, neural networks) will keep growing in complexity until training error approaches zero, memorising the data in the process. You must restrict their growth.

import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = sns.load_dataset("titanic").dropna(subset=["age", "fare", "survived"])
X = df[["age", "fare"]]
y = df["survived"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Unbound Tree (Overfitting)
bad_tree = DecisionTreeClassifier(random_state=42)
bad_tree.fit(X_tr, y_tr)
print(f"Unbound Train Accuracy: {bad_tree.score(X_tr, y_tr):.2f}")
print(f"Unbound Test Accuracy:  {bad_tree.score(X_te, y_te):.2f}")
Expected Output
Unbound Train Accuracy: 0.99
Unbound Test Accuracy:  0.61

The unbound tree memorised the training noise: 99% train accuracy against 61% test accuracy, a 38-point gap.

Now constrain it using max_depth (limits tree depth) and min_samples_split (requires a minimum number of samples to split a node):

good_tree = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=42)
good_tree.fit(X_tr, y_tr)
print(f"Constrained Train Accuracy: {good_tree.score(X_tr, y_tr):.2f}")
print(f"Constrained Test Accuracy:  {good_tree.score(X_te, y_te):.2f}")
Expected Output
Constrained Train Accuracy: 0.72
Constrained Test Accuracy:  0.69

The train and test scores are now closely aligned, and test accuracy improved from 0.61 to 0.69, indicating the model generalises rather than memorises.
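Rather than guessing constraint values like max_depth=4, you can search for them with cross-validation. Here is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid is illustrative, and a built-in scikit-learn dataset is used so the example is self-contained:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any classification data works the same way
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Try each combination of constraints with 5-fold cross-validation
# and keep the one with the best average validation score
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_split": [2, 10, 20],
    },
    cv=5,
)
grid.fit(X_tr, y_tr)
print("Best constraints:", grid.best_params_)
print(f"Test accuracy: {grid.score(X_te, y_te):.2f}")
```

Because the constraints are chosen on validation folds rather than the test set, the final test score remains an honest estimate of generalisation.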

Method 2: Early Stopping (Iterative Models)

For Gradient Boosting or Neural Networks, you can monitor validation error during training and halt automatically when performance stops improving.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,
    validation_fraction=0.2,
    n_iter_no_change=5,
    tol=0.01,
    random_state=42
)
gb.fit(X_tr, y_tr)
print(f"Algorithm stopped at iteration: {gb.n_estimators_}")
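The same idea applies to neural networks. As a sketch, scikit-learn's MLPClassifier supports it via early_stopping=True; the synthetic dataset and the specific patience settings below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, random_state=42)

# Hold out 20% of the training data internally and stop when the
# validation score fails to improve for 10 consecutive epochs
mlp = MLPClassifier(
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42,
)
mlp.fit(X, y)
print(f"Stopped after {mlp.n_iter_} of a possible 1000 epochs")
```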

Method 3: Regularisation

For linear models, apply L1 (Lasso) or L2 (Ridge) penalties to shrink or eliminate coefficients. See the Regularisation Explained page for details.
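As a quick illustration (the linked page has the details), here is how an L2 penalty is applied in scikit-learn's LogisticRegression, where a smaller C means a stronger penalty; the dataset and C value below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Scale features so the penalty shrinks all coefficients fairly
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# L2 (Ridge) penalty; smaller C = stronger regularisation
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
model.fit(X_tr, y_tr)
print(f"Train accuracy: {model.score(X_tr, y_tr):.2f}")
print(f"Test accuracy:  {model.score(X_te, y_te):.2f}")
```

Swapping penalty="l1" (with solver="liblinear" or solver="saga") gives Lasso-style regularisation, which can drive coefficients exactly to zero.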

Workplace Tip

Regularisation is the preferred method for preventing overfitting in linear models; hyperparameter constraints (pruning) are the equivalent for tree-based models.

KSB Mapping

| KSB | Description | How This Addresses It |
|-----|-------------|-----------------------|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |