How to Prune and Regularise Decision Trees¶
Preventing a Decision Tree from memorising noise requires constraining its growth — either before training (pre-pruning) or after (post-pruning).
Pre-Pruning (Constraint-Based)¶
Set hyperparameters that stop the tree from growing too deep during training:
| Parameter | Effect |
|---|---|
| `max_depth` | Limits the maximum depth (number of levels) of the tree |
| `min_samples_split` | Requires a minimum number of samples to split an internal node |
| `min_samples_leaf` | Requires a minimum number of samples in each leaf node |
| `max_features` | Limits the number of features considered at each split |
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Pre-pruned tree: depth and split-size constraints applied during training
model = DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42)
model.fit(X_tr, y_tr)
print(f"Train: {model.score(X_tr, y_tr):.2f}  Test: {model.score(X_te, y_te):.2f}")
```
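To see why these constraints matter, it helps to compare against an unconstrained baseline. The sketch below (using the same synthetic data as above) contrasts the train/test gap of a fully grown tree with the pre-pruned one; the fully grown tree memorises the training set, so its training accuracy is 1.0 while its test accuracy lags behind.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Fully grown tree: splits until every leaf is pure, so train accuracy is 1.0
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
gap_full = full.score(X_tr, y_tr) - full.score(X_te, y_te)

# Pre-pruned tree: the same constraints used above
constrained = DecisionTreeClassifier(max_depth=5, min_samples_split=10,
                                     random_state=42).fit(X_tr, y_tr)
gap_constrained = constrained.score(X_tr, y_tr) - constrained.score(X_te, y_te)

print(f"Unpruned train/test gap: {gap_full:.2f}")
print(f"Pre-pruned train/test gap: {gap_constrained:.2f}")
```

A smaller gap indicates less overfitting; the exact numbers depend on the dataset and split.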
Post-Pruning (Cost-Complexity / ccp_alpha)¶
Post-pruning grows a full tree first, then collapses its weakest branches: any subtree whose effective cost-complexity measure falls below a threshold (ccp_alpha) is removed. A higher ccp_alpha prunes more aggressively, trading training accuracy for a simpler tree.
```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Compute the sequence of effective alphas for the full tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
alphas = path.ccp_alphas

# Find the optimal ccp_alpha via cross-validation
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=42),
                          X_tr, y_tr, cv=5).mean() for a in alphas]
best_alpha = alphas[np.argmax(scores)]
print(f"Best ccp_alpha: {best_alpha:.4f}")

# Train the final pruned tree
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
pruned.fit(X_tr, y_tr)
print(f"Pruned test accuracy: {pruned.score(X_te, y_te):.2f}")
```
Workplace Tip
Pre-pruning is simpler and faster; post-pruning (ccp_alpha) is more principled. In practice, use GridSearchCV to tune either set of parameters systematically.
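As a sketch of that tip, the example below tunes pre-pruning and post-pruning parameters together with GridSearchCV; the parameter grid values here are illustrative assumptions, not recommended defaults.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Illustrative grid mixing pre-pruning (max_depth, min_samples_leaf)
# and post-pruning (ccp_alpha) parameters
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

GridSearchCV refits the best estimator on the full training set, so `search` can be used directly for prediction afterwards.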
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced ML techniques | Tree-based models, ensemble methods, KNN, SVM |
| K4.4 | Trade-offs in selecting algorithms | Comparing parametric vs non-parametric approaches |
| S4 | ML and optimisation | Hyperparameter tuning, ensemble construction, model selection |
| B1 | Curiosity and creativity | Exploring when non-parametric methods outperform parametric ones |
| B5 | Integrity in presenting conclusions | Avoiding overfitting; honest reporting of generalisation performance |