How to Prune and Regularise Decision Trees¶
Preventing a Decision Tree from memorising noise requires constraining its growth — either before training (pre-pruning) or after (post-pruning).
Pre-Pruning (Constraint-Based)¶
Set hyperparameters that stop the tree from growing too deep during training:
| Parameter | Effect |
|---|---|
| `max_depth` | Limits the maximum depth (number of levels) of the tree |
| `min_samples_split` | Requires a minimum number of samples to split an internal node |
| `min_samples_leaf` | Requires a minimum number of samples in each leaf node |
| `max_features` | Limits the number of features considered at each split |
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Pre-pruned tree: depth and split-size constraints applied during training
model = DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42)
model.fit(X_tr, y_tr)
print(f"Train: {model.score(X_tr, y_tr):.2f}  Test: {model.score(X_te, y_te):.2f}")
```
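To see why these constraints matter, it helps to compare against an unconstrained baseline. The sketch below (using the same synthetic data as above) contrasts the train/test gap of a fully grown tree with the pre-pruned one; the fully grown tree memorises the training set, so its training accuracy is 1.0 while its test accuracy lags behind.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Fully grown tree: splits until every leaf is pure, so train accuracy is 1.0
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
gap_full = full.score(X_tr, y_tr) - full.score(X_te, y_te)

# Pre-pruned tree: the same constraints used above
constrained = DecisionTreeClassifier(max_depth=5, min_samples_split=10,
                                     random_state=42).fit(X_tr, y_tr)
gap_constrained = constrained.score(X_tr, y_tr) - constrained.score(X_te, y_te)

print(f"Unpruned train/test gap: {gap_full:.2f}")
print(f"Pre-pruned train/test gap: {gap_constrained:.2f}")
```

A smaller gap indicates less overfitting; the exact numbers depend on the dataset and split.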
Post-Pruning (Cost-Complexity / ccp_alpha)¶
Post-pruning grows a full tree first, then collapses its weakest branches: any subtree whose effective cost-complexity measure falls below a threshold (ccp_alpha) is removed. A higher ccp_alpha prunes more aggressively, trading training accuracy for a simpler tree.
```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Compute the sequence of effective alphas for the full tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
alphas = path.ccp_alphas

# Find the optimal ccp_alpha via cross-validation
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=42),
                          X_tr, y_tr, cv=5).mean() for a in alphas]
best_alpha = alphas[np.argmax(scores)]
print(f"Best ccp_alpha: {best_alpha:.4f}")

# Train the final pruned tree
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
pruned.fit(X_tr, y_tr)
print(f"Pruned test accuracy: {pruned.score(X_te, y_te):.2f}")
```
Workplace Tip
Pre-pruning is simpler and faster; post-pruning (ccp_alpha) is more principled. In practice, use GridSearchCV to tune either set of parameters systematically.
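As a sketch of that tip, the example below tunes pre-pruning and post-pruning parameters together with GridSearchCV; the parameter grid values here are illustrative assumptions, not recommended defaults.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Illustrative grid mixing pre-pruning (max_depth, min_samples_leaf)
# and post-pruning (ccp_alpha) parameters
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

GridSearchCV refits the best estimator on the full training set, so `search` can be used directly for prediction afterwards.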
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced ML techniques | Tree-based models, ensemble methods, KNN, SVM |
| K4.4 | Trade-offs in selecting algorithms | Comparing parametric vs non-parametric approaches |
| S4 | ML and optimisation | Hyperparameter tuning, ensemble construction, model selection |
| B1 | Curiosity and creativity | Exploring when non-parametric methods outperform parametric ones |
| B5 | Integrity in presenting conclusions | Avoiding overfitting; honest reporting of generalisation performance |