Ensemble Theory Explained¶
"The wisdom of the crowd" — aggregating many weak models into one strong model is almost always superior to relying on a single algorithm.
The Core Idea¶
A single Decision Tree is noisy and unstable: small changes in the training data can produce a completely different tree. However, if you train 100 slightly different trees and let them vote, individual errors cancel out and the collective prediction stabilises.
This principle underpins all ensemble methods.
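The variance-cancelling effect can be seen without any ML at all. A minimal pure-Python sketch (a simulation, not a real ensemble: each "model" is just an unbiased noisy estimate of a true value) shows that averaging 100 such estimates shrinks the spread by roughly a factor of √100:

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 10.0   # the quantity every model tries to predict
NOISE = 3.0         # per-model error spread
N_MODELS = 100
N_TRIALS = 1000

def noisy_model():
    """One 'weak' model: unbiased, but high variance."""
    return random.gauss(TRUE_VALUE, NOISE)

# Spread of a single model's prediction vs the ensemble average
single = [noisy_model() for _ in range(N_TRIALS)]
ensemble = [statistics.mean(noisy_model() for _ in range(N_MODELS))
            for _ in range(N_TRIALS)]

print(statistics.stdev(single))    # ≈ 3 (one model alone)
print(statistics.stdev(ensemble))  # ≈ 3 / sqrt(100), i.e. about 0.3
```

Real trees trained on the same data are correlated, so the reduction is less dramatic than for these independent estimates, which is why bagging works hard to decorrelate its base models.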
Bagging (Bootstrap Aggregating)¶
Each base model is trained on a random bootstrap sample (sampling with replacement) of the original dataset. Predictions are combined by majority vote (classification) or averaging (regression).
- Key algorithms: RandomForestClassifier / RandomForestRegressor
- Effect: reduces variance without increasing bias.
```python
from sklearn.ensemble import RandomForestClassifier

# 100 independent trees, each trained on a bootstrapped sample
rf = RandomForestClassifier(n_estimators=100, random_state=42)
```
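The two building blocks of bagging (bootstrap sampling and majority voting) are simple enough to sketch in plain Python. This is an illustration of the mechanics, not how scikit-learn implements them internally:

```python
import random
from collections import Counter

random.seed(0)

data = list(range(10))  # a toy dataset of 10 training examples

# A bootstrap sample: same size as the original, drawn WITH replacement,
# so some rows appear more than once and others not at all.
sample = random.choices(data, k=len(data))
print(sorted(sample))  # duplicates and gaps are expected

# Majority vote over the predictions of an ensemble of classifiers
votes = ["cat", "dog", "cat", "cat", "dog"]
prediction, count = Counter(votes).most_common(1)[0]
print(prediction)  # -> cat (3 votes out of 5)
```

On average a bootstrap sample contains about 63% of the unique rows; the left-out rows are what Random Forest uses for its "out-of-bag" error estimate.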
Boosting (Sequential Correction)¶
Each base model is trained sequentially, with each new model focusing on the mistakes of its predecessor. This progressively reduces bias.
- Key algorithms: GradientBoostingClassifier, XGBClassifier, LGBMClassifier
- Effect: reduces bias, but can overfit if not regularised (tune learning_rate, max_depth and n_estimators).
```python
from sklearn.ensemble import GradientBoostingClassifier

# 200 shallow trees added sequentially; learning_rate shrinks each step
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
```
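The "each model corrects its predecessor" loop can be made concrete with a toy gradient-boosting regressor. This pure-Python sketch (my own simplified illustration, not the sklearn algorithm) fits one-split decision stumps to the current residuals and adds them with a learning rate:

```python
def fit_stump(xs, residuals):
    """Best single-threshold stump: predict the residual mean on each side."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_estimators=50, learning_rate=0.3):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Each new stump is trained on what the ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]   # a step function to learn
model = boost(xs, ys)
print(round(model(1), 2), round(model(6), 2))  # ≈ 0 and ≈ 1
```

The learning rate is the regularisation lever mentioned above: smaller steps need more estimators but overfit the residual noise less.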
Stacking¶
Multiple heterogeneous models (e.g., a Logistic Regression, a Random Forest, and an SVM) each make predictions. A meta-learner then combines those predictions into a final output.
```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),
)
```
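The key idea is the data flow: the base models' predictions become the *features* the meta-learner trains on. A toy illustration (hand-written values, with a simple majority rule standing in for the LogisticRegression meta-learner above):

```python
# Each row holds one sample's base-model outputs: [rf_prediction, svm_prediction].
# The meta-learner sees these columns instead of the raw input features.
base_predictions = [
    [1, 0],   # base models disagree on this sample
    [1, 1],
    [0, 0],
]
labels = [1, 1, 0]

# Stand-in meta-learner: predict 1 when at least half the base models say 1
final = [int(sum(row) >= len(row) / 2) for row in base_predictions]
print(final)  # -> [1, 1, 0]
```

In practice the meta-learner is fitted on out-of-fold base predictions (as StackingClassifier does via internal cross-validation), so it never sees predictions the base models made on their own training rows.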
**Workplace Tip**
In production, Gradient Boosting frameworks (XGBoost, LightGBM) dominate tabular data competitions and real-world deployments because they combine high accuracy with built-in regularisation controls.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |