Ensemble Theory Explained

"The wisdom of the crowd": aggregating many weak models into one stronger model often outperforms relying on a single algorithm, particularly when the individual models make different errors.

The Core Idea

A single decision tree is noisy and unstable: small changes in the training data can produce a completely different tree. However, if you train 100 slightly different trees and let them vote, their individual errors tend to cancel out, provided those errors are not strongly correlated, and the collective prediction stabilises.

This principle underpins all ensemble methods.
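As a rough illustration of why voting helps, the sketch below computes the probability that a strict majority of models is correct. It assumes an idealised setting that the text above does not claim: every model is right with the same probability and the models err independently. Real trees trained on overlapping data are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` and the value 0.6 are illustrative choices.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is correct with probability p_correct and that the
    models err independently (an idealisation, not a property of real trees).
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Compare a single weak model with increasingly large (odd-sized) committees.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the committee's accuracy rises with its size, which is the intuition behind letting many trees vote.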

Bagging (Bootstrap Aggregating)

Each base model is trained on a random bootstrap sample (sampling with replacement) of the original dataset. Predictions are combined by majority vote (classification) or averaging (regression).

Key algorithm: RandomForestClassifier / RandomForestRegressor
Effect: Reduces variance, typically with little or no increase in bias.

from sklearn.ensemble import RandomForestClassifier

# 100 independent trees, each trained on a bootstrapped sample
rf = RandomForestClassifier(n_estimators=100, random_state=42)
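To make the bootstrap-and-vote mechanism concrete, here is a minimal hand-rolled sketch of bagging with plain decision trees. The synthetic dataset, the number of trees (`n_trees`) and the random seeds are assumptions chosen for illustration. It is not how `RandomForestClassifier` is implemented internally; a random forest additionally considers only a random subset of features at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote can never tie
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Majority vote: predict class 1 where more than half of the trees voted 1.
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (votes[0] == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"first tree alone: {single_acc:.3f}, bagged vote: {bagged_acc:.3f}")
```

On most runs the bagged vote is at least as accurate as a typical single tree, though the size of the gap depends on the data and is not guaranteed.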

Boosting (Sequential Correction)

Base models are trained sequentially, with each new model focusing on the errors left by the ensemble built so far. This progressively reduces bias.

Key algorithms: GradientBoostingClassifier (scikit-learn), XGBClassifier (XGBoost), LGBMClassifier (LightGBM)
Effect: Reduces bias, but can overfit if not regularised (use learning_rate, max_depth, n_estimators).

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
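The idea of sequential correction can be sketched by hand for the simplest case, regression with squared error, where "the mistakes so far" are just the residuals. This is a simplified illustration and an assumption on my part, not the internals of `GradientBoostingClassifier` (which works on the gradients of a classification loss). The toy data, the 50 rounds and the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    # Add a shrunken correction; a small learning_rate acts as regularisation.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start {mse_history[0]:.4f}, end {mse_history[-1]:.4f}")
```

The training error falls with every round, which is also why boosting can overfit: without limits on the number of rounds, tree depth or step size, it will keep fitting noise.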

Stacking

Multiple heterogeneous models (e.g., a Logistic Regression, a Random Forest, and an SVM) each make predictions. A meta-learner then combines those predictions into a final output.

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression()
)
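A possible end-to-end usage of a stack like this is sketched below. The synthetic dataset, the seeds and `n_estimators=50` are assumptions for illustration. `cv=5` is written out explicitly: scikit-learn trains the meta-learner on cross-validated predictions of the base models, which limits leakage from the base models' training fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is fitted on out-of-fold base-model predictions
)
stack.fit(X_train, y_train)

# transform() exposes the base-model outputs that the meta-learner consumes.
meta_features = stack.transform(X_test)
test_acc = stack.score(X_test, y_test)
print("meta-feature matrix shape:", meta_features.shape)
print(f"stacked test accuracy: {test_acc:.3f}")
```

Inspecting `meta_features` is a quick way to check what the final estimator actually sees: one row per sample, with columns derived from each base model's output.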

Workplace Tip

Gradient boosting frameworks (XGBoost, LightGBM) are widely used in tabular data competitions and in production deployments because they combine high accuracy with built-in regularisation controls.
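The kind of regularisation controls referred to here can be sketched with scikit-learn's own `GradientBoostingClassifier`, used in place of XGBoost or LightGBM only to avoid extra dependencies. The parameter values are illustrative assumptions, not tuned recommendations, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # smaller steps, so each tree contributes less
    max_depth=3,              # shallow trees keep each learner weak
    subsample=0.8,            # fit each tree on a random 80% of the rows
    n_iter_no_change=10,      # stop early if the validation score stalls
    validation_fraction=0.1,  # share of training data held out for that check
    random_state=0,
)
gb.fit(X_train, y_train)

test_acc = gb.score(X_test, y_test)
print("boosting rounds actually used:", gb.n_estimators_)
print(f"test accuracy: {test_acc:.3f}")
```

With early stopping enabled, `n_estimators_` reports how many rounds were kept, which is often well below the configured maximum.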

KSB Mapping

KSB	Description	How This Addresses It
K4.1	Statistical models and methods	Understanding the statistical basis of regression and classification
K4.2	ML and AI techniques	Implementing and comparing supervised learning algorithms
K4.4	Resource constraints and trade-offs	Model complexity vs interpretability; computational cost
S1	Scientific methods and hypothesis testing	Formulating hypotheses and testing with rigorous validation
S4	Building models and validating	Cross-validation, train/test evaluation, performance metrics
B5	Impartial, hypothesis-driven approach	Honest evaluation of model performance and limitations