Gradient Boosting

A Random Forest trains its trees (for example, 100 of them) independently of one another, so they can be built in parallel. Gradient Boosting trains its trees sequentially: each new tree attempts to correct the errors that remain after all the previous trees have been combined.

What You Will Learn

Differentiate Bagging (Random Forest) from Boosting
Train a GradientBoostingClassifier
Observe staged learning progression

Prerequisites

Completed the Random Forests module

Step 1: The Intuition of Boosting

In Random Forests, if Tree #1 is completely wrong about Row 5, Tree #2 does not care. They train independently.

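In Gradient Boosting, the algorithm trains sequentially:

1. Tree 1 calculates a prediction for all rows. It registers a large error on Row 5.
2. Tree 2 does not predict the original target. Instead, it is fitted to the errors (residuals) that Tree 1 leaves across all rows, so rows with large errors, such as Row 5, have the most influence on its splits.
3. Tree 3 is fitted to the residuals that remain after combining Tree 1 and Tree 2, and so on for every later tree.

The final prediction is the sum of an initial constant guess and every tree's output, each scaled by the learning rate. Strictly speaking, each tree is fitted to the negative gradient of the loss function; for squared error, this is exactly the residual.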
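The sequence above can be sketched by hand for a regression problem with squared error, where "the error" is simply the residual. This is a minimal illustration rather than scikit-learn's actual implementation; the toy sine-wave data, the `learning_rate` of 0.5 and the count of 20 trees are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption): a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.5  # how much of each tree's correction is applied
n_trees = 20

# Stage 0: start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_trees):
    residuals = y - prediction          # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)              # the new tree targets the residuals, not y
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_stage[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_per_stage[-1]:.4f}")
```

Each pass through the loop is one boosting stage: the residuals shrink, so later trees have progressively smaller corrections left to make.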

Step 2: Implementation

scikit-learn provides GradientBoostingClassifier in its sklearn.ensemble module.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

# 1. Synthesize non-linear data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2. Instantiate and train
# learning_rate scales each tree's contribution to the ensemble:
# smaller values learn more cautiously but usually need more trees.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42)
gbc.fit(X_train, y_train)

preds = gbc.predict(X_test)
print(f"Algorithm Accuracy: {accuracy_score(y_test, preds):.2f}")

Expected Output

Algorithm Accuracy: 0.88
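To get a feel for what learning_rate does, the sketch below refits the same model with three different values and prints training and test accuracy for each. The values 0.01, 0.1 and 1.0 are illustrative assumptions, not recommendations, and the `results` dictionary is a helper introduced only for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic data as in the main example
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for lr in (0.01, 0.1, 1.0):  # illustrative values only
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=2, random_state=42
    )
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[lr] = (train_acc, test_acc)
    print(f"learning_rate={lr}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy at a high learning rate would suggest the ensemble is fitting noise; a very low learning rate may simply need more than 100 trees to catch up.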

Step 3: Visualising Convergence

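Because Gradient Boosting adds trees one at a time, we can evaluate the ensemble after every stage and chart how the misclassification error changes as more trees are added. The training error generally falls as trees accumulate; the test error typically falls at first and then flattens or starts to rise.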

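# Record the misclassification error (1 - accuracy) after each boosting stage.
# staged_predict_proba yields the class probabilities after 1, 2, ..., 100 trees.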
train_loss = np.zeros(100)
test_loss = np.zeros(100)

for i, y_pred in enumerate(gbc.staged_predict_proba(X_train)):
    train_loss[i] = 1 - accuracy_score(y_train, np.argmax(y_pred, axis=1))

for i, y_pred in enumerate(gbc.staged_predict_proba(X_test)):
    test_loss[i] = 1 - accuracy_score(y_test, np.argmax(y_pred, axis=1))

plt.figure(figsize=(8, 5))
plt.plot(np.arange(100) + 1, train_loss, 'b-', label='Training Error')
plt.plot(np.arange(100) + 1, test_loss, 'r-', label='Test Error')
plt.title('Gradient Boosting Convergence (Trees vs Error)')
plt.xlabel('Boosting Iterations (Trees)')
plt.ylabel('Misclassification Error')
plt.legend()
plt.tight_layout()
plt.show()

Expected Plot

Gradient Convergence
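Misclassification error moves in coarse steps, so it can hide gradual changes in the model's confidence. As a complementary view, the sketch below computes the log loss after each stage with sklearn.metrics.log_loss; it refits the same model so that it can run on its own. The array names `train_log_loss` and `test_log_loss` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Same data and model settings as in the main example
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42
).fit(X_train, y_train)

# Log loss of the partial ensemble after each boosting stage
train_log_loss = np.array(
    [log_loss(y_train, proba) for proba in gbc.staged_predict_proba(X_train)]
)
test_log_loss = np.array(
    [log_loss(y_test, proba) for proba in gbc.staged_predict_proba(X_test)]
)

# Stage (1-based) at which the held-out log loss is lowest
lowest_stage = int(np.argmin(test_log_loss)) + 1
print(f"Training log loss: {train_log_loss[0]:.3f} -> {train_log_loss[-1]:.3f}")
print(f"Lowest held-out log loss at {lowest_stage} trees")
```

These arrays can be plotted exactly like the error curves. Reading the best stage off the test set is acceptable for illustration only; a real tuning decision should use a separate validation set.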

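In a Random Forest, adding more trees generally does not make the model overfit. A boosted ensemble is different: it can overfit if you keep adding trees, because each new tree fits the remaining training errors ever more closely. Tune n_estimators with early stopping, halting training once the error on held-out data stops improving. Base that decision on a separate validation set rather than the test set, so the test set remains an unbiased estimate of final performance; the test_loss curve above only illustrates the pattern. GradientBoostingClassifier supports this directly through its n_iter_no_change and validation_fraction parameters.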
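A minimal sketch of built-in early stopping follows. The ceiling of 1000 trees, the 20% validation fraction and the patience of 10 iterations are assumptions chosen for illustration; `gbc_es` is a name introduced here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbc_es = GradientBoostingClassifier(
    n_estimators=1000,        # upper limit; early stopping may use far fewer
    learning_rate=0.1,
    max_depth=2,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 stages without validation improvement
    random_state=42,
)
gbc_es.fit(X_train, y_train)

# n_estimators_ reports how many trees were actually fitted
print(f"Trees fitted before stopping: {gbc_es.n_estimators_}")
print(f"Test accuracy: {accuracy_score(y_test, gbc_es.predict(X_test)):.2f}")
```

Because the validation split is carved out of X_train internally, the test set plays no part in deciding when to stop.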

KSB Mapping

KSB	Description	How This Addresses It
K4.1	Statistical models and methods	Understanding the statistical basis of regression and classification
K4.2	ML and AI techniques	Implementing and comparing supervised learning algorithms
K4.4	Resource constraints and trade-offs	Model complexity vs interpretability; computational cost
S1	Scientific methods and hypothesis testing	Formulating hypotheses and testing with rigorous validation
S4	Building models and validating	Cross-validation, train/test evaluation, performance metrics
B5	Impartial, hypothesis-driven approach	Honest evaluation of model performance and limitations