XGBoost & LightGBM¶
The undisputed champions of tabular data — XGBoost and LightGBM dominate Kaggle competitions and production ML systems alike.
What Makes Them Special?¶
Both are gradient boosting frameworks that build trees sequentially, with each new tree correcting the errors of its predecessors. They improve on scikit-learn's GradientBoostingClassifier with:
- Speed: Histogram-based splitting and parallel processing make them orders of magnitude faster.
- Regularisation: Built-in L1/L2 penalties on leaf weights prevent overfitting.
- Missing value handling: Both handle `NaN` values natively without imputation.
- Early stopping: Training halts automatically when validation performance stops improving.
XGBoost¶
```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    eval_metric="logloss",
    random_state=42,
)
xgb.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
print(f"XGBoost Accuracy: {xgb.score(X_te, y_te):.3f}")
```
LightGBM¶
LightGBM uses a leaf-wise growth strategy (rather than level-wise), which often converges faster and produces better accuracy on large datasets.
```python
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbose=-1,
)
lgb.fit(X_tr, y_tr, eval_set=[(X_te, y_te)])
print(f"LightGBM Accuracy: {lgb.score(X_te, y_te):.3f}")
```
XGBoost vs LightGBM¶
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Tree growth | Level-wise | Leaf-wise (faster convergence) |
| Speed | Fast | Faster on large datasets |
| Categorical support | Requires encoding | Native categorical support |
| Community | Larger, more mature | Growing rapidly |
Key Hyperparameters¶
| Parameter | Effect |
|---|---|
| `n_estimators` | Number of boosting rounds (trees) |
| `learning_rate` | Step size for each tree's contribution — lower values need more trees |
| `max_depth` | Maximum tree depth — controls complexity |
| `subsample` | Fraction of rows used per tree — adds randomness, reduces overfitting |
| `colsample_bytree` | Fraction of features used per tree |
| `reg_alpha` / `reg_lambda` | L1 / L2 regularisation on leaf weights |
Workplace Tip
Start with `learning_rate=0.1`, `max_depth=5`, and `n_estimators=200` with early stopping. Then use Optuna or GridSearchCV to fine-tune. Gradient-boosted trees tuned this way win more tabular competitions than any other model family.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced ML techniques | Tree-based models, ensemble methods, KNN, SVM |
| K4.4 | Trade-offs in selecting algorithms | Comparing parametric vs non-parametric approaches |
| S4 | ML and optimisation | Hyperparameter tuning, ensemble construction, model selection |
| B1 | Curiosity and creativity | Exploring when non-parametric methods outperform parametric ones |
| B5 | Integrity in presenting conclusions | Avoiding overfitting; honest reporting of generalisation performance |