XGBoost & LightGBM¶
The undisputed champions of tabular data — XGBoost and LightGBM dominate Kaggle competitions and production ML systems alike.
What Makes Them Special?¶
Both are gradient boosting frameworks that build trees sequentially, with each new tree correcting the errors of its predecessors. They improve on scikit-learn's GradientBoostingClassifier with:
- Speed: Histogram-based splitting and parallel processing make them orders of magnitude faster.
- Regularisation: Built-in L1/L2 penalties on leaf weights prevent overfitting.
- Missing value handling: Both handle `NaN` values natively without imputation.
- Early stopping: Training halts automatically when validation performance stops improving.
XGBoost¶
```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    eval_metric="logloss",
    random_state=42,
)
xgb.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
print(f"XGBoost Accuracy: {xgb.score(X_te, y_te):.3f}")
```
LightGBM¶
LightGBM uses a leaf-wise growth strategy (rather than level-wise), which often converges faster and produces better accuracy on large datasets.
```python
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbose=-1,
)
lgb.fit(X_tr, y_tr, eval_set=[(X_te, y_te)])
print(f"LightGBM Accuracy: {lgb.score(X_te, y_te):.3f}")
```
XGBoost vs LightGBM¶
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Tree growth | Level-wise | Leaf-wise (faster convergence) |
| Speed | Fast | Faster on large datasets |
| Categorical support | Requires encoding | Native categorical support |
| Community | Larger, more mature | Growing rapidly |
Key Hyperparameters¶
| Parameter | Effect |
|---|---|
| `n_estimators` | Number of boosting rounds (trees) |
| `learning_rate` | Step size for each tree's contribution — lower values need more trees |
| `max_depth` | Maximum tree depth — controls complexity |
| `subsample` | Fraction of rows used per tree — adds randomness, reduces overfitting |
| `colsample_bytree` | Fraction of features used per tree |
| `reg_alpha` / `reg_lambda` | L1 / L2 regularisation on leaf weights |
Workplace Tip
Start with `learning_rate=0.1`, `max_depth=5`, and `n_estimators=200` with early stopping. Then use Optuna or GridSearchCV to fine-tune. Gradient-boosted trees tuned this way win more tabular competitions than any other model family.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced ML techniques | Tree-based models, ensemble methods, KNN, SVM |
| K4.4 | Trade-offs in selecting algorithms | Comparing parametric vs non-parametric approaches |
| S4 | ML and optimisation | Hyperparameter tuning, ensemble construction, model selection |
| B1 | Curiosity and creativity | Exploring when non-parametric methods outperform parametric ones |
| B5 | Integrity in presenting conclusions | Avoiding overfitting; honest reporting of generalisation performance |