Cross-Validation
A single train/test split gives you one noisy estimate of model performance. Cross-validation gives you \(k\) estimates, one per fold, and their mean and spread give a more reliable picture than any single split.
How k-Fold CV Works
- Split the data into \(k\) folds of (approximately) equal size.
- For each fold: train on the other \(k-1\) folds, test on the held-out fold.
- Average the \(k\) test scores.
Every observation is used for testing exactly once and for training \(k-1\) times.
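The steps above can be sketched as an explicit loop. This is a minimal illustration, not a replacement for `cross_val_score`: the synthetic dataset, the `LogisticRegression` model, and the counter arrays `test_counts` and `train_counts` are assumptions introduced here only to make the bookkeeping visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, chosen only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
test_counts = np.zeros(len(X), dtype=int)   # how often each row is held out
train_counts = np.zeros(len(X), dtype=int)  # how often each row is trained on

for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                      # train on the other k-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

mean_score = float(np.mean(fold_scores))  # average the k test scores
print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {mean_score:.4f}")
```

After the loop, every entry of `test_counts` is 1 and every entry of `train_counts` is `k - 1`, which is the property that distinguishes k-fold CV from repeated random splits.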
Implementation
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# StratifiedKFold preserves class proportions in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=cv,
    scoring="accuracy",
)
print(f"Fold scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
Choosing \(k\)
| \(k\) | Tradeoff |
|---|---|
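| 5 | Good default: reasonable balance between bias and variance of the performance estimate |
| 10 | Lower bias, typically higher variance, twice as many model fits as \(k=5\) |
| \(n\) (LOO) | Lowest bias, typically highest variance, \(n\) model fits (very expensive on large datasets) |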
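To get a feel for the computational side of this tradeoff, the sketch below runs the same model under the three settings in the table and reports the number of fits and the wall-clock time. The small synthetic dataset and the `LogisticRegression` model are assumptions chosen to keep leave-one-out fast; timings will differ on real data and say nothing about bias or variance by themselves.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

# Small synthetic dataset so that leave-one-out stays cheap
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

splitters = {
    "k=5": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "k=10": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    "k=n (LOO)": LeaveOneOut(),
}

results = {}
for name, cv in splitters.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    elapsed = time.perf_counter() - start
    results[name] = scores  # one score per fold, i.e. one per model fit
    print(f"{name:10s} fits={len(scores):3d} mean={scores.mean():.3f} time={elapsed:.2f}s")
```

With leave-one-out, each fold score is either 0 or 1 because the test set holds a single observation, so only the mean over all \(n\) folds is meaningful.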
RepeatedStratifiedKFold
For more stable estimates, repeat the k-Fold process multiple times with different random splits:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")
print(f"Mean over 50 scores (5 folds x 10 repeats): {scores.mean():.4f} ± {scores.std():.4f}")
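The flat array of scores can also be grouped by repeat to see how much the k-fold mean moves from one random partition to the next. The sketch below assumes the scores are returned repeat by repeat (all folds of the first repeat, then all folds of the second, and so on), which matches the order in which the splitter yields its splits; the dataset, the `LogisticRegression` model, and the smaller `n_repeats` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_splits, n_repeats = 5, 3
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Assumed ordering: one row per repeat, one column per fold within that repeat
per_repeat = scores.reshape(n_repeats, n_splits)
repeat_means = per_repeat.mean(axis=1)

print(f"Per-repeat means: {np.round(repeat_means, 4)}")
print(f"Spread across repeats: {repeat_means.std():.4f}")
print(f"Overall mean: {scores.mean():.4f}")
```

The spread of `repeat_means` reflects variability due to how the data happened to be partitioned, which a single k-fold run cannot show.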
Common Pitfall
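Do not preprocess (e.g., scale, encode) the entire dataset before cross-validation: statistics computed on the full data leak information from the held-out fold into training. Fit the preprocessor on the training folds only. Wrapping the preprocessor and the model in a scikit-learn Pipeline and passing that to cross_val_score does this automatically.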
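A minimal sketch of the leakage-safe pattern follows. The choice of `StandardScaler` and `LogisticRegression`, and the step names `"scaler"` and `"clf"`, are assumptions for illustration; any transformer and estimator can be combined the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# The scaler lives inside the pipeline, so for each fold it is fitted on the
# training portion only and then applied to the held-out portion.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

The leaky alternative would be calling `StandardScaler().fit_transform(X)` once on the full dataset and then cross-validating the model on the result.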
KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |