How to Compare Models Statistically¶

Comparing mean CV scores is not enough — you need a statistical test to determine whether the difference between two models is significant.

The Problem¶

Model A has a mean CV accuracy of 0.87 and Model B has 0.85. Is A genuinely better, or is the difference just noise from the random fold splits?

Solution: Paired t-Test on CV Folds¶

By using the same cross-validation folds for both models, each fold produces a paired observation. A paired t-test then determines whether the mean difference is statistically significant.

import numpy as np
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from scipy.stats import ttest_rel

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scores_rf = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")
scores_gb = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")

print(f"RF Mean: {scores_rf.mean():.4f} ± {scores_rf.std():.4f}")
print(f"GB Mean: {scores_gb.mean():.4f} ± {scores_gb.std():.4f}")

# Paired t-test
stat, p_value = ttest_rel(scores_rf, scores_gb)
print(f"\nPaired t-test p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference — choose the model with the higher mean.")
else:
    print("No significant difference — models are statistically equivalent.")

Interpretation¶

p-value	Conclusion
< 0.05	The difference is statistically significant
≥ 0.05	No evidence the models differ — prefer the simpler one

Common Pitfall

You must use the same CV folds for both models. Using different random splits invalidates the paired comparison.

KSB Mapping¶

KSB	Description	How This Addresses It
K4.4	Resource constraints and trade-offs	Balancing model complexity, performance, and computational cost
S1	Scientific methods and hypothesis testing	Rigorous cross-validation and statistical model comparison
S4	Building models and validating	Systematic hyperparameter tuning and performance evaluation
B5	Impartial, hypothesis-driven approach	Preventing overfitting; honest reporting of generalisation metrics