
How to Compare Models Statistically

Comparing mean CV scores is not enough: a statistical test is needed to judge whether the difference between two models is larger than fold-to-fold noise.

The Problem

Model A has a mean CV accuracy of 0.87 and Model B has 0.85. Is A genuinely better, or is the difference just noise from the random fold splits?

Solution: Paired t-Test on CV Folds

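When both models are evaluated on the same cross-validation folds, each fold yields a paired observation: the score of Model A and the score of Model B on identical data. A paired t-test then checks whether the mean of the per-fold differences is distinguishable from zero. Because the training sets of different folds overlap, the per-fold scores are not fully independent, so the resulting p-value tends to be optimistic and is best read as a guide rather than an exact error rate.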
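For reference, the statistic that a paired t-test computes can be written out as follows. The symbols are notation introduced here, not taken from the original text: a_i and b_i are the scores of the two models on fold i, and k is the total number of folds (5 splits × 3 repeats = 15 in the example below).

```latex
d_i = a_i - b_i, \qquad
\bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i, \qquad
s_d^2 = \frac{1}{k-1}\sum_{i=1}^{k} \left(d_i - \bar{d}\right)^2, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{k}}
```

Under the null hypothesis of no difference, and assuming independent, roughly normal differences, t follows a Student's t distribution with k − 1 degrees of freedom.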

import numpy as np
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from scipy.stats import ttest_rel

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scores_rf = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")
scores_gb = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")

print(f"RF Mean: {scores_rf.mean():.4f} ± {scores_rf.std():.4f}")
print(f"GB Mean: {scores_gb.mean():.4f} ± {scores_gb.std():.4f}")

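# Paired t-test on the per-fold scores (same folds, same order for both models)
stat, p_value = ttest_rel(scores_rf, scores_gb)
print(f"\nPaired t-test: t = {stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference: choose the model with the higher mean.")
else:
    # Failing to reject the null hypothesis is not proof of equivalence
    print("No significant difference: insufficient evidence that the models differ.")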
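Scores from overlapping training sets are correlated, which the plain paired t-test ignores. One commonly cited adjustment is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance by the ratio of test to training size. The sketch below is an illustration under stated assumptions, not part of the original recipe: the function name `corrected_resampled_ttest` and the example scores are hypothetical.

```python
import numpy as np
from scipy.stats import t as t_dist


def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled paired t-test (Nadeau-Bengio style) on per-fold scores.

    scores_a, scores_b: scores from the SAME folds, in the same order.
    n_train, n_test: number of samples in each training / test fold.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = diffs.size                      # total number of folds (splits x repeats)
    var = diffs.var(ddof=1)             # sample variance of the differences
    if var == 0.0:
        raise ValueError("all fold differences are identical; the test is undefined")
    # The n_test / n_train term widens the variance to account for fold overlap
    t_stat = diffs.mean() / np.sqrt((1.0 / k + n_test / n_train) * var)
    p_value = 2.0 * t_dist.sf(abs(t_stat), df=k - 1)   # two-sided p-value
    return t_stat, p_value


# Hypothetical per-fold accuracies, for illustration only
a = [0.88, 0.86, 0.87, 0.89, 0.85]
b = [0.85, 0.85, 0.84, 0.87, 0.84]
t_stat, p_value = corrected_resampled_ttest(a, b, n_train=800, n_test=200)
print(f"corrected t = {t_stat:.3f}, p-value = {p_value:.4f}")
```

For the 1000-sample, 5-split setup used in this guide, each fold has n_train=800 and n_test=200, so the same call could be made with scores_rf and scores_gb. The corrected statistic is always smaller in magnitude than the uncorrected one, which makes the test more conservative.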

Interpretation

| p-value | Conclusion |
|---------|------------|
| < 0.05  | The difference is statistically significant |
| ≥ 0.05  | No evidence the models differ; prefer the simpler one |
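A p-value does not say how large the difference is. As a complement to the thresholds above, the following sketch reports the mean per-fold difference with a t-based confidence interval. The function name `mean_diff_ci` and the example scores are hypothetical, and the interval rests on the same independence assumption as the plain paired t-test.

```python
import numpy as np
from scipy.stats import t as t_dist


def mean_diff_ci(scores_a, scores_b, confidence=0.95):
    """Mean per-fold difference (a - b) with a two-sided t-based interval."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = diffs.size
    sem = diffs.std(ddof=1) / np.sqrt(k)                 # standard error of the mean difference
    half_width = t_dist.ppf(0.5 + confidence / 2.0, df=k - 1) * sem
    mean = diffs.mean()
    return mean, mean - half_width, mean + half_width


# Hypothetical per-fold accuracies, for illustration only
a = [0.88, 0.86, 0.87, 0.89, 0.85]
b = [0.85, 0.85, 0.84, 0.87, 0.84]
mean, low, high = mean_diff_ci(a, b)
print(f"mean difference = {mean:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero agrees with a p-value below 0.05 from the plain paired t-test, and its width shows whether the difference is large enough to matter in practice.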

Common Pitfall

You must use the same CV folds for both models. Using different random splits invalidates the paired comparison.
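One way to guard against this pitfall is to materialise the fold indices once and pass the same list to every evaluation. The sketch below is a minimal illustration: the dataset size and the two classifiers are chosen only to keep it fast and are not the models compared earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Materialise the (train, test) index pairs once so both models see identical folds
splits = list(cv.split(X, y))

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splits)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=splits)

# Sanity check: with an integer random_state, calling split() again reproduces the folds
again = list(cv.split(X, y))
folds_match = all(
    np.array_equal(test_1, test_2)
    for (_, test_1), (_, test_2) in zip(splits, again)
)
print(f"folds reproducible: {folds_match}")
```

Passing a splitter with a fixed integer random_state to both calls, as in the main example, has the same effect. A splitter with shuffle=True and no random_state can produce different folds on each call, which breaks the pairing.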

KSB Mapping

| KSB  | Description | How This Addresses It |
|------|-------------|-----------------------|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1   | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4   | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5   | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |