How to Compute Confidence Intervals for Model Performance
A single accuracy number is hard to interpret without a confidence interval. Report the range in which the true performance is likely to fall.
Why Confidence Intervals?
A model scoring 0.85 accuracy on one test set might score 0.82 or 0.88 on a different split. Confidence intervals quantify this uncertainty.
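To make this split-to-split variation concrete, here is a minimal simulation sketch rather than a real evaluation. It assumes a hypothetical model whose true accuracy is exactly 0.85 and a test set of 250 examples; both numbers, and the names `true_accuracy`, `n_test`, and `n_repeats`, are illustrative assumptions.

```python
import numpy as np

# Illustrative assumptions, not measurements from a real model.
true_accuracy = 0.85   # hypothetical long-run accuracy of the model
n_test = 250           # hypothetical test-set size
n_repeats = 10_000     # number of simulated test sets

rng = np.random.default_rng(0)

# Each simulated test set yields a count of correct predictions drawn from
# Binomial(n_test, true_accuracy); dividing by n_test gives observed accuracy.
observed = rng.binomial(n_test, true_accuracy, size=n_repeats) / n_test

# Range that contains the middle 95% of the simulated accuracies.
low, high = np.percentile(observed, [2.5, 97.5])
print(f"Mean observed accuracy: {observed.mean():.4f}")
print(f"Middle 95% of observed accuracies: [{low:.4f}, {high:.4f}]")
```

Under these assumptions the observed accuracy moves by a few percentage points between test sets purely from sampling noise, even though the underlying model never changes.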
Method: Bootstrap Resampling
Repeatedly resample the test-set predictions with replacement and compute the metric on each resample. The 2.5th and 97.5th percentiles of the resulting scores give an approximate 95% confidence interval.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
preds = model.predict(X_te)
# Bootstrap confidence interval
rng = np.random.default_rng(42)
n_bootstrap = 1000
scores = []
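for _ in range(n_bootstrap):
    # Draw test-set indices with replacement (same size as the test set)
    idx = rng.choice(len(y_te), size=len(y_te), replace=True)
    # y_te and preds are NumPy arrays here, so they can be indexed directly
    scores.append(accuracy_score(y_te[idx], preds[idx]))
# Percentile interval taken from the bootstrap distribution
lower = np.percentile(scores, 2.5)
upper = np.percentile(scores, 97.5)
# Report the point estimate on the full test set alongside the interval
print(f"Accuracy: {accuracy_score(y_te, preds):.4f}")
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")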
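The bootstrap procedure can also be wrapped in a small helper so that it works with any metric. This is a sketch, not part of scikit-learn: `bootstrap_ci`, `accuracy`, and their parameters are hypothetical names, and the labels below are synthetic stand-ins for `y_te` and `preds`.

```python
import numpy as np


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for metric(y_true, y_pred).

    Hypothetical helper: returns (point_estimate, lower, upper).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample indices with replacement, keeping label/prediction pairs aligned.
        idx = rng.integers(0, n, size=n)
        scores[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), lower, upper


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))


# Synthetic data for illustration: binary labels with roughly 15% of predictions flipped.
data_rng = np.random.default_rng(0)
y_true = data_rng.integers(0, 2, size=250)
flip = data_rng.random(250) < 0.15
y_pred = np.where(flip, 1 - y_true, y_true)

point, lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"Accuracy: {point:.4f}")
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

Passing the metric as a function means the same helper could be reused with, for example, `sklearn.metrics.f1_score`, since it only requires a callable that takes `(y_true, y_pred)`.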
Method: Cross-Validation Interval
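A simpler (less rigorous) approach uses the standard error of the mean score across CV folds. Fold scores are not independent because their training sets overlap, so treat the result as a rough approximation:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=10, scoring="accuracy")
mean = cv_scores.mean()
# Standard error of the mean: sample standard deviation / sqrt(number of folds)
se = cv_scores.std(ddof=1) / np.sqrt(len(cv_scores))
ci = 1.96 * se  # Half-width of an approximate 95% CI (normal approximation)
print(f"Accuracy: {mean:.4f} ± {ci:.4f}")
print(f"95% CI: [{mean - ci:.4f}, {mean + ci:.4f}]")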
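With only 10 folds, the normal critical value 1.96 tends to understate the interval width; a Student's t critical value with k - 1 degrees of freedom is a common adjustment. The sketch below assumes SciPy is installed (scikit-learn depends on it), uses the hypothetical helper name `cv_t_interval`, and runs on made-up fold scores purely for illustration. Because fold scores share training data, the result is still only an approximation.

```python
import numpy as np
from scipy import stats


def cv_t_interval(fold_scores, confidence=0.95):
    """Approximate t-based interval for the mean of k fold scores.

    Hypothetical helper: returns (mean, lower, upper).
    """
    scores = np.asarray(fold_scores, dtype=float)
    k = len(scores)
    mean = scores.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    se = scores.std(ddof=1) / np.sqrt(k)
    # Two-sided critical value from Student's t with k - 1 degrees of freedom.
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    half_width = t_crit * se
    return mean, mean - half_width, mean + half_width


# Made-up fold scores for illustration only; in practice pass the array
# returned by cross_val_score.
example_scores = [0.88, 0.91, 0.86, 0.90, 0.87, 0.89, 0.92, 0.85, 0.90, 0.88]

mean, lower, upper = cv_t_interval(example_scores)
print(f"Accuracy: {mean:.4f}")
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```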
Workplace Tip
Always report model performance as a range, not a point estimate. Stakeholders and EPA assessors will be more convinced by "accuracy of 0.85 ± 0.03" than "accuracy of 0.85".
KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |