Learning Curves¶
Learning curves plot model performance against training set size. They diagnose whether your model suffers from high bias (underfitting) or high variance (overfitting).
How to Read Them¶
| Pattern | Training Score | Validation Score | Diagnosis | Fix |
|---|---|---|---|---|
| Both low | Low | Low | High bias (underfitting) | Use a more complex model or add features |
| Big gap | High | Low | High variance (overfitting) | Get more data, regularise, or simplify the model |
| Both high, converging | High | High (close) | Good fit | You're done |
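The table's heuristics can be turned into a rough automated check on the mean scores returned by `learning_curve`. This is a minimal sketch: the function name and the threshold values (`good_score`, `gap_tol`) are illustrative assumptions, not standard defaults, and should be tuned to your metric and problem.

```python
import numpy as np

def diagnose(train_mean: np.ndarray, val_mean: np.ndarray,
             good_score: float = 0.85, gap_tol: float = 0.05) -> str:
    """Rough bias/variance diagnosis from learning-curve means.

    `good_score` and `gap_tol` are illustrative thresholds.
    """
    final_train = train_mean[-1]
    final_val = val_mean[-1]
    gap = final_train - final_val
    # Both curves plateau low -> the model lacks capacity
    if final_train < good_score and final_val < good_score:
        return "high bias (underfitting)"
    # Training score high but validation lags -> memorising the training set
    if gap > gap_tol:
        return "high variance (overfitting)"
    return "good fit"

# Example: training score near-perfect, validation score well behind
print(diagnose(np.array([0.99, 0.99, 0.99]),
               np.array([0.70, 0.75, 0.80])))
# -> high variance (overfitting)
```

Only the final point of each curve is inspected here; in practice you would also look at the trend, as discussed under "Interpreting Results" below.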
Implementation¶
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Score the model at 10 training-set sizes, each with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)

# Mean and spread across the 5 folds at each training size
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# Shaded bands show +/- one standard deviation across folds
plt.figure(figsize=(8, 5))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                 alpha=0.1, color="blue")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                 alpha=0.1, color="orange")
plt.plot(train_sizes, train_mean, "o-", label="Training Score", color="blue")
plt.plot(train_sizes, val_mean, "o-", label="Validation Score", color="orange")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.title("Learning Curve")
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()
```
Interpreting Results¶
- Converging curves with a small gap: Your model generalises well. More data is unlikely to help.
- Large gap that shrinks with more data: High variance — the model overfits but more data will help.
- Both curves plateau at a low score: High bias — a more powerful model or better features are needed.
> **Workplace Tip:** Always plot a learning curve before asking for more data. If the curves have already converged, collecting more data will not improve performance; you need a better model or better features.
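One way to apply this tip programmatically is to check whether the validation curve is still rising at the largest training sizes. The sketch below fits a line to the last few validation scores; the function name, the three-point window, and the `min_slope` threshold (score gain per extra training example) are illustrative assumptions.

```python
import numpy as np

def more_data_likely_helps(train_sizes: np.ndarray, val_mean: np.ndarray,
                           min_slope: float = 1e-5) -> bool:
    """Heuristic: is the validation score still climbing with more data?

    Fits a straight line to the last three (training size, validation
    score) points; a slope above `min_slope` suggests the curve has not
    yet plateaued. `min_slope` is an illustrative threshold.
    """
    slope = np.polyfit(train_sizes[-3:], val_mean[-3:], deg=1)[0]
    return bool(slope > min_slope)

sizes = np.array([100, 400, 700, 1000])

# Validation score still rising steadily -> more data may help
print(more_data_likely_helps(sizes, np.array([0.70, 0.78, 0.82, 0.85])))   # True

# Validation score has flattened out -> more data is unlikely to help
print(more_data_likely_helps(sizes, np.array([0.84, 0.84, 0.841, 0.8415])))  # False
```

A plateaued curve answers the "should we collect more data?" question cheaply, before anyone commissions an expensive labelling effort.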
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |