Skip to content

Learning Curves

Learning curves plot model performance against training set size. They diagnose whether your model suffers from high bias (underfitting) or high variance (overfitting).

How to Read Them

Pattern Training Score Validation Score Diagnosis Fix
Both low Low Low High bias (underfitting) Use a more complex model or add features
Big gap High Low High variance (overfitting) Get more data, regularise, or simplify the model
Both high, converging High High (close) Good fit You're done

Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(8, 5))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color="blue")
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color="orange")
plt.plot(train_sizes, train_mean, "o-", label="Training Score", color="blue")
plt.plot(train_sizes, val_mean, "o-", label="Validation Score", color="orange")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.title("Learning Curve")
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

Interpreting Results

  • Converging curves with a small gap: Your model generalises well. More data is unlikely to help.
  • Large gap that shrinks with more data: High variance — the model overfits but more data will help.
  • Both curves plateau at a low score: High bias — a more powerful model or better features are needed.

Workplace Tip

Always plot a learning curve before asking for more data. If the curves have already converged, collecting more data will not improve performance — you need a better model or better features.

KSB Mapping

KSB Description How This Addresses It
K4.4 Resource constraints and trade-offs Balancing model complexity, performance, and computational cost
S1 Scientific methods and hypothesis testing Rigorous cross-validation and statistical model comparison
S4 Building models and validating Systematic hyperparameter tuning and performance evaluation
B5 Impartial, hypothesis-driven approach Preventing overfitting; honest reporting of generalisation metrics