The Bias-Variance Tradeoff in Modelling

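In predictive modelling, bias and variance are two sources of prediction error that tend to pull in opposite directions: reducing one usually increases the other. The aim is not to sit halfway between them, but to choose the level of model complexity at which the total error on unseen data is lowest.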
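For squared-error loss, the tradeoff can be written as a decomposition of the expected prediction error at a point x. The notation here is an assumption added for illustration and is not taken from the original notes: the data are assumed to follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the fitted model, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model; the noise term sets a floor that no amount of tuning removes.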

High Bias (Underfitting)

An algorithm with high bias (e.g., Linear Regression on a non-linear problem) makes overly simplistic assumptions and ignores genuine complexity in the data.

Symptom: The model scores poorly on both the Training data and the Test data.
Analogy: Studying only Chapter 1 for a ten-chapter final exam — you lack the depth to answer most questions.
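The symptom above can be reproduced with a minimal sketch. The synthetic quadratic dataset, the random seeds, and the variable names are assumptions chosen for illustration; a straight line is fitted to a curved relationship and compared with a degree-2 polynomial model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a quadratic target with a little Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# High-bias model: a straight line cannot represent the curvature
linear = LinearRegression().fit(X_train, y_train)
linear_train_r2 = linear.score(X_train, y_train)
linear_test_r2 = linear.score(X_test, y_test)

# A model with enough flexibility for this particular problem
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic.fit(X_train, y_train)
quad_test_r2 = quadratic.score(X_test, y_test)

print(f"linear    R^2  train={linear_train_r2:.3f}  test={linear_test_r2:.3f}")
print(f"quadratic R^2  test={quad_test_r2:.3f}")
```

The linear model should score poorly on both splits, which is the signature of underfitting: more training data would not help, because the model family itself is too restrictive.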

High Variance (Overfitting)

An algorithm with high variance (e.g., an unconstrained Decision Tree) memorises every quirk and noise artefact in the training set, then generalises poorly to new data.

Symptom: The model scores very well on the Training data but markedly worse on the Validation/Test data.
Analogy: Memorising the exact wording of past exam papers rather than learning the underlying principles — any rephrased question defeats you.
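A minimal sketch of this symptom, assuming the same kind of noisy two-moons data used later in this section. The split and seeds are illustrative choices; the tree is left at scikit-learn's defaults, so its depth is not limited.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data; the noise is what an overfit model ends up memorising
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No max_depth: the tree keeps splitting until every training leaf is pure
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

The training accuracy is perfect because the tree has a leaf for every training point, while the held-out accuracy is lower; that difference is the overfitting gap.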

The Sweet Spot

The goal of model tuning is to find the point where the combined error from bias and variance is minimised. You achieve this by:

Increasing model complexity (reducing bias) — e.g., adding polynomial features, increasing tree depth.
Applying regularisation (reducing variance) — e.g., L1/L2 penalties, max_depth limits, early stopping.
Using cross-validation to measure generalisation performance at each complexity level.

from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
import numpy as np

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

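# Sweep max_depth to observe the bias-variance tradeoff
depths = np.arange(1, 20)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy"
)

# Each result has shape (n_depths, n_folds); average over folds for plotting
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)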

When you plot the mean training and validation scores against max_depth, a wide gap between the two curves indicates high variance, while low scores on both curves indicate high bias.
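One way to read the curves numerically rather than by eye is sketched below. The selection rule (take the depth with the highest mean validation accuracy) and the names `gap` and `best_depth` are assumptions for illustration, not a prescribed method.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
depths = np.arange(1, 20)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# Scores have shape (n_depths, n_folds); average across the folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# A growing train/validation gap signals rising variance
gap = train_mean - val_mean

# Illustrative rule: pick the depth with the best mean validation accuracy
best_depth = int(depths[np.argmax(val_mean)])

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
print("best max_depth by validation accuracy:", best_depth)
```

In the printed table, the gap should be small at shallow depths and widen as the tree deepens, while validation accuracy tends to peak somewhere in between.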

Assessment Connection

Explicitly discussing the bias-variance tradeoff in your EPA demonstrates you understand why you tuned hyperparameters, not just how.

KSB Mapping

KSB	Description	How This Addresses It
K4.1	Statistical models and methods	Understanding the statistical basis of regression and classification
K4.2	ML and AI techniques	Implementing and comparing supervised learning algorithms
K4.4	Resource constraints and trade-offs	Model complexity vs interpretability; computational cost
S1	Scientific methods and hypothesis testing	Formulating hypotheses and testing with rigorous validation
S4	Building models and validating	Cross-validation, train/test evaluation, performance metrics
B5	Impartial, hypothesis-driven approach	Honest evaluation of model performance and limitations