How to Run Regression Diagnostics

Regression metrics like \(R^2\) are global summaries: a single number cannot show where or how a model goes wrong. To evaluate a regression model properly, investigate the residuals, the errors left behind after prediction.

What is a Residual?

\[\text{Residual} = \text{True Value} - \text{Predicted Value}\]

If a model predicts that a house costs £100,000 but the actual value is £110,000, the residual is +£10,000. A positive residual means the model under-predicted; a negative residual means it over-predicted.

A well-fitted model produces residuals that are randomly scattered around zero with no visible pattern.
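
As a minimal sketch of the definition above, the snippet below computes residuals for three houses. The first pair of values comes from the example in the text; the other two are hypothetical and only there to show a negative and a zero residual.

```python
import numpy as np

# True and predicted prices in pounds.
# Only the first pair is from the text; the rest are made-up illustration values.
y_true = np.array([110_000, 250_000, 180_000])
y_pred = np.array([100_000, 260_000, 180_000])

# Residual = true value - predicted value
residuals = y_true - y_pred

for actual, predicted, residual in zip(y_true, y_pred, residuals):
    if residual > 0:
        verdict = "under-predicted"
    elif residual < 0:
        verdict = "over-predicted"
    else:
        verdict = "exact"
    print(f"actual={actual}, predicted={predicted}, residual={residual:+d} ({verdict})")
```

Keeping the sign convention fixed (true minus predicted) matters, because the direction of a systematic error is often as informative as its size.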

Plotting Residuals

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

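# Load the example data: predict the tip from the total bill
df = sns.load_dataset("tips")
X = df[["total_bill"]]
y = df["tip"]

# Hold out a test set so the residuals reflect unseen data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_tr, y_tr)
preds = lr.predict(X_te)

# Residual = true value - predicted value
residuals = y_te - preds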

# 1. Residual scatter plot
plt.figure()
sns.scatterplot(x=preds, y=residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (Expected: Random Scatter)")
plt.tight_layout()
plt.show()

# 2. Residual histogram
plt.figure()
sns.histplot(residuals, kde=True)
# Label the axis explicitly; otherwise it inherits the Series name ("tip")
plt.xlabel("Residuals")
plt.title("Residual Distribution (Expected: Normal / Gaussian)")
plt.tight_layout()
plt.show()
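
The plots can be read alongside a few numbers. The sketch below is one possible companion, not part of the original workflow: `summarise_residuals` is a hypothetical helper name, and the demo input is synthetic normal noise standing in for real residuals.

```python
import numpy as np
from scipy import stats


def summarise_residuals(residuals):
    """Return numeric summaries to read alongside the residual plots."""
    residuals = np.asarray(residuals, dtype=float)
    return {
        "mean": residuals.mean(),                      # close to 0 if unbiased
        "std": residuals.std(ddof=1),                  # typical size of an error
        "skew": stats.skew(residuals),                 # about 0 if symmetric
        "excess_kurtosis": stats.kurtosis(residuals),  # about 0 if normal-tailed
    }


# Synthetic stand-in residuals (illustration only, not the tips data)
rng = np.random.default_rng(42)
demo = summarise_residuals(rng.normal(0, 1, size=1_000))

for name, value in demo.items():
    print(f"{name}: {value:.3f}")
```

With the `residuals` computed in the plotting code, the call would be `summarise_residuals(residuals)`. These summaries supplement the plots rather than replace them: a curved or fan-shaped pattern can leave all four numbers looking unremarkable.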

Interpreting the Plots

Heteroscedasticity (Fan Shape): If the spread of the residuals widens as the predicted values increase, the model's error grows with the size of the prediction. Consider a log transform of the target variable, which requires a strictly positive target.
Curved Pattern: A systematic curve in the residuals indicates that the model is too simple for the data (high bias). Consider polynomial features or a non-linear algorithm.
Normal Distribution: Ideally, residuals follow a bell curve centred on zero. Heavy tails or skew suggest that outliers, or a target that needs transforming, may be distorting the fit.
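
Two of the remedies mentioned above can be sketched on synthetic data. Everything here is an assumption made for illustration: the data-generating formulas, the `spread_ratio` helper, and the choice of `np.log` as the transform. It is one way to try the ideas, not a prescribed procedure.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 400)
X = x.reshape(-1, 1)

# --- Curved pattern: quadratic target, fitted with and without x^2 ---
y_curved = 2.0 + 0.5 * x**2 + rng.normal(0, 1.0, size=x.size)
linear = LinearRegression().fit(X, y_curved)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y_curved)

# Correlation of the residuals with a centred squared term:
# near 1 means a curve is left in the residuals, near 0 means it is gone.
curve = (x - x.mean()) ** 2
curve_linear = np.corrcoef(y_curved - linear.predict(X), curve)[0, 1]
curve_poly = np.corrcoef(y_curved - poly.predict(X), curve)[0, 1]

# --- Fan shape: multiplicative noise, fitted on the raw and the log scale ---
y_fan = np.exp(0.5 + 0.2 * x + rng.normal(0, 0.3, size=x.size))


def spread_ratio(predictions, residuals):
    """Residual std for the upper half of predictions divided by the lower half."""
    upper = predictions > np.median(predictions)
    return residuals[upper].std() / residuals[~upper].std()


raw = LinearRegression().fit(X, y_fan)
ratio_raw = spread_ratio(raw.predict(X), y_fan - raw.predict(X))

# Fit on log(y); predictions are mapped back with exp automatically
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X, y_fan)
log_preds = logged.predict(X)
# Compare on the log scale, where the model was actually fitted
ratio_log = spread_ratio(log_preds, np.log(y_fan) - np.log(log_preds))

print(f"curve left in residuals: linear={curve_linear:.2f}, polynomial={curve_poly:.2f}")
print(f"spread ratio (upper/lower): raw={ratio_raw:.2f}, log={ratio_log:.2f}")
```

A spread ratio well above 1 corresponds to the fan shape; a ratio near 1 corresponds to an even band of residuals. In practice the residual plot should be redrawn after each change, since fixing one pattern can expose another.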

Workplace Tip

Always plot residuals before reporting \(R^2\). A high \(R^2\) can mask systematic patterns that residual analysis will reveal immediately.
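
The tip can be illustrated with a small synthetic example. The data below are made up (a quadratic relationship with noise) and the straight-line fit is deliberately the wrong model; the sketch only shows that a high \(R^2\) and strongly patterned residuals can coexist.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (assumed for illustration): quadratic relationship plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = x**2 + rng.normal(0, 2.0, size=x.size)
X = x.reshape(-1, 1)

# Fit a straight line to data that is actually curved
model = LinearRegression().fit(X, y)
preds = model.predict(X)
residuals = y - preds

r2 = r2_score(y, preds)

# Random-scatter residuals would be unrelated to any function of x.
# Here they are compared with the centred squared term the line cannot capture.
curve_corr = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]

print(f"R^2 = {r2:.3f}")
print(f"corr(residuals, centred x^2) = {curve_corr:.3f}")
```

On a residual plot this shows up as a U shape: the line over-predicts in the middle of the range and under-predicts at both ends, even though the headline metric looks strong.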

KSB Mapping

KSB	Description	How This Addresses It
K4.1	Statistical models and methods	Understanding the statistical basis of regression and classification
K4.2	ML and AI techniques	Implementing and comparing supervised learning algorithms
K4.4	Resource constraints and trade-offs	Model complexity vs interpretability; computational cost
S1	Scientific methods and hypothesis testing	Formulating hypotheses and testing with rigorous validation
S4	Building models and validating	Cross-validation, train/test evaluation, performance metrics
B5	Impartial, hypothesis-driven approach	Honest evaluation of model performance and limitations