How to Run Regression Diagnostics
Regression metrics like \(R^2\) are global summaries: a single number cannot show where or how a model goes wrong. To evaluate a regression model properly, inspect the residuals, the errors left behind after prediction.
What is a Residual?
\[\text{Residual} = \text{True Value} - \text{Predicted Value}\]
If a model predicts that a house costs £100,000 but the actual value is £110,000, the residual is +£10,000. A positive residual means the model under-predicted; a negative one means it over-predicted.
A well-fitted model produces residuals that are randomly scattered around zero with no visible pattern.
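The definition above can be sketched in a few lines. The first pair of values is the house example from the text; the other two are made-up numbers added purely for illustration.

```python
import numpy as np

# Illustrative house prices in pounds. Only the first pair comes from the text;
# the remaining values are hypothetical.
actual = np.array([110_000, 250_000, 180_000])
predicted = np.array([100_000, 260_000, 180_500])

# Residual = true value - predicted value
residuals = actual - predicted

for a, p, r in zip(actual.tolist(), predicted.tolist(), residuals.tolist()):
    direction = "under-predicted" if r > 0 else "over-predicted"
    print(f"actual {a:,}  predicted {p:,}  residual {r:+,}  ({direction})")
```

The sign carries information: positive residuals are under-predictions and negative residuals are over-predictions, which is why the plots in the next section are read relative to the zero line.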
Plotting Residuals
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
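# Load the example dataset (fetched by seaborn, so this needs network access on first use)
df = sns.load_dataset("tips")
X = df[["total_bill"]]  # single feature, kept 2-D for scikit-learn
y = df["tip"]           # continuous target
# Hold out a test set so the residuals reflect unseen data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_tr, y_tr)
preds = lr.predict(X_te)
# Residual = true value - predicted value
residuals = y_te - preds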
# 1. Residual scatter plot
plt.figure()
sns.scatterplot(x=preds, y=residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (Expected: Random Scatter)")
plt.tight_layout()
plt.show()
# 2. Residual histogram
plt.figure()
sns.histplot(residuals, kde=True)
plt.title("Residual Distribution (Expected: Normal / Gaussian)")
plt.tight_layout()
plt.show()
Interpreting the Plots
- Heteroscedasticity (Fan Shape): If the spread of the residuals widens as predicted values increase, the model's error variance grows with the size of the prediction. Consider a log transform of the target variable, provided the target is strictly positive.
- Curved Pattern: A systematic curve in the residuals indicates the model is too simple for the relationship (high bias). Consider polynomial features or a non-linear algorithm.
- Normal Distribution: Ideally, residuals follow a bell curve centred on zero. Heavy tails or skew suggest that outliers, or a target that needs transforming, are distorting the fit.
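The visual checks above can be backed by rough numbers. The sketch below is one possible approach, not a standard test: `spread_ratio`, `skewness`, and `fit_line` are hypothetical helper names, the data is synthetic (a positive target with multiplicative noise), and the fit uses `np.polyfit` rather than scikit-learn so the block stays self-contained. It compares a straight-line fit on the raw target with the same fit on the log of the target.

```python
import numpy as np

def spread_ratio(preds, residuals):
    """Residual std in the upper half of predictions divided by the lower half.

    A crude fan-shape heuristic: values well above 1 suggest the spread
    grows with the prediction; values near 1 suggest constant spread.
    """
    preds = np.asarray(preds)
    residuals = np.asarray(residuals)
    upper = preds > np.median(preds)
    return residuals[upper].std() / residuals[~upper].std()

def skewness(residuals):
    """Sample skewness: roughly 0 for a symmetric residual distribution."""
    r = np.asarray(residuals)
    z = (r - r.mean()) / r.std()
    return float(np.mean(z ** 3))

def fit_line(x, y):
    """Least-squares straight line; returns predictions and residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    preds = slope * x + intercept
    return preds, y - preds

# Synthetic data (an assumption for illustration): noise is multiplicative,
# so errors on the raw scale grow with the size of the target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.4, size=500))

preds_raw, resid_raw = fit_line(x, y)          # fit on the raw target
preds_log, resid_log = fit_line(x, np.log(y))  # fit on the log target

ratio_raw = spread_ratio(preds_raw, resid_raw)
ratio_log = spread_ratio(preds_log, resid_log)

print(f"raw target: spread ratio {ratio_raw:.2f}, skewness {skewness(resid_raw):.2f}")
print(f"log target: spread ratio {ratio_log:.2f}, skewness {skewness(resid_log):.2f}")
```

On data like this, the raw fit should show a spread ratio well above 1 while the log fit stays close to 1. Note that residuals from the log fit are in log units, so they describe relative rather than absolute error, and the numbers are a supplement to the plots rather than a replacement for them.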
Workplace Tip
Always plot residuals before reporting \(R^2\). A high \(R^2\) can mask systematic patterns that residual analysis will reveal immediately.
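As a hedged illustration of this tip, the sketch below fits a straight line to synthetic data that is actually quadratic. The data, the noise level, and the `curve_corr` check are assumptions chosen for the example; the check simply measures how strongly the residuals track a centred squared term, which is one way to put a number on the curved pattern a residual plot would show.

```python
import numpy as np

# Synthetic data (illustrative assumption): the true relationship is quadratic.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = x ** 2 + rng.normal(0, 1.0, size=x.size)

# Fit a straight line, which is the wrong functional form for this data.
slope, intercept = np.polyfit(x, y, deg=1)
preds = slope * x + intercept
residuals = y - preds

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Correlation between the residuals and a centred squared term.
# Near 0 when the linear form is adequate; near 1 when a curve is left over.
curve_corr = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]

print(f"R^2 of the straight-line fit: {r2:.3f}")
print(f"Residual correlation with squared term: {curve_corr:.3f}")
```

Here \(R^2\) comes out above 0.9, which looks healthy in isolation, yet the residuals are almost entirely explained by the missing squared term. A residual scatter plot of the same fit would show a clear U shape.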
KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |