How to Choose the Right \(k\) Value
The \(k\) hyperparameter in k-Nearest Neighbours sets how many neighbours vote on each prediction. A low \(k\) tends to overfit and a high \(k\) tends to underfit, so choosing the right value is critical.
The Tradeoff
| \(k\) Value | Behaviour | Risk |
|---|---|---|
| Low (e.g., 1–3) | Highly sensitive to individual data points | Overfitting: noisy, jagged decision boundaries |
| High (e.g., 50+) | Over-smoothed, ignores local patterns | Underfitting: as \(k\) approaches the training-set size, predictions collapse to the majority class |
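The two rows of the table can be seen by comparing training and test accuracy at different \(k\) values. The sketch below is illustrative only: the label-noise setting (`flip_y=0.1`), the 70/30 split, and the three \(k\) values are assumptions chosen for the demonstration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data; flip_y randomly reassigns a fraction of labels to simulate noise.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Compare a very low, a moderate, and a very high k (350 training samples available).
results = {}
for k in (1, 15, 301):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With \(k = 1\) every training point is its own nearest neighbour, so training accuracy is perfect regardless of noise. A large gap between training and test accuracy is the usual sign of overfitting, while low accuracy on both suggests underfitting.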
The Elbow Method
Sweep a range of \(k\) values using cross-validation and plot the mean accuracy against \(k\). Choose the smallest \(k\) at which accuracy levels off near its best value: the "elbow" of the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # synthetic binary dataset
k_range = range(1, 31)  # candidate k values 1..30
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()  # mean 5-fold accuracy
    for k in k_range
]
plt.figure(figsize=(8, 4))
plt.plot(k_range, scores, marker="o")
plt.xlabel("k")
plt.ylabel("Cross-Validated Accuracy")
plt.title("Elbow Plot for k Selection")
plt.tight_layout()
plt.show()
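Reading the elbow off a plot is subjective, so it can help to make the rule explicit. The sketch below uses one possible heuristic: pick the smallest \(k\) whose score is within a tolerance of the best score. The helper `smallest_k_on_plateau`, its `tolerance` default, and the example accuracies are all hypothetical and made up for illustration; they are not scikit-learn functionality or measured results.

```python
def smallest_k_on_plateau(k_values, mean_scores, tolerance=0.01):
    """Return the smallest k whose score is within `tolerance` of the best score."""
    best = max(mean_scores)
    # Walk the candidates in ascending order of k and stop at the first one on the plateau.
    for k, score in sorted(zip(k_values, mean_scores)):
        if score >= best - tolerance:
            return k


# Made-up accuracies that rise and then flatten out (not real results).
example_k = [1, 3, 5, 7, 9, 11]
example_scores = [0.80, 0.86, 0.895, 0.90, 0.90, 0.89]

chosen_k = smallest_k_on_plateau(example_k, example_scores)
print(f"Chosen k: {chosen_k}")
```

With the sweep above, the same helper could be called as `smallest_k_on_plateau(list(k_range), scores)`. A wider tolerance favours smaller \(k\), and a tolerance of zero simply returns the smallest \(k\) that achieves the best score.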
Using GridSearchCV
For a more automated approach, let scikit-learn select the best \(k\) for you:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11, 15, 21]},  # odd candidate values only
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)  # reuses X, y from the elbow-method example
print(f"Best k: {grid.best_params_['n_neighbors']}")
print(f"Best CV Accuracy: {grid.best_score_:.3f}")
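The best cross-validated score comes from the same data used to pick \(k\), so it tends to be slightly optimistic. One common way to get a more honest estimate is to hold out a test set that the search never sees. The following is a minimal sketch under that assumption; the 80/20 split and the variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the data; the grid search only ever sees the training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidate_ks = [3, 5, 7, 9, 11, 15, 21]
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_ks},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["n_neighbors"]
# By default GridSearchCV refits the best model on all of X_train, so score() uses that model.
test_accuracy = grid.score(X_test, y_test)

print(f"Best k: {best_k}")
print(f"CV accuracy (used for selection): {grid.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Reporting the held-out figure alongside the cross-validated one makes it clear how well the tuned model generalises to data that played no part in choosing \(k\).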
Workplace Tip
In binary classification, use an odd value for \(k\) so that a simple majority vote can never tie. With three or more classes, ties remain possible even when \(k\) is odd.
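To see why an odd \(k\) matters for two classes, the vote can be counted by hand. The sketch below is a plain-Python illustration of unweighted majority voting; the `vote` helper and the neighbour labels are hypothetical, and it does not reproduce how any particular library breaks ties.

```python
from collections import Counter


def vote(neighbour_labels):
    """Return the label(s) with the most votes and whether the vote is tied."""
    counts = Counter(neighbour_labels).most_common()
    top_count = counts[0][1]
    winners = sorted(label for label, count in counts if count == top_count)
    return winners, len(winners) > 1


# k = 4: two neighbours from each class, so the vote is tied.
even_winners, even_tied = vote([0, 1, 1, 0])

# k = 5: with two classes and an odd k, one class must hold a majority.
odd_winners, odd_tied = vote([0, 1, 1, 0, 1])

print(f"k=4 -> winners={even_winners}, tied={even_tied}")
print(f"k=5 -> winners={odd_winners}, tied={odd_tied}")
```

When a tie does occur, the predicted class depends on the library's tie-breaking behaviour rather than on the data, which is why avoiding ties is preferable.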
KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced ML techniques | Tree-based models, ensemble methods, KNN, SVM |
| K4.4 | Trade-offs in selecting algorithms | Comparing parametric vs non-parametric approaches |
| S4 | ML and optimisation | Hyperparameter tuning, ensemble construction, model selection |
| B1 | Curiosity and creativity | Exploring when non-parametric methods outperform parametric ones |
| B5 | Integrity in presenting conclusions | Avoiding overfitting; honest reporting of generalisation performance |