k-Nearest Neighbours (k-NN)¶

k-NN is a simple, instance-based algorithm: to classify a new observation, it finds the \(k\) closest training points and takes a majority vote.

How It Works¶

Store the entire training dataset in memory (no explicit "training" step).
When a new observation arrives, calculate its distance to every training point.
Select the \(k\) nearest neighbours.
Return the majority class (classification) or the mean value (regression).

There are no learned parameters — the model is the data. This makes k-NN a lazy learner.

Implementation¶

import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

df = sns.load_dataset("penguins").dropna()
X = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]]
y = df["species"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# CRITICAL: k-NN is distance-based — you MUST standardise features
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr_sc, y_tr)
print(classification_report(y_te, knn.predict(X_te_sc)))

Key Considerations¶

Consideration	Detail
Feature scaling	Mandatory — features on larger scales dominate the distance calculation
Curse of dimensionality	Performance degrades rapidly as the number of features grows
Prediction speed	Slow on large datasets (must compute distance to every training point)
No feature importance	k-NN provides no insight into which features drive predictions

Choosing \(k\)¶

Small \(k\) (e.g., 1–3): Highly sensitive to noise, risk of overfitting.
Large \(k\) (e.g., 50+): Over-smoothed boundaries, risk of underfitting.
Use cross-validation to find the optimal value (see How to Choose k).

Common Pitfall

Forgetting to standardise features before fitting k-NN is the single most common mistake. Without scaling, a feature measured in thousands (e.g., salary) will completely dominate one measured in decimals (e.g., GPA).

KSB Mapping¶

KSB	Description	How This Addresses It
K4.2	Advanced ML techniques	Tree-based models, ensemble methods, KNN, SVM
K4.4	Trade-offs in selecting algorithms	Comparing parametric vs non-parametric approaches
S4	ML and optimisation	Hyperparameter tuning, ensemble construction, model selection
B1	Curiosity and creativity	Exploring when non-parametric methods outperform parametric ones
B5	Integrity in presenting conclusions	Avoiding overfitting; honest reporting of generalisation performance