k-Means Clustering¶

The foundational partitioning algorithm that assigns each data point to the nearest of \(k\) centroids, then iteratively refines centroid positions until convergence.

How It Works¶

Randomly initialise \(k\) centroids.
Assign each point to its nearest centroid (Euclidean distance).
Update each centroid to the mean of its assigned points.
Repeat steps 2–3 until centroids stop moving (convergence).

Implementation¶

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic clustered data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Fit k-Means
km = KMeans(n_clusters=4, random_state=42, n_init="auto")
labels = km.fit_predict(X)

# Visualise
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="Set2", edgecolor="k", s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="Centroids")
plt.title("k-Means Clustering (k=4)")
plt.legend()
plt.tight_layout()
plt.show()

Key Properties¶

Property	Detail
Cluster shape	Assumes spherical, equally-sized clusters
Requires \(k\)	You must specify the number of clusters in advance
Sensitivity	Sensitive to initialisation — use `n_init="auto"` for multiple runs
Feature scaling	Mandatory — Euclidean distance is scale-dependent
Convergence	Always converges, but may find a local minimum

Limitations¶

Cannot detect clusters of arbitrary shape (use DBSCAN for that).
Outliers pull centroids away from true cluster centres.
All clusters tend towards equal size due to Voronoi partitioning.

Workplace Tip

k-Means is your first port of call for clustering. It is fast, well-understood, and works well when clusters are roughly spherical. Always standardise features first with StandardScaler.

KSB Mapping¶

KSB	Description	How This Addresses It
K4.2	Advanced analytics and ML techniques	Unsupervised learning algorithms for pattern discovery
K4.4	Trade-offs in selecting algorithms	Choosing between clustering approaches based on data characteristics
S1	Scientific methods and hypothesis testing	Validating cluster quality without ground truth labels
S4	Analysis and models to inform outcomes	Using clustering to derive actionable segments
B1	Inquisitive approach	Exploring hidden structure in unlabelled data