Hierarchical Clustering¶

Bottom-up agglomerative clustering builds a tree (dendrogram) by iteratively merging the two closest clusters until only one remains.

How It Works¶

Start with each data point as its own cluster.
Find the two closest clusters and merge them.
Repeat until all points belong to a single cluster.
Cut the dendrogram at the desired height to obtain your clusters.

The linkage criterion determines how "closeness" between clusters is measured:

Linkage	Measures Distance Between
`ward`	Minimises variance increase when merging (default, produces compact clusters)
`complete`	Farthest points in each cluster
`average`	Mean distance between all point pairs
`single`	Nearest points in each cluster (prone to chaining)

Implementation¶

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.8, random_state=42)

# 1. Plot the Dendrogram
Z = linkage(X, method="ward")
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Dendrogram (Ward Linkage)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()

# 2. Fit Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="Set2", edgecolor="k", s=40)
plt.title("Agglomerative Clustering (3 Clusters)")
plt.tight_layout()
plt.show()

When to Use Hierarchical Clustering¶

When you want a visual overview of how clusters relate (dendrogram).
When you do not know the optimal number of clusters in advance — the dendrogram helps you decide.
When cluster sizes are small to medium (hierarchical clustering scales as \(O(n^2)\) in memory).

Common Pitfall

Hierarchical clustering does not scale well. For datasets larger than ~10,000 points, prefer k-Means or DBSCAN.

KSB Mapping¶

KSB	Description	How This Addresses It
K4.2	Advanced analytics and ML techniques	Unsupervised learning algorithms for pattern discovery
K4.4	Trade-offs in selecting algorithms	Choosing between clustering approaches based on data characteristics
S1	Scientific methods and hypothesis testing	Validating cluster quality without ground truth labels
S4	Analysis and models to inform outcomes	Using clustering to derive actionable segments
B1	Inquisitive approach	Exploring hidden structure in unlabelled data