DBSCAN & Density-Based Methods¶
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) finds clusters of arbitrary shape by grouping points in dense regions and labelling sparse points as noise.
How It Works¶
DBSCAN requires two parameters:
| Parameter | Meaning |
|---|---|
eps |
Maximum distance between two points to be considered neighbours |
min_samples |
Minimum number of points required to form a dense region (core point) |
The algorithm classifies each point as:
- Core point: Has at least
min_samplesneighbours withineps. - Border point: Within
epsof a core point but has fewer thanmin_samplesneighbours. - Noise point: Neither core nor border — labelled as
-1.
Implementation¶
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="Set2", edgecolor="k", s=30)
plt.title(f"DBSCAN — {len(set(labels) - {-1})} clusters, {(labels == -1).sum()} noise points")
plt.tight_layout()
plt.show()
Advantages Over k-Means¶
| Feature | k-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical only | Arbitrary shapes |
| Requires \(k\) | Yes | No — discovers clusters automatically |
| Noise handling | None — assigns every point | Labels outliers as -1 |
| Density variation | Cannot handle | Struggles with varying densities |
Tuning eps¶
Use a k-distance plot to estimate eps:
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)
distances, _ = nn.kneighbors(X)
# Sort the distance to the 5th nearest neighbour
sorted_distances = np.sort(distances[:, -1])
plt.figure(figsize=(8, 4))
plt.plot(sorted_distances)
plt.xlabel("Points (sorted)")
plt.ylabel("5th Nearest Neighbour Distance")
plt.title("k-Distance Plot — Look for the Elbow")
plt.tight_layout()
plt.show()
Workplace Tip
DBSCAN excels when your data has natural clusters of irregular shape (e.g., geographic regions, network communities). If clusters have very different densities, consider HDBSCAN instead.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced analytics and ML techniques | Unsupervised learning algorithms for pattern discovery |
| K4.4 | Trade-offs in selecting algorithms | Choosing between clustering approaches based on data characteristics |
| S1 | Scientific methods and hypothesis testing | Validating cluster quality without ground truth labels |
| S4 | Analysis and models to inform outcomes | Using clustering to derive actionable segments |
| B1 | Inquisitive approach | Exploring hidden structure in unlabelled data |