Skip to content

DBSCAN & Density-Based Methods

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) finds clusters of arbitrary shape by grouping points in dense regions and labelling sparse points as noise.

How It Works

DBSCAN requires two parameters:

Parameter Meaning
eps Maximum distance between two points to be considered neighbours
min_samples Minimum number of points required to form a dense region (core point)

The algorithm classifies each point as:

  • Core point: Has at least min_samples neighbours within eps.
  • Border point: Within eps of a core point but has fewer than min_samples neighbours.
  • Noise point: Neither core nor border — labelled as -1.

Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="Set2", edgecolor="k", s=30)
plt.title(f"DBSCAN — {len(set(labels) - {-1})} clusters, {(labels == -1).sum()} noise points")
plt.tight_layout()
plt.show()

Advantages Over k-Means

Feature k-Means DBSCAN
Cluster shape Spherical only Arbitrary shapes
Requires \(k\) Yes No — discovers clusters automatically
Noise handling None — assigns every point Labels outliers as -1
Density variation Cannot handle Struggles with varying densities

Tuning eps

Use a k-distance plot to estimate eps:

from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)
distances, _ = nn.kneighbors(X)

# Sort the distance to the 5th nearest neighbour
sorted_distances = np.sort(distances[:, -1])

plt.figure(figsize=(8, 4))
plt.plot(sorted_distances)
plt.xlabel("Points (sorted)")
plt.ylabel("5th Nearest Neighbour Distance")
plt.title("k-Distance Plot — Look for the Elbow")
plt.tight_layout()
plt.show()

Workplace Tip

DBSCAN excels when your data has natural clusters of irregular shape (e.g., geographic regions, network communities). If clusters have very different densities, consider HDBSCAN instead.

KSB Mapping

KSB Description How This Addresses It
K4.2 Advanced analytics and ML techniques Unsupervised learning algorithms for pattern discovery
K4.4 Trade-offs in selecting algorithms Choosing between clustering approaches based on data characteristics
S1 Scientific methods and hypothesis testing Validating cluster quality without ground truth labels
S4 Analysis and models to inform outcomes Using clustering to derive actionable segments
B1 Inquisitive approach Exploring hidden structure in unlabelled data