Cluster Mixed Data Types

When a dataset mixes numerical and categorical columns, standard distance metrics such as Euclidean distance no longer apply directly, because they are defined only for numeric values. Gower's Distance handles both types in a single calculation.

The Problem

k-Means uses Euclidean distance, which is undefined for categorical features. You cannot simply label-encode categories and treat them as numbers — the numeric distances between arbitrary codes are meaningless.
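The point about arbitrary codes can be illustrated with a minimal sketch. The two code mappings and the third region "East" are hypothetical, introduced only for illustration; any label encoder that assigns integers would show the same effect.

```python
def encoded_distance(codes, a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(codes[a] - codes[b])

# Two equally valid (hypothetical) encodings of the same nominal feature.
alphabetical = {"East": 0, "North": 1, "South": 2}
reordered = {"North": 0, "South": 1, "East": 2}

for name, codes in [("alphabetical", alphabetical), ("reordered", reordered)]:
    print(
        name,
        "East-North:", encoded_distance(codes, "East", "North"),
        "East-South:", encoded_distance(codes, "East", "South"),
    )
```

Under the first encoding East looks closer to North; under the second it looks closer to South. Nothing about the regions changed, only the codes, so any distance built on them reflects the encoding and not the data.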

Gower's Distance

For each pair of records, Gower's Distance computes a normalised dissimilarity per feature, using a rule that depends on the feature's type:

Numerical: Absolute difference divided by the feature's range (max minus min), giving a value between 0 and 1.
Categorical: Binary (0 if same category, 1 if different).

The overall distance is the weighted average of these per-feature dissimilarities. With equal weights it is a simple mean, so it always lies between 0 and 1.
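The rules above can be worked through by hand. The following is a minimal pure-Python sketch of the definition as described here, not the gower package's implementation; the helper name `gower_distance` and the `weights` argument are assumptions made for illustration, and the sketch assumes no numeric column is constant (a zero range would divide by zero).

```python
# Same five records as the example dataset in this section.
rows = [
    {"age": 25, "income": 30000, "region": "North", "membership": "Gold"},
    {"age": 35, "income": 50000, "region": "South", "membership": "Silver"},
    {"age": 45, "income": 70000, "region": "North", "membership": "Gold"},
    {"age": 30, "income": 40000, "region": "South", "membership": "Bronze"},
    {"age": 50, "income": 80000, "region": "North", "membership": "Gold"},
]
numeric = ["age", "income"]
categorical = ["region", "membership"]

# Range (max - min) of each numeric feature over the whole dataset.
ranges = {
    col: max(r[col] for r in rows) - min(r[col] for r in rows)
    for col in numeric
}

def gower_distance(a, b, weights=None):
    """Weighted average of per-feature dissimilarities between two records."""
    if weights is None:
        weights = {col: 1.0 for col in numeric + categorical}
    total = 0.0
    for col in numeric:
        # Range-normalised absolute difference, in [0, 1].
        total += weights[col] * abs(a[col] - b[col]) / ranges[col]
    for col in categorical:
        # 0 if the categories match, 1 otherwise.
        total += weights[col] * (0.0 if a[col] == b[col] else 1.0)
    return total / sum(weights.values())

print(round(gower_distance(rows[0], rows[1]), 3))  # 0.7
print(round(gower_distance(rows[0], rows[2]), 3))  # 0.4
```

For the first pair, age and income each contribute 0.4 and both categorical features differ, so the mean is (0.4 + 0.4 + 1 + 1) / 4 = 0.7. For the second pair, the numeric features each contribute 0.8 and both categories match, giving (0.8 + 0.8 + 0 + 0) / 4 = 0.4.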

import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# pip install gower
import gower

df = pd.DataFrame({
    "age": [25, 35, 45, 30, 50],
    "income": [30000, 50000, 70000, 40000, 80000],
    "region": ["North", "South", "North", "South", "North"],
    "membership": ["Gold", "Silver", "Gold", "Bronze", "Gold"]
})

# Compute Gower distance matrix
dist_matrix = gower.gower_matrix(df)
print(f"Distance matrix shape: {dist_matrix.shape}")

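# Cluster using the precomputed distance matrix.
# Note: `metric` requires scikit-learn >= 1.2 (earlier versions call it `affinity`).
# Ward linkage only works with Euclidean distances, hence "average" here.
model = AgglomerativeClustering(
    n_clusters=2,
    metric="precomputed",
    linkage="average"
)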
df["cluster"] = model.fit_predict(dist_matrix)
print(df)
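One way to sanity-check cluster labels like these without ground truth is the silhouette score, which scikit-learn can compute from a precomputed distance matrix. The sketch below uses a small hand-written matrix and labels as stand-ins; both are hypothetical values chosen for illustration, not output from the example above.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical symmetric distance matrix with a zero diagonal:
# records 0-1 are close to each other, as are records 2-3.
dist = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])
labels = np.array([0, 0, 1, 1])  # hypothetical cluster assignment

# metric="precomputed" tells scikit-learn that `dist` already holds distances.
score = silhouette_score(dist, labels, metric="precomputed")
print(f"Silhouette score: {score:.3f}")
```

Scores range from -1 to 1, with higher values indicating tighter, better-separated clusters. In practice the same call could be given a Gower matrix and the labels returned by `fit_predict`, and repeated for several values of `n_clusters` to compare them.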

Workplace Tip

A full Gower matrix needs memory proportional to the square of the number of rows. For large mixed-type datasets, consider k-Prototypes from the kmodes library, which extends k-Means to handle mixed data directly without computing that matrix.
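A rough back-of-the-envelope estimate shows why the full distance matrix becomes the bottleneck. The helper `dense_matrix_gib` is hypothetical, and the default of 4 bytes per entry is an assumption (32-bit floats); a 64-bit matrix would need twice as much.

```python
def dense_matrix_gib(n_rows, bytes_per_entry=4):
    """Approximate size in GiB of a dense n_rows x n_rows distance matrix."""
    return n_rows * n_rows * bytes_per_entry / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} rows -> {dense_matrix_gib(n):8.3f} GiB")
```

Memory grows with the square of the row count: ten times more rows means a hundred times more memory, so 100,000 rows already needs roughly 37 GiB under this assumption.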

KSB Mapping

KSB	Description	How This Addresses It
K4.2	Advanced analytics and ML techniques	Unsupervised learning algorithms for pattern discovery
K4.4	Trade-offs in selecting algorithms	Choosing between clustering approaches based on data characteristics
S1	Scientific methods and hypothesis testing	Validating cluster quality without ground truth labels
S4	Analysis and models to inform outcomes	Using clustering to derive actionable segments
B1	Inquisitive approach	Exploring hidden structure in unlabelled data