Clustering
Grouping data points by similarity without labeled examples
Clustering is an unsupervised learning task that groups data points so that points within the same group are more similar to each other than to points in other groups. Unlike classification, there are no labels—the algorithm discovers structure in unlabeled data.
What Is Clustering
Clustering algorithms partition data into groups (clusters) based on similarity. The definition of similarity depends on the distance metric and the clustering method, but the goal is always to uncover natural groupings in the data.
Common use cases:
- Customer segmentation for marketing
- Anomaly detection by identifying outliers
- Data exploration and visualization
- Document or image organization
- Compression by representing groups with centroids
Types of Clustering
Partitioning Methods
Divide data into K non-overlapping groups. Each point belongs to exactly one cluster.
Examples: K-Means, K-Medoids
Characteristics: Fast and scalable, but require specifying the number of clusters upfront. Work best with spherical clusters of similar size.
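A minimal sketch of a partitioning method using scikit-learn (assuming it is available); the synthetic data and the choice of K=4 are illustrative assumptions, not part of the original text:

```python
# Illustrative K-Means example; the toy dataset and K value are assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy 2-D data

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # K must be chosen upfront
labels = kmeans.fit_predict(X)          # hard assignment: one cluster per point
centroids = kmeans.cluster_centers_     # each cluster represented by its centroid
print(labels[:10], kmeans.inertia_)     # inertia = within-cluster sum of squares
```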
Hierarchical Methods
Build a tree of clusters by either merging smaller clusters (agglomerative) or splitting larger ones (divisive). Produces a dendrogram showing relationships at multiple resolutions.
Characteristics: No need to specify K upfront, explores structure at multiple scales. Computationally expensive, doesn't scale well to large datasets.
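A small agglomerative (bottom-up) sketch using SciPy; the synthetic data and the decision to cut the tree into three clusters are illustrative assumptions:

```python
# Agglomerative clustering sketch; data and cluster count are assumptions.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)  # small data: hierarchical methods scale poorly

Z = linkage(X, method="ward")                     # merge tree over all points
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
# scipy.cluster.hierarchy.dendrogram(Z) would plot the full tree (needs matplotlib)
print(labels)
```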
Density-Based Methods
Find regions of high point density separated by low-density regions. Can discover arbitrary-shaped clusters and identify outliers as noise.
Examples: DBSCAN, HDBSCAN
Characteristics: Handles non-spherical clusters well and automatically identifies outliers. Requires density parameters; DBSCAN in particular struggles when clusters have very different densities (HDBSCAN relaxes this).
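A density-based sketch with scikit-learn's DBSCAN; the two-moons dataset and the eps/min_samples values are illustrative assumptions and would normally need tuning:

```python
# DBSCAN example on non-spherical data; parameters are illustrative.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # arbitrary-shaped clusters

db = DBSCAN(eps=0.3, min_samples=5)   # density parameters
labels = db.fit_predict(X)            # label -1 marks points treated as noise
print(set(labels))                    # e.g. {0, 1, -1}
```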
Model-Based Methods
Assume data is generated from a mixture of probability distributions. Each cluster is modeled as a distribution.
Examples: Gaussian Mixture Models (GMM)
Characteristics: Soft clustering (probabilistic membership), handles different shapes and variances. More complex, sensitive to initialization and overfitting.
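A model-based sketch with scikit-learn's GaussianMixture; the number of components and covariance type are illustrative assumptions:

```python
# Gaussian Mixture Model example; component count and data are assumptions.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1)
gmm.fit(X)
hard_labels = gmm.predict(X)  # most likely component for each point
print(gmm.means_)             # each cluster modeled as a Gaussian with its own mean and covariance
```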
Hard vs Soft Clustering
Hard clustering: Each point belongs to exactly one cluster. K-Means is a hard clustering method. Simpler to interpret and faster.
Soft clustering: Points have probabilities of belonging to multiple clusters. GMMs are soft clustering methods. Handles uncertainty at cluster boundaries better.
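To make the contrast concrete, a short sketch comparing hard and soft assignments on the same data (the dataset and model choices are illustrative assumptions):

```python
# Hard vs soft assignments; data and models are illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=0)

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # one label per point
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)   # probability of membership in each cluster

print(hard[:3])               # e.g. [0 1 0]
print(soft[:3].round(2))      # e.g. [[0.93 0.07] ...] shows boundary uncertainty
```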
Evaluation
Clustering evaluation is inherently difficult because there's no single correct answer. Quality depends on the goal.
Internal Metrics
Measure cluster quality using only the data and clustering result. Don't require external labels.
Inertia/Within-Cluster Sum of Squares: Measures cluster compactness. Lower is better, but it always decreases as the number of clusters grows, so it cannot pick K on its own; the elbow method looks for the point of diminishing returns.
Silhouette Score: Compares how similar each point is to its own cluster vs other clusters. For a point i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to other points in the same cluster and b(i) is the mean distance to points in the nearest neighboring cluster. Ranges from -1 to 1; higher is better.
Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances. Lower is better.
Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance. Higher is better.
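These internal metrics are straightforward to compute with scikit-learn; the clustering being scored here (K-Means with K=4 on toy data) is an illustrative assumption:

```python
# Internal metrics for a clustering result; the clustering itself is illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))                # higher is better, in [-1, 1]
print("davies-bouldin:", davies_bouldin_score(X, labels))        # lower is better
print("calinski-harabasz:", calinski_harabasz_score(X, labels))  # higher is better
```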
External Validation
When labels or outcomes are available (not used during clustering), compare clustering results to external information.
Examples:
- Check if customer clusters have different churn rates
- See if document clusters correspond to topics
- Validate clusters against known categories
Adjusted Rand Index (ARI): Measures similarity between two clusterings while correcting for chance. A value of 1 means perfect agreement, values near 0 mean agreement no better than random, and negative values mean worse than random.
Normalized Mutual Information (NMI): Measures information shared between clusterings. Ranges from 0 to 1.
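Both scores are available in scikit-learn; the label vectors below are made-up illustrations, and note that cluster IDs need not match the external label IDs:

```python
# External validation with ARI and NMI; the labels here are illustrative.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]      # external categories, not used during clustering
cluster_labels = [1, 1, 0, 0, 2, 2]   # same grouping, different cluster IDs

print(adjusted_rand_score(true_labels, cluster_labels))           # 1.0: perfect agreement
print(normalized_mutual_info_score(true_labels, cluster_labels))  # 1.0: clusterings share all information
```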
Qualitative Evaluation
Often the most important assessment. Are clusters:
- Interpretable and meaningful?
- Actionable for downstream tasks?
- Aligned with business goals?
- Stable across different runs?
A mathematically optimal clustering may be less useful than one aligned with domain knowledge and business needs.
Choosing a Clustering Method
K-Means: Fast, scalable. Use when clusters are roughly spherical, similar in size, and you need speed. Requires specifying K.
Hierarchical: Exploratory analysis at multiple resolutions. Use for small to medium datasets when you want to understand structure at different scales.
DBSCAN: Arbitrary-shaped clusters, automatic outlier detection. Use when clusters have complex shapes or the data contains noise; for clusters of varying density, consider HDBSCAN. Requires tuning density parameters (eps and min_samples).
GMM: Soft clustering with probabilistic interpretation. Use when you need cluster probabilities or clusters have different shapes and variances.
Practical Considerations
Feature scaling: Most clustering methods rely on distance. Scale features so no single feature dominates due to magnitude.
Dimensionality reduction: High-dimensional data often benefits from PCA or other reduction before clustering. Curse of dimensionality makes distances less meaningful.
Number of clusters: Use elbow method, silhouette score, gap statistic, or domain knowledge to choose K for partitioning methods.
Initialization: Multiple random starts help avoid poor local minima. K-Means++ improves initialization.
Outliers: Remove or handle outliers before clustering, as they can distort results significantly.
Validation: Always validate clusters qualitatively. Do they make sense? Are they useful?
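Pulling several of these considerations together, here is a sketch of a preprocessing-plus-clustering pipeline: scaling, PCA, k-means++ with multiple restarts, and a silhouette sweep to choose K. All parameter values and the synthetic data are illustrative assumptions.

```python
# Scaling + PCA + K-Means with a silhouette sweep over K; values are illustrative.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, n_features=20, random_state=7)

X_scaled = StandardScaler().fit_transform(X)               # no feature dominates by magnitude
X_reduced = PCA(n_components=10).fit_transform(X_scaled)   # reduce dimensionality before clustering

for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=7)
    labels = km.fit_predict(X_reduced)
    print(k, round(silhouette_score(X_reduced, labels), 3))  # pick K with the best score
```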
Clustering is exploratory—there's rarely one correct answer. The best clustering depends on what you're trying to learn or achieve.