Clustering
Grouping data points by similarity without labeled examples
Clustering is an unsupervised learning task that groups data points so that points within the same group are more similar to each other than to points in other groups. Unlike classification, there are no labels—the algorithm discovers structure in unlabeled data.
What Is Clustering
Clustering algorithms partition data into groups (clusters) based on similarity. The definition of similarity depends on the distance metric and the clustering method, but the goal is always to uncover natural groupings in the data.
Common use cases:
- Customer segmentation for marketing
- Anomaly detection by identifying outliers
- Data exploration and visualization
- Document or image organization
- Compression by representing groups with centroids
Types of Clustering
Partitioning Methods
Divide data into K non-overlapping groups. Each point belongs to exactly one cluster.
Examples: K-Means, K-Medoids
Characteristics: Fast and scalable, but require specifying the number of clusters upfront. Work best with spherical clusters of similar size.
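A minimal sketch of a partitioning method using scikit-learn (assuming it is available); the synthetic data and the choice of K=4 are illustrative assumptions, not part of the original text:

```python
# Illustrative K-Means example; the toy dataset and K value are assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy 2-D data

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # K must be chosen upfront
labels = kmeans.fit_predict(X)          # hard assignment: one cluster per point
centroids = kmeans.cluster_centers_     # each cluster represented by its centroid
print(labels[:10], kmeans.inertia_)     # inertia = within-cluster sum of squares
```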
Hierarchical Methods
Build a tree of clusters by either merging smaller clusters (agglomerative) or splitting larger ones (divisive). Produces a dendrogram showing relationships at multiple resolutions.
Characteristics: No need to specify K upfront, explores structure at multiple scales. Computationally expensive, doesn't scale well to large datasets.
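A small agglomerative (bottom-up) sketch using SciPy; the synthetic data and the decision to cut the tree into three clusters are illustrative assumptions:

```python
# Agglomerative clustering sketch; data and cluster count are assumptions.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)  # small data: hierarchical methods scale poorly

Z = linkage(X, method="ward")                     # merge tree over all points
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
# scipy.cluster.hierarchy.dendrogram(Z) would plot the full tree (needs matplotlib)
print(labels)
```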
Density-Based Methods
Find regions of high point density separated by low-density regions. Can discover arbitrary-shaped clusters and identify outliers as noise.
Examples: DBSCAN, HDBSCAN
Characteristics: Handles non-spherical clusters well and automatically identifies outliers. Requires density parameters; DBSCAN in particular struggles when clusters have very different densities (HDBSCAN relaxes this).
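A density-based sketch with scikit-learn's DBSCAN; the two-moons dataset and the eps/min_samples values are illustrative assumptions and would normally need tuning:

```python
# DBSCAN example on non-spherical data; parameters are illustrative.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # arbitrary-shaped clusters

db = DBSCAN(eps=0.3, min_samples=5)   # density parameters
labels = db.fit_predict(X)            # label -1 marks points treated as noise
print(set(labels))                    # e.g. {0, 1, -1}
```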
Model-Based Methods
Assume data is generated from a mixture of probability distributions. Each cluster is modeled as a distribution.
Examples: Gaussian Mixture Models (GMM)
Characteristics: Soft clustering (probabilistic membership), handles different shapes and variances. More complex, sensitive to initialization and overfitting.
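A model-based sketch with scikit-learn's GaussianMixture; the number of components and covariance type are illustrative assumptions:

```python
# Gaussian Mixture Model example; component count and data are assumptions.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1)
gmm.fit(X)
hard_labels = gmm.predict(X)  # most likely component for each point
print(gmm.means_)             # each cluster modeled as a Gaussian with its own mean and covariance
```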
Hard vs Soft Clustering
Hard clustering: Each point belongs to exactly one cluster. K-Means is a hard clustering method. Simpler to interpret and faster.
Soft clustering: Points have probabilities of belonging to multiple clusters. GMMs are soft clustering methods. Handles uncertainty at cluster boundaries better.
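To make the contrast concrete, a short sketch comparing hard and soft assignments on the same data (the dataset and model choices are illustrative assumptions):

```python
# Hard vs soft assignments; data and models are illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=0)

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # one label per point
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)   # probability of membership in each cluster

print(hard[:3])               # e.g. [0 1 0]
print(soft[:3].round(2))      # e.g. [[0.93 0.07] ...] shows boundary uncertainty
```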
Evaluation
Clustering evaluation is inherently difficult because there's no single correct answer. Quality depends on the goal.
Internal Metrics
Measure cluster quality using only the data and clustering result. Don't require external labels.
Inertia/Within-Cluster Sum of Squares: Measures cluster compactness. Lower is better, but it always decreases as the number of clusters grows, so it cannot pick K on its own; the elbow method looks for the point of diminishing returns.
Silhouette Score: Compares how similar each point is to its own cluster vs other clusters. For a point i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to other points in the same cluster and b(i) is the mean distance to points in the nearest neighboring cluster. Ranges from -1 to 1; higher is better.
Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances. Lower is better.
Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance. Higher is better.
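These internal metrics are straightforward to compute with scikit-learn; the clustering being scored here (K-Means with K=4 on toy data) is an illustrative assumption:

```python
# Internal metrics for a clustering result; the clustering itself is illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))                # higher is better, in [-1, 1]
print("davies-bouldin:", davies_bouldin_score(X, labels))        # lower is better
print("calinski-harabasz:", calinski_harabasz_score(X, labels))  # higher is better
```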
External Validation
When labels or outcomes are available (not used during clustering), compare clustering results to external information.
Examples:
- Check if customer clusters have different churn rates
- See if document clusters correspond to topics
- Validate clusters against known categories
Adjusted Rand Index (ARI): Measures similarity between two clusterings while correcting for chance. A value of 1 means perfect agreement, values near 0 mean agreement no better than random, and negative values mean worse than random.
Normalized Mutual Information (NMI): Measures information shared between clusterings. Ranges from 0 to 1.
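Both scores are available in scikit-learn; the label vectors below are made-up illustrations, and note that cluster IDs need not match the external label IDs:

```python
# External validation with ARI and NMI; the labels here are illustrative.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]      # external categories, not used during clustering
cluster_labels = [1, 1, 0, 0, 2, 2]   # same grouping, different cluster IDs

print(adjusted_rand_score(true_labels, cluster_labels))           # 1.0: perfect agreement
print(normalized_mutual_info_score(true_labels, cluster_labels))  # 1.0: clusterings share all information
```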
Qualitative Evaluation
Often the most important assessment. Are clusters:
- Interpretable and meaningful?
- Actionable for downstream tasks?
- Aligned with business goals?
- Stable across different runs?
A mathematically optimal clustering may be less useful than one aligned with domain knowledge and business needs.
Choosing a Clustering Method
K-Means: Fast, scalable. Use when clusters are roughly spherical, similar in size, and you need speed. Requires specifying K.
Hierarchical: Exploratory analysis at multiple resolutions. Use for small to medium datasets when you want to understand structure at different scales.
DBSCAN: Arbitrary-shaped clusters, automatic outlier detection. Use when clusters have complex shapes or the data contains noise; for clusters of varying density, consider HDBSCAN. Requires tuning density parameters (eps and min_samples).
GMM: Soft clustering with probabilistic interpretation. Use when you need cluster probabilities or clusters have different shapes and variances.
Practical Considerations
Feature scaling: Most clustering methods rely on distance. Scale features so no single feature dominates due to magnitude.
Dimensionality reduction: High-dimensional data often benefits from PCA or other reduction before clustering. Curse of dimensionality makes distances less meaningful.
Number of clusters: Use elbow method, silhouette score, gap statistic, or domain knowledge to choose K for partitioning methods.
Initialization: Multiple random starts help avoid poor local minima. K-Means++ improves initialization.
Outliers: Remove or handle outliers before clustering, as they can distort results significantly.
Validation: Always validate clusters qualitatively. Do they make sense? Are they useful?
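Pulling several of these considerations together, here is a sketch of a preprocessing-plus-clustering pipeline: scaling, PCA, k-means++ with multiple restarts, and a silhouette sweep to choose K. All parameter values and the synthetic data are illustrative assumptions.

```python
# Scaling + PCA + K-Means with a silhouette sweep over K; values are illustrative.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, n_features=20, random_state=7)

X_scaled = StandardScaler().fit_transform(X)               # no feature dominates by magnitude
X_reduced = PCA(n_components=10).fit_transform(X_scaled)   # reduce dimensionality before clustering

for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=7)
    labels = km.fit_predict(X_reduced)
    print(k, round(silhouette_score(X_reduced, labels), 3))  # pick K with the best score
```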
Clustering is exploratory—there's rarely one correct answer. The best clustering depends on what you're trying to learn or achieve.