
K-Means Clustering

Partitioning data into K groups by distance to cluster centers

K-Means clustering groups data points into K clusters by iteratively assigning points to the nearest cluster center and updating centers based on assigned points. It's simple, fast, and works well when clusters are roughly spherical and well-separated.

How K-Means Works

K-Means alternates between two steps until convergence:

1. Assignment: Assign each point to the nearest centroid based on Euclidean distance.

2. Update: Recompute each centroid as the mean of all points assigned to it.

The algorithm starts with K randomly placed centroids and repeats these steps until assignments stop changing. No iteration can increase the within-cluster sum of squares (inertia), and there are only finitely many possible assignments, so the algorithm is guaranteed to converge, though only to a local optimum.
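
A minimal NumPy sketch of this loop, assuming a numeric data matrix `X` of shape (n_samples, n_features); names and defaults are illustrative rather than any particular library's implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain two-step K-Means: alternate assignment and update until stable."""
    rng = np.random.default_rng(seed)
    # Start from K centroids chosen at random among the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids (and hence assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```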

Example: Clustering customers by session length and purchase frequency. K-Means finds K representative customer profiles (centroids) and assigns each customer to the closest profile.

The Objective Function

K-Means minimizes inertia—the total squared distance from each point to its assigned centroid:

\text{Inertia} = \sum_{i=1}^{n} \min_{j=1,\dots,K} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2

Where xᵢ are data points and μⱼ are centroids.

Squaring distances penalizes far-away points heavily, which is why outliers can significantly distort K-Means results. The mean minimizes squared Euclidean distance, which is why centroids are updated by averaging—this is optimal for the objective.
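
As a quick reference, the objective can be computed directly from a fitted clustering (a small sketch; `X`, `centroids`, and `labels` are assumed to come from a run like the one above):

```python
import numpy as np

def inertia(X, centroids, labels):
    # Sum of squared Euclidean distances from each point to its assigned centroid.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))
```

In scikit-learn, the same quantity is exposed as the fitted estimator's `inertia_` attribute.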

Choosing K

K-Means requires specifying K upfront. Several methods help choose K:

Elbow Method: Plot inertia vs K. Look for the "elbow" where adding more clusters gives diminishing returns. The curve drops steeply at first, then flattens. The elbow suggests a reasonable K.

Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters. Values range from -1 to 1. Higher average silhouette score indicates better-defined clusters. Compute for different K and look for peaks.

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where a(i) is mean distance to points in the same cluster, b(i) is mean distance to points in the nearest neighboring cluster.
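
Both the elbow method and the silhouette score come out of the same sweep over candidate values of K. A sketch using scikit-learn, assuming `X` is an already preprocessed numeric matrix:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(X, k_values=range(2, 11)):
    results = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results.append({
            "k": k,
            "inertia": km.inertia_,                         # plot vs. k for the elbow
            "silhouette": silhouette_score(X, km.labels_),  # look for the peak
        })
    return results
```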

Domain Knowledge: Often the most important factor. Business constraints may dictate K (e.g., "we can support 5 marketing segments"). Inspect clusters qualitatively to ensure they're interpretable and actionable.

Initialization: K-Means++

Random initialization can lead to poor local minima. If two centroids start close together, they compete for the same region while another region has no nearby centroid.

K-Means++ addresses this by spreading out initial centroids:

  1. Choose first centroid uniformly at random from data points
  2. For each subsequent centroid, choose from the remaining points with probability proportional to each point's squared distance from its nearest existing centroid
  3. Repeat until K centroids are placed

This makes it much more likely that each natural cluster gets an initial centroid, leading to faster convergence and better final results. K-Means++ is the default in most implementations.
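
A compact sketch of the seeding procedure itself, assuming a data matrix `X` (illustrative rather than the exact routine any particular library uses):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first centroid is a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Step 2: sample the next centroid with probability proportional to d^2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

In scikit-learn this seeding corresponds to `init="k-means++"`, which is the default for `KMeans`.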

Running K-Means multiple times with different initializations and keeping the result with lowest inertia is still recommended for stability.

Feature Scaling

K-Means relies entirely on Euclidean distance. Features on larger scales dominate distance calculations regardless of importance.

Example: Clustering users by session minutes (range 1-60) and lifetime purchases (range 0-10,000). Without scaling, purchase values completely overwhelm session length. Two users with very different behavior but similar spending appear similar.

Solution: Standardize features to zero mean and unit variance before clustering. This ensures all features contribute comparably to distance.
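
With scikit-learn this is conveniently done in a pipeline, so scaling and clustering are always applied together (a sketch; `X` is an assumed numeric feature matrix such as the session/purchase example above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize to zero mean / unit variance, then cluster, so features on very
# different scales (minutes vs. currency) contribute comparably to distance.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=5, n_init=10, random_state=0))
labels = model.fit_predict(X)
```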

The Curse of Dimensionality

K-Means struggles in high dimensions because:

  • Small differences across many dimensions accumulate into large total distances
  • All points become far apart
  • All distances become similar in magnitude
  • "Nearest" centroids aren't meaningfully near

Mitigation: Apply dimensionality reduction (PCA, feature selection) before K-Means. Reducing to 20-50 dimensions often dramatically improves results.
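
One way to wire up this mitigation (a sketch; the choice of 30 components and 8 clusters is arbitrary, and `X` is assumed to have more than 30 numeric features):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Project onto a moderate number of principal components before clustering
# so that Euclidean distances stay informative.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, random_state=0),
    KMeans(n_clusters=8, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```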

Assumptions and Limitations

K-Means works best when:

Spherical clusters: Clusters are compact and roughly spherical. K-Means uses one center per cluster and assigns points by straight-line distance, so elongated or irregularly shaped clusters don't fit this model.

Similar variance: Clusters have similar spread. Boundaries fall midway between centroids, so a cluster with much larger spread tends to lose its outer points to tighter neighbors.

Similar sizes: Very large clusters can dominate the objective, causing small clusters to be absorbed or split incorrectly.

Euclidean space: Features must be numeric and meaningfully combined under Euclidean distance.

No outliers: Outliers pull centroids away from dense regions due to squared distance.

When these assumptions fail, K-Means produces clusters that don't reflect meaningful structure. Consider density-based methods (DBSCAN) or hierarchical clustering instead.

Outliers and Robustness

Outliers have outsized influence because distance is squared. A single extreme point can pull a centroid far from the cluster's dense region.

Mitigation:

  • Remove or cap extreme values before clustering
  • Apply log transforms to compress large ranges
  • Use robust scaling (median, IQR) instead of mean/variance
  • Inspect centroid stability across multiple runs

If centroids jump dramatically or chase isolated points, outliers are likely distorting results.
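
A sketch combining two of these mitigations, a log transform for heavy-tailed features and robust scaling (assuming `X` is non-negative, e.g. counts or monetary values):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

# log1p compresses large ranges (and handles zeros); RobustScaler centers on the
# median and scales by the IQR, so extreme values pull less on the centroids.
X_transformed = RobustScaler().fit_transform(np.log1p(X))
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_transformed)
```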

Weighted K-Means

Standard K-Means gives all assigned points equal weight when computing centroids. Weighted K-Means assigns importance weights to points, often based on confidence or data quality.

This can reduce the impact of noisy or uncertain observations while emphasizing reliable data.
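
In scikit-learn, per-point weights are passed via the `sample_weight` argument of `fit` (a sketch; `confidence_weights` is a hypothetical array produced by some upstream quality or confidence estimate):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=4, n_init=10, random_state=0)
# Points with larger weights count more when the centroid means are computed;
# confidence_weights is assumed to hold one non-negative weight per row of X.
km.fit(X, sample_weight=confidence_weights)
labels = km.labels_
```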

Computational Complexity

Per iteration: O(n · K · d) where n is points, K is clusters, d is dimensions. Computing distances dominates runtime.

Total: Usually converges in relatively few iterations (often 10-50), making K-Means practical for large datasets.

Mini-Batch K-Means: For very large datasets, update centroids using random subsets (mini-batches) instead of all points. Trades slight accuracy loss for major speed gains.
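
In scikit-learn this variant is available as `MiniBatchKMeans` (a sketch; the batch size and cluster count are illustrative):

```python
from sklearn.cluster import MiniBatchKMeans

# Centroids are updated from random mini-batches rather than the full dataset,
# trading a small amount of accuracy for much lower runtime and memory use.
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=10, random_state=0)
labels = mbk.fit_predict(X)
```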

Evaluation

For general clustering evaluation metrics, see Clustering Evaluation.

K-Means specific evaluation focuses on:

  • Inertia: The within-cluster sum of squares that K-Means minimizes
  • Silhouette score: How well points fit their assigned cluster vs other clusters
  • Stability: Do multiple runs produce similar clusterings? (See the sketch below.)
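
A quick way to probe stability is to refit with different random seeds and measure how well the resulting labelings agree, for example with the adjusted Rand index (a sketch; `X` is the preprocessed feature matrix):

```python
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Fit the same model with several seeds and compare label agreement pairwise;
# scores near 1.0 indicate the clustering is stable across initializations.
runs = [KMeans(n_clusters=5, n_init=10, random_state=s).fit_predict(X) for s in range(5)]
scores = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
print(f"min ARI: {min(scores):.2f}, mean ARI: {sum(scores) / len(scores):.2f}")
```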

When to Use K-Means

K-Means works well when:

  • Clusters are roughly spherical and similar in size
  • Features are numeric and scaled appropriately
  • Dataset is large and scalability matters
  • You need fast, interpretable results
  • Clusters are well-separated

K-Means struggles when:

  • Clusters have irregular shapes or very different densities
  • Data has many outliers
  • Features are high-dimensional without reduction
  • Cluster sizes vary dramatically
  • Distance doesn't meaningfully capture similarity

K-Means is an excellent baseline for clustering problems. Start with K-Means, and if assumptions don't hold, consider alternatives.

