Documentation (English)

Clustering - K-Means

Discover natural groupings in the Iris flower dataset using K-Means clustering

This case study demonstrates K-Means clustering on the famous Iris flower dataset. K-Means is an unsupervised learning algorithm that partitions data into K distinct clusters by minimizing within-cluster variance. It's widely used for customer segmentation, pattern recognition, and data exploration.

Dataset: Iris Flowers

  • Source: Kaggle (Iris Species Dataset)
  • Type: Unsupervised clustering
  • Size: 150 samples
  • Features: Sepal length/width, Petal length/width (cm)
  • True Classes: 3 species (Setosa, Versicolor, Virginica)
  • Goal: Discover natural groupings without labels

Model Configuration

{
  "model": "kmeans",
  "category": "clustering",
  "model_config": {
    "n_clusters": 3,
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
    "algorithm": "lloyd"
  }
}

Clustering Results

Cluster Visualization (2D PCA Projection)

Three distinct clusters identified:

No plot data available
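A 2D view like the one intended here can be reproduced by projecting the four features onto their first two principal components before plotting. A sketch (the matplotlib calls are commented out so it runs headless):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project the 4-D feature space down to 2 components for plotting
X_2d = PCA(n_components=2).fit_transform(X)

# import matplotlib.pyplot as plt
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
# plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.show()
print(X_2d.shape)
```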

Elbow Method (Optimal K)

Determining the best number of clusters:

No plot data available
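The elbow curve behind this chart plots inertia (within-cluster sum of squares) against K; the "elbow" is where the decrease flattens out. A sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Inertia for K = 1..8; look for the K where further gains become marginal
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in range(1, 9)
]
print(inertias)
```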

Silhouette Score Analysis

Cluster quality metric (higher is better):

No plot data available
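The silhouette analysis can be recomputed with `sklearn.metrics.silhouette_score`, which averages each sample's silhouette coefficient; a common heuristic is to pick the K that maximizes it. A sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Mean silhouette coefficient for each candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```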

Cluster Characteristics

Mean feature values for each cluster:

No plot data available
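The per-cluster feature means behind this chart can be computed directly with a pandas groupby; a sketch, assuming pandas is available:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Mean feature value per cluster: one row per cluster, one column per feature
profile = df.groupby("cluster").mean()
print(profile)
```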

Cluster Size Distribution

Number of samples in each cluster:

No plot data available

Feature Importance for Clustering

Which features drive cluster separation?

No plot data available

Common Use Cases

  • Customer Segmentation: Group customers by behavior, preferences
  • Image Compression: Reduce colors by clustering similar pixels
  • Anomaly Detection: Identify outliers far from cluster centers
  • Document Clustering: Group similar documents or articles
  • Market Segmentation: Identify market niches
  • Genomics: Group genes with similar expression patterns
  • Recommendation Systems: User/item grouping for recommendations
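The image-compression use case above is easy to sketch: cluster the pixel colors, then replace every pixel with its cluster's centroid color. A minimal sketch on a synthetic image (a real image would be loaded with e.g. Pillow, not shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 32x32 RGB image, flattened to one row per pixel
pixels = rng.integers(0, 256, size=(32 * 32, 3)).astype(float)

# Quantize to an 8-color palette: each pixel snaps to its nearest centroid
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_]

# The compressed image contains at most 8 distinct colors
print(len(np.unique(compressed, axis=0)))
```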

Key Settings

Essential Parameters

  • n_clusters: Number of clusters to form (K)
  • init: Initialization method ("k-means++", "random")
  • n_init: Number of times to run with different seeds
  • max_iter: Maximum iterations per run
  • tol: Convergence tolerance

Algorithm Variants

  • algorithm: "lloyd" (standard), "elkan" (faster for dense data)
  • random_state: Reproducible results

Advanced Configuration

  • n_jobs: Parallel processing (-1 for all cores)
  • verbose: Progress output level

Performance Metrics

  • Silhouette Score: 0.76 (good separation)
  • Davies-Bouldin Index: 0.42 (lower is better)
  • Calinski-Harabasz Index: 561.6 (higher is better)
  • Inertia (WCSS): 78.85
  • Purity: 96.0% (agreement with true labels)
  • Adjusted Rand Index: 0.88
  • Convergence: 7 iterations
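All of the metrics listed here are available in `sklearn.metrics`. A sketch recomputing them on the Iris run (the exact values you get may differ slightly from the figures above depending on library version and preprocessing):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y_true = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: no ground truth required
sil = metrics.silhouette_score(X, labels)
db = metrics.davies_bouldin_score(X, labels)
ch = metrics.calinski_harabasz_score(X, labels)

# External metrics: compare against the known species labels
ari = metrics.adjusted_rand_score(y_true, labels)
nmi = metrics.normalized_mutual_info_score(y_true, labels)

print(sil, db, ch, ari, nmi)
```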

Tips for Success

  1. Feature Scaling: Always standardize features before K-Means
  2. Optimal K: Use elbow method, silhouette analysis
  3. Initialization: k-means++ generally better than random
  4. Multiple Runs: Set n_init ≥ 10 for stability
  5. Distance Metric: K-Means uses Euclidean distance
  6. Outliers: Consider removing before clustering
  7. High Dimensions: Use PCA for visualization and performance
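Tip 1 (feature scaling) is the one most often skipped. Chaining a `StandardScaler` into a pipeline applies the scaling consistently at both fit and predict time; a sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize each feature to zero mean / unit variance before clustering,
# so no single feature dominates the Euclidean distances
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
print(set(labels))
```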

Example Scenarios

Scenario 1: Cluster 0 (Setosa)

  • Characteristics:
    • Small petal length (1.5 cm avg)
    • Small petal width (0.2 cm avg)
    • Wider sepals relative to length
  • Size: 50 flowers (33%)
  • Distinctness: Completely separated from other clusters

Scenario 2: Cluster 1 (Versicolor)

  • Characteristics:
    • Medium petal length (4.3 cm avg)
    • Medium petal width (1.3 cm avg)
    • Moderate sepal dimensions
  • Size: 48 flowers (32%)
  • Distinctness: Some overlap with Virginica

Scenario 3: Cluster 2 (Virginica)

  • Characteristics:
    • Large petal length (5.7 cm avg)
    • Large petal width (2.1 cm avg)
    • Longest sepals overall
  • Size: 52 flowers (35%)
  • Distinctness: Slight overlap with Versicolor

Troubleshooting

Problem: Poor cluster quality (low silhouette score)

  • Solution: Try different K values, remove outliers, normalize features

Problem: Clusters dominated by one feature

  • Solution: Ensure proper feature scaling, consider feature selection

Problem: Results vary between runs

  • Solution: Increase n_init, set random_state, use k-means++

Problem: Slow convergence

  • Solution: Relax the tol convergence tolerance, cap max_iter, use the elkan algorithm, or subsample the data

Problem: Empty clusters created

  • Solution: Reduce n_clusters, improve initialization, remove duplicates

K-Means vs Other Clustering Methods

Method        Speed    Shape Flexibility   Scalability   Requires K
K-Means       Fast     Spherical only      Excellent     Yes
DBSCAN        Medium   Arbitrary           Good          No
Hierarchical  Slow     Arbitrary           Poor          No
GMM           Medium   Elliptical          Good          Yes
Mean-Shift    Slow     Arbitrary           Poor          No

Cluster Validation Metrics

Internal Metrics (no ground truth needed)

  • Silhouette Score: [-1, 1], higher better (0.76)
  • Davies-Bouldin: [0, ∞], lower better (0.42)
  • Calinski-Harabasz: [0, ∞], higher better (561.6)

External Metrics (with ground truth)

  • Adjusted Rand Index: [-1, 1], higher better (0.88)
  • Purity: [0, 1], higher better (0.96)
  • Normalized Mutual Information: [0, 1], higher better (0.85)

Next Steps

After performing K-Means clustering, you can:

  • Apply cluster labels to new data
  • Use clusters as features for supervised learning
  • Analyze cluster profiles for business insights
  • Create customer personas from segments
  • Build targeted marketing campaigns per cluster
  • Perform hierarchical clustering within large clusters
  • Compare with other clustering algorithms (DBSCAN, GMM)
  • Visualize in lower dimensions with t-SNE or UMAP
