K-Means
Fast and scalable algorithm that partitions data into k clusters by minimizing within-cluster variance.
When to use:
- Know approximately how many clusters to expect
- Clusters are roughly spherical and of similar size
- Need fast results on large datasets
- Good starting point for exploration
Strengths: very fast, scalable to large datasets, simple and interpretable, consistent results.
Weaknesses: must specify k in advance, assumes spherical clusters, sensitive to outliers, poor with varying cluster sizes.
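A minimal sketch of the basic workflow, assuming scikit-learn's `KMeans` (this document does not name a specific library; the class and parameter names below are that library's API):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated, roughly spherical blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

# Fit with k=3; random_state makes the run reproducible
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

# km.cluster_centers_ holds one center per cluster,
# km.inertia_ the final within-cluster sum of squares
```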
Model Parameters
N Clusters (default: 8, required) Number of clusters to form. This is the most important parameter.
- Too low: Merges distinct groups
- Too high: Splits natural groups
- Use elbow method or silhouette analysis to find optimal k
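Silhouette analysis, mentioned above, can be sketched as a sweep over candidate k values (again assuming scikit-learn; `silhouette_score` is its metric helper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three obvious groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

# Higher silhouette score (closer to 1) means better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

For the elbow method, plot `KMeans(...).fit(X).inertia_` against k instead and look for the bend where adding clusters stops paying off.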
Init Method (default: "k-means++") How to initialize cluster centers:
- k-means++: Smart initialization (default, better convergence)
- random: Random initialization (faster but may give poor results)
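The two init methods can be compared directly; a sketch assuming scikit-learn's `init` parameter, using `inertia_` (final within-cluster variance) as the quality measure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

# n_init=1 isolates the effect of a single initialization;
# in practice you would keep several restarts (n_init > 1)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X)
km_rand = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)

# k-means++ spreads initial centers apart, so it typically converges
# faster and is less likely to land in a poor local minimum
```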
Max Iterations (default: 300) Maximum number of iterations for convergence.
- 100-300: Usually sufficient
- 500+: For difficult datasets or large k
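Whether the cap was actually reached can be checked after fitting; in scikit-learn (assumed here) the fitted model exposes `n_iter_`, the number of iterations actually run:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=42).fit(X)
# If km.n_iter_ equals max_iter, the run was cut off before converging
# and max_iter should be raised
```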
Random State (default: 42) Seed for reproducibility. Keep consistent for comparable results.
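Reproducibility is easy to verify: two fits with the same seed produce identical centers (a sketch assuming scikit-learn's `random_state` parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 5, 10)])

# Same seed, same data -> same initialization -> same final centers
km1 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
km2 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
```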