BIRCH
Scalable incremental clustering for large datasets
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) builds a compact CF tree summary of the data and performs clustering on the summary. It is designed for very large datasets that don't fit in memory.
When to use:
- Very large datasets where other algorithms are too slow
- Streaming or incremental data where the model needs updating without full retraining
- When memory efficiency is critical
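The incremental use case above can be sketched with scikit-learn's `Birch`, which exposes a `partial_fit` method for exactly this pattern. This is an illustrative sketch, not necessarily the backend this product uses; the batch sizes and random data are assumptions for the example.

```python
# Sketch of incremental BIRCH updates using scikit-learn (assumed implementation).
# Each partial_fit call inserts the batch into the existing CF tree,
# so the model is updated without retraining on all previous data.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(n_clusters=3, threshold=0.5, branching_factor=50)

# Feed the data in batches, as if it arrived from a stream.
for _ in range(5):
    batch = rng.normal(size=(200, 4))  # 200 rows, 4 feature columns
    model.partial_fit(batch)

# The updated model can label new rows at any point.
labels = model.predict(rng.normal(size=(10, 4)))
```

Because only the compact CF summary is kept in memory, this scales to datasets far larger than RAM.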
Input: Tabular data with the feature columns defined during training
Output: Cluster label for each row
Model Settings (set during training, used at inference)
N Clusters (default: 3)
Number of final clusters after the optional refinement step. Set to null to return CF subclusters directly.
Threshold (default: 0.5)
Maximum radius of a subcluster in the CF tree. Smaller values create more, finer-grained subclusters at the cost of a larger tree.
Branching Factor (default: 50)
Maximum number of CF entries per node in the tree. When a node exceeds this limit it is split, which shapes the width and depth of the tree.
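The three settings above map directly onto the parameters of scikit-learn's `Birch`, shown here as a hedged sketch (the actual training pipeline behind this page is not specified; the data is synthetic):

```python
# Illustrative training run wiring up the documented settings
# (scikit-learn's Birch assumed as the reference implementation).
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(42).normal(size=(1000, 4))

model = Birch(
    n_clusters=3,         # N Clusters: final refinement into 3 clusters
    threshold=0.5,        # Threshold: max subcluster radius in the CF tree
    branching_factor=50,  # Branching Factor: max CF entries per node
)
labels = model.fit_predict(X)  # one cluster label per row
```

Passing `n_clusters=None` skips the refinement step and returns the raw CF subcluster labels, matching the "null" behavior described above.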
Inference Settings
No dedicated inference-time settings. New points traverse the CF tree to find their nearest subcluster.
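The lookup described above can be sketched as follows, again assuming scikit-learn's `Birch`: `predict` walks new rows to their nearest subcluster centroid and returns the cluster label that subcluster was assigned during training.

```python
# Inference sketch: new rows are routed to the nearest CF subcluster,
# whose label from the training-time refinement step is returned.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(7)
model = Birch(n_clusters=3, threshold=0.5, branching_factor=50)
model.fit(rng.normal(size=(500, 4)))

new_points = rng.normal(size=(5, 4))   # rows with the same 4 feature columns
labels = model.predict(new_points)     # one cluster label per row
```

No thresholds or other knobs apply at this stage; all behavior is fixed by the settings chosen during training.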