t-SNE
t-Distributed Stochastic Neighbor Embedding for visualizing high-dimensional data in 2D or 3D
When to use:
- Need 2D/3D visualization
- Want to reveal cluster structure
- Have a moderate-sized dataset (<10k samples)
- Don't need to embed new data after fitting
- Exploration and presentation
Strengths: Excellent visualizations, reveals clusters beautifully, preserves local structure
Weaknesses: Very slow, no inference on new data, sensitive to hyperparameters, different runs give different results, doesn't preserve global structure
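As a minimal sketch of the workflow described above, the example below assumes the scikit-learn implementation (sklearn.manifold.TSNE), whose parameters this page appears to mirror, and uses the small Iris dataset purely for illustration. It also shows what "no inference on new data" means in practice: the estimator exposes fit_transform but no separate transform method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Illustrative data only: 150 samples with 4 features each
X, y = load_iris(return_X_y=True)

# Values taken from the defaults listed on this page
tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
embedding = tsne.fit_transform(X)  # one 2D coordinate pair per input row

# The embedding exists only for the fitted points; there is no transform()
# that could place previously unseen samples into the same map.
can_embed_new_points = hasattr(tsne, "transform")
print(embedding.shape, can_embed_new_points)
```

If new samples must be projected later, a method with an explicit out-of-sample mapping (PCA, for example) is usually a better fit than t-SNE.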
Model Parameters
N Components (default: 2, required) Embedding dimensions (typically 2 or 3 for visualization).
- Max: 3 (designed for visualization)
Perplexity (default: 30.0) Controls the balance between local and global structure. Roughly the effective number of close neighbors each point considers.
- Small (5-15): Emphasizes local structure, many small clusters
- Medium (30-50): Balanced (default)
- Large (50-100): More global structure, fewer clusters
- Rule: Must be less than the number of samples (n_samples)
- Larger datasets usually benefit from larger perplexity
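The "number of close neighbors" reading of perplexity can be made concrete with a little NumPy. The sketch below is not any library's implementation. It assumes the standard definition (perplexity is 2 raised to the entropy, in bits, of a point's neighbor distribution) and searches for the Gaussian bandwidth sigma that reaches a target perplexity. The helper names and the random squared distances are hypothetical.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution, with entropy H in bits."""
    p = p[p > 0]
    return float(2.0 ** (-np.sum(p * np.log2(p))))

def conditional_probs(sq_dists, target_perplexity, tol=1e-5, max_steps=200):
    """Bisect geometrically on sigma until the Gaussian neighbor
    distribution over sq_dists reaches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_steps):
        sigma = np.sqrt(lo * hi)
        logits = -sq_dists / (2.0 * sigma**2)
        p = np.exp(logits - logits.max())  # subtract the max for numerical stability
        p /= p.sum()
        current = perplexity(p)
        if abs(current - target_perplexity) < tol:
            break
        if current > target_perplexity:
            hi = sigma  # distribution too spread out: shrink the bandwidth
        else:
            lo = sigma  # distribution too peaked: widen the bandwidth
    return sigma, p

# Probability spread evenly over 10 neighbors has a perplexity of about 10
uniform_perplexity = perplexity(np.full(10, 0.1))

# Hypothetical squared distances from one point to 200 other points
rng = np.random.default_rng(0)
sq_dists = rng.uniform(0.5, 25.0, size=200)

sigma_small, p_small = conditional_probs(sq_dists, 5.0)
sigma_large, p_large = conditional_probs(sq_dists, 50.0)
print(uniform_perplexity, sigma_small, sigma_large)
```

A larger target perplexity forces a wider bandwidth, so each point "listens" to more neighbors. This is why small values emphasize local structure and large values bring in more global structure.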
Learning Rate (default: 200.0) Step size for gradient descent optimization.
- Too low (<10): Slow convergence, poor results
- Good range (10-1000): Depends on data
- Too high (>1000): Unstable, poor results
- Try: [10, 100, 200, 500, 1000]
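One way to try the suggested learning rates is a small sweep that records the final KL divergence of each run. This is a sketch assuming scikit-learn's TSNE and its kl_divergence_ attribute, with Iris standing in for real data. A lower KL divergence only means the optimizer fit its own objective better, so the plots should still be inspected visually.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)  # illustrative data only

kl_by_learning_rate = {}
for lr in [10, 100, 200, 500, 1000]:
    # Same seed and initialization for every run, so only the step size varies
    model = TSNE(n_components=2, learning_rate=float(lr), init="pca", random_state=42)
    model.fit(X)
    kl_by_learning_rate[lr] = float(model.kl_divergence_)

for lr, kl in kl_by_learning_rate.items():
    print(f"learning_rate={lr:>5}: final KL divergence = {kl:.3f}")
```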
Max Iterations (default: 1000) Number of optimization iterations.
- Minimum 250: Very fast but may not converge
- 1000: Standard (default)
- 2000-5000: Better convergence for difficult data
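The iteration budget can be compared in the same way. The sketch below again assumes scikit-learn. Because the keyword for this setting has been named n_iter in older releases and max_iter in newer ones, the code inspects the constructor signature rather than hard-coding either name. That lookup is a convenience for this example, not part of the tool described here.

```python
import inspect

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)  # illustrative data only

# Assumption: the installed scikit-learn accepts either `max_iter` or `n_iter`
params = inspect.signature(TSNE).parameters
iter_kw = "max_iter" if "max_iter" in params else "n_iter"

runs = {}
for budget in (250, 1000):
    model = TSNE(n_components=2, random_state=42, **{iter_kw: budget})
    model.fit(X)
    # n_iter_ reports how many iterations were actually run
    runs[budget] = (int(model.n_iter_), float(model.kl_divergence_))

for budget, (n_run, kl) in runs.items():
    print(f"budget={budget}: ran {n_run} iterations, KL divergence = {kl:.3f}")
```

The optimizer may stop before the budget is exhausted if progress stalls, so the reported iteration count can be lower than the requested maximum.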
Metric (default: "euclidean") Distance metric for high-dimensional space:
- euclidean: Standard distance (default)
- manhattan: L1 distance
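- cosine: Cosine distance (1 - cosine similarity); depends on the angle between vectors, not their magnitude
- correlation: Correlation distance (1 - Pearson correlation)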
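To see how these metrics differ, the sketch below evaluates them on three tiny vectors using SciPy's distance functions (SciPy calls the Manhattan distance cityblock), then passes one of the metric names to scikit-learn's TSNE. Both libraries are assumptions here, and the vectors are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a   # same direction as a, twice the length
c = a + 10.0  # same shape as a, shifted by a constant

print("euclidean(a, b)  =", distance.euclidean(a, b))    # sensitive to scale
print("manhattan(a, b)  =", distance.cityblock(a, b))    # sensitive to scale
print("cosine(a, b)     =", distance.cosine(a, b))       # ignores scale
print("cosine(a, c)     =", distance.cosine(a, c))       # sensitive to offset
print("correlation(a, c)=", distance.correlation(a, c))  # ignores scale and offset

# The metric name is passed straight through to the embedding
X, _ = load_iris(return_X_y=True)  # illustrative data only
embedding = TSNE(n_components=2, metric="cosine", random_state=42).fit_transform(X)
print(embedding.shape)
```

Cosine and correlation tend to suit data where the direction or profile of a vector matters more than its magnitude, such as text embeddings or expression profiles.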
Random State (default: 42) Seed for reproducibility (t-SNE is stochastic).
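The effect of the seed can be checked directly. This sketch assumes scikit-learn. It uses init="random" so that the seed visibly drives the starting layout, and method="exact" to avoid the approximate Barnes-Hut path, which is affordable for a dataset this small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)  # illustrative data only

def embed(seed):
    """Run t-SNE from a random starting layout determined by `seed`."""
    model = TSNE(n_components=2, init="random", method="exact", random_state=seed)
    return model.fit_transform(X)

first = embed(42)
repeat = embed(42)
other = embed(7)

same_seed_matches = bool(np.allclose(first, repeat))
different_seed_matches = bool(np.allclose(first, other))
print(same_seed_matches, different_seed_matches)
```

Embeddings from different seeds can still show the same clusters while differing in rotation, reflection, and the relative placement of clusters, so compare runs by structure rather than by coordinates.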