Dimensionality Reduction
Reduce high-dimensional data while preserving important information
Dimensionality reduction transforms high-dimensional data into fewer dimensions while preserving essential patterns and structure. Use it for visualization, noise reduction, feature extraction, or speeding up downstream models.
🎓 Learn About Dimensionality Reduction
New to dimensionality reduction? Visit our Dimensionality Reduction Concepts Guide to learn about the curse of dimensionality, evaluation metrics (Explained Variance, Trustworthiness), and when to use these techniques for your data.
Available Models
We support the following dimensionality reduction techniques:
Linear Methods
- PCA - Principal Component Analysis for variance-based reduction
- Truncated SVD - SVD for sparse matrices and text data
- Factor Analysis - Statistical method for latent factors
- ICA - Independent Component Analysis for source separation
- NMF - Non-negative matrix factorization for parts-based representations
- LDA - Linear Discriminant Analysis (supervised)
Non-Linear Manifold Methods
- t-SNE - Powerful visualization for 2D/3D embeddings
- UMAP - Fast, preserves both local and global structure
- Isomap - Geodesic distance preservation
- LLE - Locally Linear Embedding for local relationships
- MDS - Multidimensional Scaling for distance preservation
- Spectral Embedding - Graph-based embedding
Kernel Methods
- Kernel PCA - Non-linear PCA with kernel trick
Common Configuration
Feature Configuration
Feature Columns (required) Select which columns to use for dimensionality reduction. Include all relevant numerical features that contribute to the patterns you want to preserve.
Number of Components (required) Target number of dimensions. Common choices:
- 2-3: For visualization
- 10-50: For feature extraction before modeling
- Enough to retain ~80-95% of cumulative explained variance: use a scree plot to decide
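The variance-based choice above can be automated with PCA's cumulative explained variance. A minimal sketch using scikit-learn's bundled digits dataset; the 95% threshold is illustrative:

```python
# Pick the smallest number of PCA components that retains ~95% of the
# variance. Dataset and threshold are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

pca = PCA().fit(X)  # keep all components to inspect the full spectrum
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumvar, 0.95)) + 1  # first count reaching 95%
print(f"{n_95} components retain {cumvar[n_95 - 1]:.1%} of the variance")
```

Plotting `cumvar` against the component index gives the scree/cumulative-variance plot mentioned above.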
Hyperparameter Tuning
Some models support hyperparameter tuning:
- Grid Search: Systematic exploration
- Random Search: Faster approximate search
- Bayesian Search: Intelligent optimization
Scoring Metrics:
- Explained Variance: For linear methods (higher is better)
- Trustworthiness: For manifold methods (higher is better, 0-1)
- Reconstruction Error: How well data can be reconstructed (lower is better)
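For linear models, a grid search over `n_components` can be sketched with scikit-learn. With no explicit scorer, `GridSearchCV` falls back to `PCA.score()`, the average log-likelihood of held-out data under the probabilistic PCA model; the dataset and grid here are illustrative:

```python
# Grid search over the number of PCA components. Without a scoring
# argument, GridSearchCV uses PCA.score(): average log-likelihood.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

search = GridSearchCV(PCA(), {"n_components": [1, 2, 3]}, cv=5)
search.fit(X)
print("best n_components:", search.best_params_["n_components"])
```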
Understanding Dimensionality Reduction Metrics
Explained Variance Ratio
Proportion of variance explained by each component (linear methods).
- Use for: PCA, Factor Analysis, Truncated SVD
- Interpretation: Cumulative sum should be high (typically 80-95%+ for a good reduction)
- Cumulative plot: Shows how many components needed for desired variance
Trustworthiness
Measures whether nearest neighbors in high-D remain nearest in low-D (0-1, higher is better).
- Use for: t-SNE, UMAP, manifold methods
- Good: >0.9 (excellent), 0.8-0.9 (good)
- Interpretation: How well local structure is preserved
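Trustworthiness is available directly in scikit-learn as `sklearn.manifold.trustworthiness`. A minimal sketch on the iris dataset; perplexity and `n_neighbors` values are illustrative:

```python
# Trustworthiness of a t-SNE embedding: fraction-like score in [0, 1]
# measuring how well low-D nearest neighbors match high-D ones.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness

X = load_iris().data
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
score = trustworthiness(X, X_2d, n_neighbors=5)
print(f"trustworthiness: {score:.3f}")  # 1.0 = local structure fully preserved
```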
Reconstruction Error
Error when reconstructing original data from reduced representation.
- Lower is better: Smaller error = better preservation
- Use for: All methods that support inverse_transform
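A sketch of computing reconstruction error via `inverse_transform` (supported by PCA, NMF, Truncated SVD, and Kernel PCA with `fit_inverse_transform=True`, among others); the component count is illustrative:

```python
# Mean squared reconstruction error: project down, map back, compare.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
pca = PCA(n_components=20).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))  # back to original space
mse = np.mean((X - X_rec) ** 2)
print(f"reconstruction MSE with 20 components: {mse:.3f}")
```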
Stress (for MDS)
Measure of the discrepancy between the original and embedded pairwise distances (lower is better). The thresholds below apply to the normalized (Kruskal) stress-1:
- Good: <0.1 (excellent), 0.1-0.2 (good)
- Poor: >0.2 (poor fit)
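Note that scikit-learn's `MDS` exposes raw stress in its `stress_` attribute; Kruskal's normalized stress-1, to which the thresholds above apply, can be computed from the distance matrices. A minimal sketch on the iris dataset:

```python
# Kruskal stress-1 for a metric MDS embedding: sqrt of the squared
# distance discrepancies, normalized by the original squared distances.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

X = load_iris().data
emb = MDS(n_components=2, random_state=0).fit_transform(X)

d_high = euclidean_distances(X)    # pairwise distances in original space
d_low = euclidean_distances(emb)   # pairwise distances in the embedding
stress1 = np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))
print(f"Kruskal stress-1: {stress1:.3f}")
```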
Choosing the Right Model
Quick Start Guide
- Start with PCA: Fast baseline, interpretable
- Try UMAP: If non-linear structure expected
- Use t-SNE: For beautiful 2D visualizations
- Go supervised (LDA): If you have labels and want separation
By Goal
Visualization (2D/3D):
- Best: t-SNE, UMAP
- Fast: PCA
- With labels: LDA
Feature Extraction (before modeling):
- Best: PCA, UMAP
- With labels: LDA
- Text data: Truncated SVD
- Non-negative: NMF
Data Compression:
- Best: PCA, Truncated SVD
- Non-negative: NMF
Noise Reduction:
- Best: PCA, Factor Analysis
- Signal separation: ICA
By Data Type
Dense numerical:
- PCA, UMAP, t-SNE, Kernel PCA
Sparse (text):
- Truncated SVD, NMF
Images:
- PCA, NMF
Time series / signals:
- ICA, PCA
With labels:
- LDA
Non-negative:
- NMF
By Data Size
Small (<1k samples):
- Any method
Medium (1k-10k):
- PCA, UMAP, LDA, Truncated SVD
Large (>10k):
- PCA, UMAP, Truncated SVD
- Avoid: t-SNE, MDS, LLE
By Requirements
Need inference on new data:
- Yes: PCA, UMAP, Truncated SVD, LDA, Kernel PCA, Isomap, ICA, NMF, Factor Analysis
- No: t-SNE, LLE, MDS, Spectral Embedding
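This distinction is visible directly in scikit-learn: estimators that support inference on new data expose a `transform()` method, while t-SNE offers only `fit_transform`. A minimal sketch on the iris dataset:

```python
# Methods with transform() can embed unseen samples; t-SNE cannot.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X_train, X_new = train_test_split(load_iris().data, random_state=0)

pca = PCA(n_components=2).fit(X_train)
Z_new = pca.transform(X_new)  # inference on new samples: supported

print(hasattr(TSNE(n_components=2), "transform"))  # False: fit_transform only
```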
Need interpretability:
- High: PCA, LDA, NMF, Factor Analysis
- Medium: Truncated SVD, ICA
- Low: t-SNE, UMAP, Kernel PCA
Need speed:
- Fastest: PCA, Truncated SVD
- Fast: UMAP, LDA
- Slow: t-SNE, MDS, LLE
Best Practices
- Scale your features - Essential for distance-based methods (PCA, t-SNE, UMAP)
- Start with PCA - Fast baseline to understand your data
- Check explained variance - Plot cumulative variance to choose n_components
- Try multiple methods - Different methods reveal different aspects
- Tune hyperparameters - Especially for t-SNE (perplexity, learning_rate) and UMAP (n_neighbors, min_dist)
- Validate results - Use downstream task performance or visualization quality
- Use appropriate metrics - Explained variance for linear, trustworthiness for manifold methods
- Consider data size - Large datasets need scalable methods (PCA, UMAP)
- Match method to goal - Visualization vs. feature extraction need different approaches
- Reproducibility - Always set random_state for stochastic methods
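Several of these practices can be combined in one sketch: scale features, fix `random_state`, and validate the reduction by its downstream task performance. The dataset, component count, and classifier are illustrative:

```python
# Scale -> reduce -> classify, evaluated with cross-validation so the
# scaler and PCA are refit on each training fold (no leakage).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(
    StandardScaler(),                      # scale before variance-based PCA
    PCA(n_components=2, random_state=0),   # fixed seed for reproducibility
    LogisticRegression(max_iter=1000),
)
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(f"CV accuracy on 2 PCA components: {acc:.3f}")
```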
Common Pitfalls
- Not scaling data: Distance-based methods are sensitive to feature scales
- Too few components: Missing important variance/structure
- Too many components: Including noise
- Wrong method for goal: t-SNE for feature extraction (no inference!)
- Ignoring explained variance: Not checking how much information is preserved
- Over-interpreting t-SNE: Distances between clusters are not meaningful
- Default hyperparameters: t-SNE and UMAP benefit greatly from tuning
- Using on small data: Manifold methods need sufficient samples
Tips for Better Results
For PCA:
- Plot scree plot (explained variance)
- Check component loadings for interpretation
- Scale features to same range
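A sketch of the loadings check: each row of `components_` gives the weight of every original feature in that component. Feature names come from the iris dataset; loading signs are arbitrary:

```python
# Inspect which original features dominate each principal component.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X)

for i, comp in enumerate(pca.components_):
    # Sort features by absolute loading: largest contributors first
    top = sorted(zip(data.feature_names, comp), key=lambda t: -abs(t[1]))
    print(f"PC{i + 1}:", [(name, round(w, 2)) for name, w in top[:2]])
```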
For t-SNE:
- Run multiple times with different perplexities (5, 30, 50, 100)
- Increase the number of iterations if the embedding is still changing
- Try different random seeds
- Don't over-interpret global structure
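The perplexity sweep can be scored with trustworthiness instead of eyeballing plots. A minimal sketch on a subsample of the digits dataset (subsampled because t-SNE scales poorly; the perplexity values are illustrative):

```python
# Compare t-SNE embeddings across perplexities via trustworthiness.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

X = load_digits().data[:300]  # subsample: t-SNE is slow on large n
for perp in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    print(f"perplexity={perp}: trustworthiness={trustworthiness(X, emb):.3f}")
```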
For UMAP:
- Tune n_neighbors (local vs. global trade-off)
- Tune min_dist (clumpy vs. spread out)
- Works well with >2 components for feature extraction
- Much faster than t-SNE
For LDA:
- Need sufficient samples per class
- Works best with normally distributed classes
- Max components = n_classes - 1
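The component cap is easy to see in practice: iris has 3 classes, so LDA can produce at most 2 components. A minimal sketch with scikit-learn:

```python
# Supervised reduction with LDA; n_components is capped at n_classes - 1.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes -> at most 2 components
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_2d = lda.transform(X)
print(X_2d.shape)  # (150, 2)
```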
Next Steps
Ready to reduce? Head to the Training page and:
- Select your dataset
- Scale your features if needed
- Choose a method based on this guide
- Start with 2 components for visualization
- Evaluate with appropriate metrics
- Tune hyperparameters for better results
- Use reduced data for visualization or downstream tasks