Dimensionality Reduction
Reduce high-dimensional data while preserving important information
Dimensionality reduction transforms high-dimensional data into fewer dimensions while preserving essential patterns and structure. Use it for visualization, noise reduction, feature extraction, or speeding up downstream models.
🎓 Learn About Dimensionality Reduction
New to dimensionality reduction? Visit our Dimensionality Reduction Concepts Guide to learn about the curse of dimensionality, evaluation metrics (Explained Variance, Trustworthiness), and when to use these techniques for your data.
Available Models
We support the following dimensionality reduction techniques:
Linear Methods
- PCA - Principal Component Analysis for variance-based reduction
- Truncated SVD - SVD for sparse matrices and text data
- Factor Analysis - Statistical method for latent factors
- ICA - Independent Component Analysis for source separation
- NMF - Non-negative matrix factorization for parts-based representations
- LDA - Linear Discriminant Analysis (supervised)
Non-Linear Manifold Methods
- t-SNE - Powerful visualization for 2D/3D embeddings
- UMAP - Fast, preserves both local and global structure
- Isomap - Geodesic distance preservation
- LLE - Locally Linear Embedding for local relationships
- MDS - Multidimensional Scaling for distance preservation
- Spectral Embedding - Graph-based embedding
Kernel Methods
- Kernel PCA - Non-linear PCA with kernel trick
Common Configuration
Feature Configuration
Feature Columns (required) Select which columns to use for dimensionality reduction. Include all relevant numerical features that contribute to the patterns you want to preserve.
Number of Components (required) Target number of dimensions. Common choices:
- 2-3: For visualization
- 10-50: For feature extraction before modeling
- Enough to retain ~80-95% of cumulative explained variance: use a scree plot to decide
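The variance-based choice above can be automated with PCA's cumulative explained variance. A minimal sketch using scikit-learn's bundled digits dataset; the 95% threshold is illustrative:

```python
# Pick the smallest number of PCA components that retains ~95% of the
# variance. Dataset and threshold are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

pca = PCA().fit(X)  # keep all components to inspect the full spectrum
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumvar, 0.95)) + 1  # first count reaching 95%
print(f"{n_95} components retain {cumvar[n_95 - 1]:.1%} of the variance")
```

Plotting `cumvar` against the component index gives the scree/cumulative-variance plot mentioned above.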
Hyperparameter Tuning
Some models support hyperparameter tuning:
- Grid Search: Systematic exploration
- Random Search: Faster approximate search
- Bayesian Search: Intelligent optimization
Scoring Metrics:
- Explained Variance: For linear methods (higher is better)
- Trustworthiness: For manifold methods (higher is better, 0-1)
- Reconstruction Error: How well data can be reconstructed (lower is better)
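For linear models, a grid search over `n_components` can be sketched with scikit-learn. With no explicit scorer, `GridSearchCV` falls back to `PCA.score()`, the average log-likelihood of held-out data under the probabilistic PCA model; the dataset and grid here are illustrative:

```python
# Grid search over the number of PCA components. Without a scoring
# argument, GridSearchCV uses PCA.score(): average log-likelihood.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

search = GridSearchCV(PCA(), {"n_components": [1, 2, 3]}, cv=5)
search.fit(X)
print("best n_components:", search.best_params_["n_components"])
```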
Understanding Dimensionality Reduction Metrics
Explained Variance Ratio
Proportion of variance explained by each component (linear methods).
- Use for: PCA, Factor Analysis, Truncated SVD
- Interpretation: Cumulative sum should be high (typically 80-95%+ for a good reduction)
- Cumulative plot: Shows how many components needed for desired variance
Trustworthiness
Measures whether nearest neighbors in high-D remain nearest in low-D (0-1, higher is better).
- Use for: t-SNE, UMAP, manifold methods
- Good: >0.9 (excellent), 0.8-0.9 (good)
- Interpretation: How well local structure is preserved
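Trustworthiness is available directly in scikit-learn as `sklearn.manifold.trustworthiness`. A minimal sketch on the iris dataset; perplexity and `n_neighbors` values are illustrative:

```python
# Trustworthiness of a t-SNE embedding: fraction-like score in [0, 1]
# measuring how well low-D nearest neighbors match high-D ones.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness

X = load_iris().data
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
score = trustworthiness(X, X_2d, n_neighbors=5)
print(f"trustworthiness: {score:.3f}")  # 1.0 = local structure fully preserved
```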
Reconstruction Error
Error when reconstructing original data from reduced representation.
- Lower is better: Smaller error = better preservation
- Use for: All methods that support inverse_transform
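A sketch of computing reconstruction error via `inverse_transform` (supported by PCA, NMF, Truncated SVD, and Kernel PCA with `fit_inverse_transform=True`, among others); the component count is illustrative:

```python
# Mean squared reconstruction error: project down, map back, compare.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
pca = PCA(n_components=20).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))  # back to original space
mse = np.mean((X - X_rec) ** 2)
print(f"reconstruction MSE with 20 components: {mse:.3f}")
```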
Stress (for MDS)
Measure of the discrepancy between the original and embedded pairwise distances (lower is better). The thresholds below apply to the normalized (Kruskal) stress-1:
- Good: <0.1 (excellent), 0.1-0.2 (good)
- Poor: >0.2 (poor fit)
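Note that scikit-learn's `MDS` exposes raw stress in its `stress_` attribute; Kruskal's normalized stress-1, to which the thresholds above apply, can be computed from the distance matrices. A minimal sketch on the iris dataset:

```python
# Kruskal stress-1 for a metric MDS embedding: sqrt of the squared
# distance discrepancies, normalized by the original squared distances.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

X = load_iris().data
emb = MDS(n_components=2, random_state=0).fit_transform(X)

d_high = euclidean_distances(X)    # pairwise distances in original space
d_low = euclidean_distances(emb)   # pairwise distances in the embedding
stress1 = np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))
print(f"Kruskal stress-1: {stress1:.3f}")
```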
Choosing the Right Model
Quick Start Guide
- Start with PCA: Fast baseline, interpretable
- Try UMAP: If non-linear structure expected
- Use t-SNE: For beautiful 2D visualizations
- Go supervised (LDA): If you have labels and want separation
By Goal
Visualization (2D/3D):
- Best: t-SNE, UMAP
- Fast: PCA
- With labels: LDA
Feature Extraction (before modeling):
- Best: PCA, UMAP
- With labels: LDA
- Text data: Truncated SVD
- Non-negative: NMF
Data Compression:
- Best: PCA, Truncated SVD
- Non-negative: NMF
Noise Reduction:
- Best: PCA, Factor Analysis
- Signal separation: ICA
By Data Type
Dense numerical:
- PCA, UMAP, t-SNE, Kernel PCA
Sparse (text):
- Truncated SVD, NMF
Images:
- PCA, NMF
Time series / signals:
- ICA, PCA
With labels:
- LDA
Non-negative:
- NMF
By Data Size
Small (<1k samples):
- Any method
Medium (1k-10k):
- PCA, UMAP, LDA, Truncated SVD
Large (>10k):
- PCA, UMAP, Truncated SVD
- Avoid: t-SNE, MDS, LLE
By Requirements
Need inference on new data:
- Yes: PCA, UMAP, Truncated SVD, LDA, Kernel PCA, Isomap, ICA, NMF, Factor Analysis
- No: t-SNE, LLE, MDS, Spectral Embedding
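This distinction is visible directly in scikit-learn: estimators that support inference on new data expose a `transform()` method, while t-SNE offers only `fit_transform`. A minimal sketch on the iris dataset:

```python
# Methods with transform() can embed unseen samples; t-SNE cannot.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X_train, X_new = train_test_split(load_iris().data, random_state=0)

pca = PCA(n_components=2).fit(X_train)
Z_new = pca.transform(X_new)  # inference on new samples: supported

print(hasattr(TSNE(n_components=2), "transform"))  # False: fit_transform only
```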
Need interpretability:
- High: PCA, LDA, NMF, Factor Analysis
- Medium: Truncated SVD, ICA
- Low: t-SNE, UMAP, Kernel PCA
Need speed:
- Fastest: PCA, Truncated SVD
- Fast: UMAP, LDA
- Slow: t-SNE, MDS, LLE
Best Practices
- Scale your features - Essential for distance-based methods (PCA, t-SNE, UMAP)
- Start with PCA - Fast baseline to understand your data
- Check explained variance - Plot cumulative variance to choose n_components
- Try multiple methods - Different methods reveal different aspects
- Tune hyperparameters - Especially for t-SNE (perplexity, learning_rate) and UMAP (n_neighbors, min_dist)
- Validate results - Use downstream task performance or visualization quality
- Use appropriate metrics - Explained variance for linear, trustworthiness for manifold methods
- Consider data size - Large datasets need scalable methods (PCA, UMAP)
- Match method to goal - Visualization vs. feature extraction need different approaches
- Reproducibility - Always set random_state for stochastic methods
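Several of these practices can be combined in one sketch: scale features, fix `random_state`, and validate the reduction by its downstream task performance. The dataset, component count, and classifier are illustrative:

```python
# Scale -> reduce -> classify, evaluated with cross-validation so the
# scaler and PCA are refit on each training fold (no leakage).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(
    StandardScaler(),                      # scale before variance-based PCA
    PCA(n_components=2, random_state=0),   # fixed seed for reproducibility
    LogisticRegression(max_iter=1000),
)
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(f"CV accuracy on 2 PCA components: {acc:.3f}")
```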
Common Pitfalls
- Not scaling data: Distance-based methods are sensitive to feature scales
- Too few components: Missing important variance/structure
- Too many components: Including noise
- Wrong method for goal: t-SNE for feature extraction (no inference!)
- Ignoring explained variance: Not checking how much information is preserved
- Over-interpreting t-SNE: Distances between clusters are not meaningful
- Default hyperparameters: t-SNE and UMAP benefit greatly from tuning
- Using on small data: Manifold methods need sufficient samples
Tips for Better Results
For PCA:
- Plot scree plot (explained variance)
- Check component loadings for interpretation
- Scale features to same range
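A sketch of the loadings check: each row of `components_` gives the weight of every original feature in that component. Feature names come from the iris dataset; loading signs are arbitrary:

```python
# Inspect which original features dominate each principal component.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X)

for i, comp in enumerate(pca.components_):
    # Sort features by absolute loading: largest contributors first
    top = sorted(zip(data.feature_names, comp), key=lambda t: -abs(t[1]))
    print(f"PC{i + 1}:", [(name, round(w, 2)) for name, w in top[:2]])
```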
For t-SNE:
- Run multiple times with different perplexities (5, 30, 50, 100)
- Increase the number of iterations if the embedding is still changing
- Try different random seeds
- Don't over-interpret global structure
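The perplexity sweep can be scored with trustworthiness instead of eyeballing plots. A minimal sketch on a subsample of the digits dataset (subsampled because t-SNE scales poorly; the perplexity values are illustrative):

```python
# Compare t-SNE embeddings across perplexities via trustworthiness.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

X = load_digits().data[:300]  # subsample: t-SNE is slow on large n
for perp in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    print(f"perplexity={perp}: trustworthiness={trustworthiness(X, emb):.3f}")
```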
For UMAP:
- Tune n_neighbors (local vs. global trade-off)
- Tune min_dist (clumpy vs. spread out)
- Works well with >2 components for feature extraction
- Much faster than t-SNE
For LDA:
- Need sufficient samples per class
- Works best with normally distributed classes
- Max components = n_classes - 1
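The component cap is easy to see in practice: iris has 3 classes, so LDA can produce at most 2 components. A minimal sketch with scikit-learn:

```python
# Supervised reduction with LDA; n_components is capped at n_classes - 1.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes -> at most 2 components
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_2d = lda.transform(X)
print(X_2d.shape)  # (150, 2)
```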
Next Steps
Ready to reduce? Head to the Training page and:
- Select your dataset
- Scale your features if needed
- Choose a method based on this guide
- Start with 2 components for visualization
- Evaluate with appropriate metrics
- Tune hyperparameters for better results
- Use reduced data for visualization or downstream tasks