
Dimensionality Reduction - PCA

Reduce feature dimensions using PCA on the Wisconsin Breast Cancer dataset

This case study demonstrates Principal Component Analysis (PCA) for dimensionality reduction on the Wisconsin Breast Cancer dataset. PCA transforms high-dimensional data into a lower-dimensional space while preserving maximum variance, making it invaluable for visualization, noise reduction, and computational efficiency.

Dataset: Wisconsin Breast Cancer

  • Source: Kaggle (Breast Cancer Wisconsin Diagnostic)
  • Type: Dimensionality reduction
  • Size: 569 samples
  • Original Features: 30 measurements (radius, texture, perimeter, area, smoothness, etc.)
  • Target Dimensions: Reduce to 2-3 components
  • Classes: Malignant (212), Benign (357)
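The same Wisconsin Diagnostic dataset ships with scikit-learn, so you can load it without a Kaggle download; a minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Diagnostic dataset bundled with scikit-learn
data = load_breast_cancer()
X, y = data.data, data.target  # X: (569, 30); y: 0 = malignant, 1 = benign

print(X.shape)                         # (569, 30)
print((y == 0).sum(), (y == 1).sum())  # 212 malignant, 357 benign
```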

Model Configuration

{
  "model": "pca",
  "category": "dimensionality_reduction",
  "model_config": {
    "n_components": 2,
    "whiten": false,
    "svd_solver": "auto",
    "random_state": 42
  }
}
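The configuration above maps directly onto scikit-learn's `PCA` estimator. A minimal sketch, standardizing first since PCA is sensitive to feature scale (exact numbers may differ slightly from the figures reported below):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # PCA assumes centered, comparably scaled features

# Mirrors the model_config block above
pca = PCA(n_components=2, whiten=False, svd_solver="auto", random_state=42)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (569, 2)
print(pca.explained_variance_ratio_.sum())  # ≈ 0.63 on the standardized data
```

Note that `random_state` only affects the "randomized" and "arpack" solvers; on a dataset this small, "auto" resolves to the deterministic full SVD.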

Dimensionality Reduction Results

Variance Explained

How much information is retained in each component:

No plot data available

2D Visualization

Data projected onto first two principal components:

No plot data available

3D Visualization

Three-dimensional representation captures 72.7% variance:

No plot data available

Feature Contributions to PC1

Which original features contribute most to the first component:

No plot data available
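Feature loadings live in the fitted estimator's `components_` attribute (one row per principal axis). A sketch that ranks the original features by their absolute weight on PC1:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_std)

# Each entry of components_[0] is a feature's loading on PC1
pc1 = pca.components_[0]
order = np.argsort(np.abs(pc1))[::-1]  # largest absolute loading first
for idx in order[:5]:
    print(f"{data.feature_names[idx]:25s} {pc1[idx]:+.3f}")
```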

Reconstruction Error

Error when reconstructing original data from reduced dimensions:

No plot data available
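Reconstruction error can be measured directly with `inverse_transform`; for standardized data, the relative error equals one minus the retained explained-variance ratio, which is where the 36.7% figure below comes from. A sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA(n_components=2).fit(X_std)

# Project down to 2-D, then map back to the original 30-D space
X_rec = pca.inverse_transform(pca.transform(X_std))
mse = np.mean((X_std - X_rec) ** 2)  # mean squared reconstruction error

# For standardized data: relative error = 1 - retained variance ratio
lost = 1.0 - pca.explained_variance_ratio_.sum()
print(mse / X_std.var(), lost)  # the two values agree
```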

Classification Performance After PCA

Downstream task performance with reduced dimensions:

No plot data available
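One way to quantify this is to cross-validate the same classifier with and without the projection; a sketch using logistic regression (the report does not specify the downstream model, so this is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Same classifier, with and without the 2-component projection
full = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
reduced = make_pipeline(StandardScaler(), PCA(n_components=2),
                        LogisticRegression(max_iter=5000))

full_acc = cross_val_score(full, X, y, cv=5).mean()
red_acc = cross_val_score(reduced, X, y, cv=5).mean()
print(full_acc, red_acc)  # reduced model trades some accuracy for 2 features
```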

Computational Speedup

Training time reduction with fewer features:

No plot data available

Common Use Cases

  • Data Visualization: Plot high-dimensional data in 2D/3D
  • Noise Reduction: Remove less important variance
  • Feature Extraction: Create new features for ML models
  • Compression: Reduce storage and transmission costs
  • Preprocessing: Remove multicollinearity before regression
  • Image Compression: Reduce image storage (eigenfaces)
  • Anomaly Detection: Detect outliers in reduced space
  • Exploratory Analysis: Understand data structure and patterns

Key Settings

Essential Parameters

  • n_components: Number of components to keep (int or float for variance threshold)
  • svd_solver: "auto", "full", "arpack", "randomized"
  • whiten: Whether to scale transformed components to unit variance
  • random_state: For reproducible results

Component Selection Strategies

  • Fixed Number: n_components=2 (for visualization)
  • Variance Threshold: n_components=0.95 (keep 95% variance)
  • Elbow Method: Find "knee" in scree plot
  • Cross-Validation: Optimize for downstream task
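The variance-threshold strategy is built into scikit-learn: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# A float n_components is interpreted as a variance threshold
pca = PCA(n_components=0.95).fit(X_std)
print(pca.n_components_)                    # PCs needed to retain 95% variance
print(pca.explained_variance_ratio_.sum())  # ≥ 0.95
```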

Advanced Configuration

  • tol: Tolerance for singular value computation
  • iterated_power: Number of iterations for randomized solver
  • n_oversamples: Additional samples for randomized solver

Performance Metrics

  • Explained Variance (2 PCs): 63.3%
  • Reconstruction Error (2 PCs): 36.7%
  • Dimensionality Reduction: 30D → 2D (93.3% reduction)
  • Computational Speedup: 9x faster training
  • Classification Accuracy Loss: 4.8% (97.2% → 92.4%)
  • PCA Fitting Time: 0.003 seconds
  • Transform Time: 0.001 seconds per sample

Tips for Success

  1. Standardization: Always standardize features before PCA
  2. Variance Threshold: Choose based on downstream task requirements
  3. Interpretability: First PCs often represent meaningful patterns
  4. Incremental PCA: Use for large datasets that don't fit in memory
  5. Kernel PCA: For non-linear dimensionality reduction
  6. Sparse PCA: When interpretability with few features is important
  7. Visualization: Use 2-3 PCs for visualization, more for ML

Example Scenarios

Scenario 1: Visualization (2 PCs)

  • Purpose: Visualize 30D cancer data in 2D
  • Variance Retained: 63.3%
  • Result: Clear separation between malignant and benign
  • Use Case: Exploratory analysis, presentations

Scenario 2: Noise Reduction (10 PCs)

  • Purpose: Remove noisy features for ML model
  • Variance Retained: 97.2%
  • Accuracy Loss: 0.4% (minimal)
  • Speedup: 2.4x faster training
  • Use Case: Improve model generalization

Scenario 3: Feature Engineering (5 PCs)

  • Purpose: Create compact feature set
  • Variance Retained: 84.8%
  • Accuracy Loss: 1.6%
  • Speedup: 4.8x faster
  • Use Case: Real-time inference, embedded systems

Troubleshooting

Problem: Low variance explained with few components

  • Solution: Dataset may need more PCs, or use non-linear methods (t-SNE, UMAP)

Problem: Poor downstream task performance

  • Solution: Increase n_components, try supervised dimensionality reduction (LDA)

Problem: First PC dominated by single feature

  • Solution: Ensure proper feature scaling, check for outliers

Problem: Negative values in transformed data

  • Solution: Normal behavior - PCA centers data at origin

Problem: Slow computation on large dataset

  • Solution: Use randomized SVD solver, or IncrementalPCA
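`IncrementalPCA` fits in fixed memory by consuming the data in mini-batches via `partial_fit`. A sketch on synthetic stand-in data (the random matrix here is only an illustration, not the cancer dataset):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X_big = rng.normal(size=(10_000, 30))  # stand-in for data too large for memory

ipca = IncrementalPCA(n_components=2, batch_size=500)
for batch in np.array_split(X_big, 20):  # stream the data in chunks
    ipca.partial_fit(batch)

X_2d = ipca.transform(X_big)
print(X_2d.shape)  # (10000, 2)
```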

PCA vs Other Dimensionality Reduction Methods

| Method      | Type       | Speed  | Preserves        | Best For           |
| ----------- | ---------- | ------ | ---------------- | ------------------ |
| PCA         | Linear     | Fast   | Global structure | General use        |
| t-SNE       | Non-linear | Slow   | Local structure  | Visualization      |
| UMAP        | Non-linear | Medium | Both             | Visualization + ML |
| LDA         | Supervised | Fast   | Class separation | Classification     |
| Autoencoder | Non-linear | Slow   | Complex patterns | Deep learning      |
| ICA         | Linear     | Medium | Independence     | Signal processing  |

Understanding Principal Components

PC1 (44.3% variance)

  • Interpretation: Overall tumor size and malignancy severity
  • High values: Larger, more severe tumors
  • Key features: Mean radius, perimeter, area, concavity

PC2 (19.0% variance)

  • Interpretation: Tumor texture and shape irregularity
  • High values: More irregular, textured tumors
  • Key features: Mean texture, smoothness, symmetry

Next Steps

After performing PCA, you can:

  • Use reduced features for faster ML training
  • Create visualizations for exploratory analysis
  • Apply to new data using fitted PCA transformer
  • Combine with clustering (K-Means on PCs)
  • Use for anomaly detection (reconstruction error)
  • Try non-linear alternatives (Kernel PCA, t-SNE, UMAP)
  • Perform feature selection based on loadings
  • Build compressed data pipelines for production
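As one example of the anomaly-detection idea above, per-sample reconstruction error works as an anomaly score: points the low-dimensional model cannot reconstruct well are flagged. A sketch (the 5-component fit and 1% cutoff are illustrative choices, not values from this report):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA(n_components=5).fit(X_std)

# Per-sample reconstruction error as an anomaly score
X_rec = pca.inverse_transform(pca.transform(X_std))
scores = np.mean((X_std - X_rec) ** 2, axis=1)

threshold = np.quantile(scores, 0.99)  # flag the worst 1% as potential outliers
print((scores > threshold).sum())
```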
