Dimensionality Reduction - PCA
Reduce feature dimensions using PCA on the Wisconsin Breast Cancer dataset
This case study demonstrates Principal Component Analysis (PCA) for dimensionality reduction on the Wisconsin Breast Cancer dataset. PCA transforms high-dimensional data into a lower-dimensional space while preserving maximum variance, making it invaluable for visualization, noise reduction, and computational efficiency.
Dataset: Wisconsin Breast Cancer
- Source: Kaggle (Breast Cancer Wisconsin Diagnostic)
- Type: Dimensionality reduction
- Size: 569 samples
- Original Features: 30 measurements (radius, texture, perimeter, area, smoothness, etc.)
- Target Dimensions: Reduce to 2-3 components
- Classes: Malignant (212), Benign (357)
Model Configuration
{
  "model": "pca",
  "category": "dimensionality_reduction",
  "model_config": {
    "n_components": 2,
    "whiten": false,
    "svd_solver": "auto",
    "random_state": 42
  }
}
Dimensionality Reduction Results
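The results below can be reproduced by fitting PCA with the configuration above. A minimal sketch, assuming scikit-learn's bundled copy of the Wisconsin Breast Cancer data rather than the Kaggle download:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the 569 x 30 Wisconsin Breast Cancer dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Mirror the model_config above
pca = PCA(n_components=2, whiten=False, svd_solver="auto", random_state=42)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (569, 2)
print(pca.explained_variance_ratio_)  # variance captured by each PC
```

The fitted `pca` object can then be reused to project new samples with `pca.transform(...)`.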
Variance Explained
How much information is retained in each component:
No plot data available
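The per-component variance spectrum behind this plot can be computed directly. A sketch, again assuming the scikit-learn copy of the dataset, that also answers how many components a 95% variance threshold keeps:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all 30 components to inspect the full variance spectrum
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components needed to retain 95% of the variance
n_95 = np.argmax(cumulative >= 0.95) + 1
print(n_95)
```

Plotting `cumulative` against the component index gives the scree/cumulative-variance curve this section describes.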
2D Visualization
Data projected onto first two principal components:
No plot data available
3D Visualization
Three-dimensional representation capturing 72.7% of the variance:
No plot data available
Feature Contributions to PC1
Which original features contribute most to the first component:
No plot data available
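The feature contributions shown here are the loadings stored in the fitted model's `components_` attribute. A sketch that lists the five features with the largest absolute loading on PC1 (assuming the scikit-learn copy of the dataset):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_scaled)

# components_[0] holds the loading of each original feature on PC1
pc1 = pca.components_[0]
top = np.argsort(np.abs(pc1))[::-1][:5]
for i in top:
    print(data.feature_names[i], round(pc1[i], 3))
```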
Reconstruction Error
Error when reconstructing original data from reduced dimensions:
No plot data available
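Reconstruction error can be measured by mapping the reduced data back to 30 dimensions with `inverse_transform` and comparing against the standardized originals. A sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)

# Map back to 30-D and measure what was lost; on standardized data this
# mean squared error equals the fraction of variance NOT explained
X_back = pca.inverse_transform(X_2d)
error = np.mean((X_scaled - X_back) ** 2)
print(round(error, 3))
```

With 2 components this lands near 0.37, matching the 36.7% reconstruction error reported under Performance Metrics below.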
Classification Performance After PCA
Downstream task performance with reduced dimensions:
No plot data available
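The downstream comparison can be sketched with cross-validation on a full-feature pipeline versus a PCA-reduced one. The choice of LogisticRegression here is an illustrative assumption, not necessarily the classifier used for the reported numbers:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Baseline on all 30 features vs. a 2-component PCA pipeline
full = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
reduced = make_pipeline(StandardScaler(),
                        PCA(n_components=2, random_state=42),
                        LogisticRegression(max_iter=5000))

acc_full = cross_val_score(full, X, y, cv=5).mean()
acc_2d = cross_val_score(reduced, X, y, cv=5).mean()
print(round(acc_full, 3), round(acc_2d, 3))
```

Keeping PCA inside the pipeline ensures it is refit on each training fold, avoiding leakage from the held-out data.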
Computational Speedup
Training time reduction with fewer features:
No plot data available
Common Use Cases
- Data Visualization: Plot high-dimensional data in 2D/3D
- Noise Reduction: Remove less important variance
- Feature Extraction: Create new features for ML models
- Compression: Reduce storage and transmission costs
- Preprocessing: Remove multicollinearity before regression
- Image Compression: Reduce image storage (eigenfaces)
- Anomaly Detection: Detect outliers in reduced space
- Exploratory Analysis: Understand data structure and patterns
Key Settings
Essential Parameters
- n_components: Number of components to keep (int or float for variance threshold)
- svd_solver: "auto", "full", "arpack", "randomized"
- whiten: Whether to rescale components to unit variance (decorrelated, equal-scale outputs)
- random_state: For reproducible results
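One detail worth showing: a float n_components is interpreted as a variance threshold rather than a component count. A sketch (float thresholds require the "full" solver, or "auto", which resolves to it on data of this size):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of PCs explaining at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)  # number of components actually kept
```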
Component Selection Strategies
- Fixed Number: n_components=2 (for visualization)
- Variance Threshold: n_components=0.95 (keep 95% variance)
- Elbow Method: Find "knee" in scree plot
- Cross-Validation: Optimize for downstream task
Advanced Configuration
- tol: Tolerance for singular value computation
- iterated_power: Number of iterations for randomized solver
- n_oversamples: Additional samples for randomized solver
Performance Metrics
- Explained Variance (2 PCs): 63.3%
- Reconstruction Error (2 PCs): 36.7%
- Dimensionality Reduction: 30D → 2D (93.3% reduction)
- Computational Speedup: 9x faster training
- Classification Accuracy Loss: 4.8% (97.2% → 92.4%)
- PCA Fitting Time: 0.003 seconds
- Transform Time: 0.001 seconds per sample
Tips for Success
- Standardization: Always standardize features before PCA
- Variance Threshold: Choose based on downstream task requirements
- Interpretability: First PCs often represent meaningful patterns
- Incremental PCA: Use for large datasets that don't fit in memory
- Kernel PCA: For non-linear dimensionality reduction
- Sparse PCA: When interpretability with few features is important
- Visualization: Use 2-3 PCs for visualization, more for ML
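The standardization tip above is easy to demonstrate: without scaling, the few features measured in large units (area-type measurements) swamp PC1. A sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)

# Without standardization, large-scale features dominate the first component
raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_[0])     # near 1: one scale swamps the rest
print(scaled.explained_variance_ratio_[0])  # balanced contribution across features
```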
Example Scenarios
Scenario 1: Visualization (2 PCs)
- Purpose: Visualize 30D cancer data in 2D
- Variance Retained: 63.3%
- Result: Clear separation between malignant and benign
- Use Case: Exploratory analysis, presentations
Scenario 2: Noise Reduction (10 PCs)
- Purpose: Remove noisy features for ML model
- Variance Retained: 97.2%
- Accuracy Loss: 0.4% (minimal)
- Speedup: 2.4x faster training
- Use Case: Improve model generalization
Scenario 3: Feature Engineering (5 PCs)
- Purpose: Create compact feature set
- Variance Retained: 84.8%
- Accuracy Loss: 1.6%
- Speedup: 4.8x faster
- Use Case: Real-time inference, embedded systems
Troubleshooting
Problem: Low variance explained with few components
- Solution: Dataset may need more PCs, or use non-linear methods (t-SNE, UMAP)
Problem: Poor downstream task performance
- Solution: Increase n_components, try supervised dimensionality reduction (LDA)
Problem: First PC dominated by single feature
- Solution: Ensure proper feature scaling, check for outliers
Problem: Negative values in transformed data
- Solution: Expected behavior - PCA centers the data at the origin, so component scores take both signs
Problem: Slow computation on large dataset
- Solution: Use randomized SVD solver, or IncrementalPCA
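The IncrementalPCA fallback mentioned above fits in mini-batches, so only one batch needs to be in memory at a time. A sketch of the streaming API (the batch count of 6 is arbitrary for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feed the data batch by batch, as you would when it doesn't fit in memory
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X_scaled, 6):
    ipca.partial_fit(batch)

X_2d = ipca.transform(X_scaled)
print(X_2d.shape)
```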
PCA vs Other Dimensionality Reduction Methods
| Method | Type | Speed | Preserves | Best For |
|---|---|---|---|---|
| PCA | Linear | Fast | Global structure | General use |
| t-SNE | Non-linear | Slow | Local structure | Visualization |
| UMAP | Non-linear | Medium | Both | Visualization + ML |
| LDA | Supervised | Fast | Class separation | Classification |
| Autoencoder | Non-linear | Slow | Complex patterns | Deep learning |
| ICA | Linear | Medium | Independence | Signal processing |
Understanding Principal Components
PC1 (44.3% variance)
- Interpretation: Overall tumor size and malignancy severity
- High values: Larger, more severe tumors
- Key features: Mean radius, perimeter, area, concavity
PC2 (19.0% variance)
- Interpretation: Tumor texture and shape irregularity
- High values: More irregular, textured tumors
- Key features: Mean texture, smoothness, symmetry
Next Steps
After performing PCA, you can:
- Use reduced features for faster ML training
- Create visualizations for exploratory analysis
- Apply to new data using fitted PCA transformer
- Combine with clustering (K-Means on PCs)
- Use for anomaly detection (reconstruction error)
- Try non-linear alternatives (Kernel PCA, t-SNE, UMAP)
- Perform feature selection based on loadings
- Build compressed data pipelines for production
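As one example of the steps above, clustering in the reduced space is a two-line extension of the fitted transform. A sketch combining PCA with K-Means (cluster count of 2 chosen to mirror the two diagnosis classes):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, y = load_breast_cancer(return_X_y=True)
X_2d = PCA(n_components=2, random_state=42).fit_transform(
    StandardScaler().fit_transform(X))

# Cluster in the 2-D space; two clusters roughly track malignant vs. benign
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_2d)
print(labels.shape)
```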