Matrix Factorization (sklearn TruncatedSVD)
Matrix factorization using sklearn's TruncatedSVD for collaborative filtering. A simpler and faster alternative to the scipy SVD implementation, with sklearn's familiar API.
When to use:
- Need fast collaborative filtering
- Have implicit feedback (clicks, views, purchases)
- Want simpler configuration than full SVD
- Prefer sklearn's API and ecosystem
Strengths: Very fast, simple to use, works well with implicit feedback, sklearn integration
Weaknesses: Less sophisticated than full SVD, no built-in regularization, fewer tuning options
How it Works
TruncatedSVD performs dimensionality reduction on the user-item interaction matrix using Singular Value Decomposition. It decomposes the sparse interaction matrix into three matrices, keeping only the top k components (latent factors).
Unlike the full SVD implementation, TruncatedSVD is optimized for sparse matrices and uses randomized algorithms for faster computation. It's particularly effective for implicit feedback where you have presence/absence of interactions rather than ratings.
Key Concept: Items and users that co-occur frequently in the interaction matrix will have similar latent factor representations, making them good candidates for recommendation.
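As a minimal sketch of the decomposition step, the following fits TruncatedSVD on a toy implicit-feedback matrix (the matrix values and k=2 are illustrative, not recommended settings):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy implicit-feedback matrix: rows = users, cols = items, 1 = interaction
interactions = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float))

svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(interactions)   # shape (n_users, k)
item_factors = svd.components_.T                 # shape (n_items, k)

# The dot product of the factor matrices approximates the original matrix;
# high reconstructed scores for unseen cells are recommendation candidates.
scores = user_factors @ item_factors.T
print(user_factors.shape, item_factors.shape)  # (4, 2) (4, 2)
```

Users and items that co-occur end up close together in this shared latent space, which is the property the recommender exploits.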
Parameters
Feature Configuration
Feature Columns (required) List of columns to use; must include user_id and item_id, and may optionally include rating.
User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.
Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.
Rating Column (optional) Name of the column containing ratings. If provided, uses rating weights. If not provided, treats all interactions equally (implicit feedback).
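A sketch of how these three columns could be assembled into the sparse interaction matrix (the column names match the defaults above; the toy data is illustrative):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["a", "b", "a", "c"],
    "rating":  [5.0, 3.0, 4.0, 2.0],
})

# Map raw identifiers to contiguous row/column indices
users = pd.Categorical(df["user_id"])
items = pd.Categorical(df["item_id"])

# With a rating column, use its values as weights; without one,
# every interaction counts equally (implicit feedback)
values = df["rating"].to_numpy() if "rating" in df else np.ones(len(df))

matrix = csr_matrix(
    (values, (users.codes, items.codes)),
    shape=(len(users.categories), len(items.categories)),
)
print(matrix.toarray())
```

The resulting matrix is what TruncatedSVD is fitted on.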
Model-Specific Parameters
Number of Components (default: 50) Number of latent components (dimensions) to keep after decomposition. Controls model capacity.
- 10-30: Minimal model, very fast, may underfit
- 30-50: Good balance for most use cases (default)
- 50-100: More detailed patterns, slower
- 100+: For very large, complex datasets
Top-K Recommendations (default: 10) Number of items to recommend for each user.
- 5-10: Focused recommendations
- 10-20: Standard recommendation lists
- 20-50: For exploration and discovery
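Together, these two parameters drive the core recommendation step: fit with the chosen number of components, then rank items per user and keep the top k. A sketch with synthetic data (the sizes and seeds are arbitrary):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic implicit-feedback matrix: 20 users x 50 items, ~10% density
rng = np.random.default_rng(0)
interactions = csr_matrix((rng.random((20, 50)) < 0.1).astype(float))

svd = TruncatedSVD(n_components=5, random_state=0)  # "Number of Components"
user_factors = svd.fit_transform(interactions)
scores = user_factors @ svd.components_  # (n_users, n_items)

top_k = 10  # "Top-K Recommendations"
recs = np.argsort(-scores, axis=1)[:, :top_k]  # highest-scoring items first
print(recs.shape)  # (20, 10)
```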
Configuration Tips
Dataset Size Considerations
- Small (<10k interactions): Use 20-30 components, may not have enough data
- Medium (10k-100k): Use 30-50 components, ideal range
- Large (100k-1M): Use 50-80 components, good performance
- Very Large (>1M): Use 80-100 components, excellent scaling
Parameter Tuning Guidance
- Start with defaults: 50 components works well for most cases
- Increase components: If recommendations seem too generic
- Decrease components: If training is slow or results are noisy
- Monitor metrics: Track Hit Rate@K, NDCG, and Precision@K
- Compare with baselines: Test against popularity-based recommendations
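Hit Rate@K, one of the metrics suggested above, is simple to compute by hand: for each user, hold out one known interaction and check whether it appears in the top-k list. A sketch (the `recs` and `held_out` data are hypothetical):

```python
def hit_rate_at_k(recs, held_out, k=10):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for user, item in held_out.items() if item in recs[user][:k])
    return hits / len(held_out)

recs = {0: [3, 1, 7], 1: [2, 5, 4]}  # top-ranked item ids per user
held_out = {0: 1, 1: 9}              # one withheld interaction per user
print(hit_rate_at_k(recs, held_out, k=3))  # 0.5
```

Running the same metric over a popularity-only recommender gives the baseline to beat.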
When to Choose This Over Alternatives
- vs. scipy SVD: Choose this for faster training and implicit feedback
- vs. Item-Based KNN: Choose this for discovering latent patterns vs. direct similarity
- vs. User-Based KNN: Choose this for better scalability
- vs. Content-Based: Choose this when you have sufficient interaction data
- vs. BERT4Rec: Choose this for simpler, faster, non-sequential recommendations
Common Issues and Solutions
Cold Start Problem
Issue: Cannot generate recommendations for new users or for newly added items. Solution:
- Use popularity-based recommendations for new users
- Use content-based features for new items
- Combine with Hybrid model
- Collect quick feedback through initial questionnaire
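The popularity fallback from the first bullet can be sketched as a thin wrapper around the factor lookup (all names here are hypothetical; the factors and popularity ranking are toy data):

```python
import numpy as np

def recommend(user_id, user_factors, item_factors, user_index, popular_items, k=3):
    """Score items via latent factors; fall back to popular items for unseen users."""
    if user_id not in user_index:
        return list(popular_items[:k])  # cold-start fallback
    scores = user_factors[user_index[user_id]] @ item_factors.T
    return np.argsort(-scores)[:k].tolist()

user_factors = np.array([[1.0, 0.0], [0.0, 1.0]])
item_factors = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
user_index = {"u1": 0, "u2": 1}   # users seen at training time
popular = [2, 0, 1]               # hypothetical popularity ranking

print(recommend("u1", user_factors, item_factors, user_index, popular))       # [0, 2, 1]
print(recommend("new_user", user_factors, item_factors, user_index, popular)) # [2, 0, 1]
```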
Insufficient Interactions
Issue: Too few interactions lead to poor recommendations. Solution:
- Reduce number of components (try 20-30)
- Combine multiple interaction types (views, clicks, purchases)
- Use implicit feedback to increase data density
- Consider switching to content-based approach
All Recommendations Similar
Issue: Model only recommends popular items or similar items. Solution:
- Increase number of components (try 70-100)
- Apply diversity post-processing
- Use hybrid approach combining multiple signals
- Filter out already-interacted items
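Filtering already-interacted items (the last bullet) is a cheap fix that often improves diversity immediately. One way to sketch it is to mask seen items to negative infinity before ranking (toy data):

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_k_unseen(scores, interactions, k):
    """Set scores of already-interacted items to -inf, then rank the rest."""
    masked = scores.copy()
    rows, cols = interactions.nonzero()
    masked[rows, cols] = -np.inf
    return np.argsort(-masked, axis=1)[:, :k]

interactions = csr_matrix(np.array([[1.0, 0.0, 0.0],
                                    [0.0, 1.0, 0.0]]))
scores = np.array([[0.9, 0.5, 0.4],
                   [0.2, 0.8, 0.6]])
print(top_k_unseen(scores, interactions, k=2))  # [[1 2], [2 0]]
```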
Poor Performance on Test Set
Issue: Low precision or hit rate metrics. Solution:
- Ensure proper temporal split (train on past, test on future)
- Check data quality (duplicates, invalid interactions)
- Increase number of components
- Consider that implicit feedback is inherently noisy
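A temporal split (the first bullet) means sorting interactions by time and holding out the most recent slice, rather than sampling randomly. A sketch with a hypothetical interaction log:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2"],
    "item_id": ["a", "b", "c", "a", "c"],
    "timestamp": pd.to_datetime([
        "2024-01-01", "2024-01-05", "2024-02-01", "2024-02-10", "2024-03-01",
    ]),
})

# Train on the earliest 80% of interactions, test on the most recent 20%
df = df.sort_values("timestamp")
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))  # 4 1
```

A random split would leak future behavior into training and inflate the metrics.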
Slow Inference
Issue: Generating recommendations takes too long. Solution:
- Reduce number of components
- Pre-compute item similarities
- Cache user representations
- Use approximate nearest neighbor search
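Caching user representations (the third bullet) can be as simple as computing the factor matrices once at training time and memoizing per-user rankings; the cache structure below is a hypothetical sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
interactions = csr_matrix((rng.random((100, 40)) < 0.1).astype(float))

svd = TruncatedSVD(n_components=8, random_state=1)
# Precompute once at training time, not per request
user_factors = svd.fit_transform(interactions)
item_factors = svd.components_.T

cache = {}  # hypothetical per-user recommendation cache

def recommend(user_idx, k=10):
    if user_idx not in cache:
        cache[user_idx] = np.argsort(-(user_factors[user_idx] @ item_factors.T))[:k]
    return cache[user_idx]

first = recommend(3)
again = recommend(3)  # served from cache, no recomputation
```

In production the cache would need invalidation when the model is retrained; that is omitted here.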
Example Use Cases
E-commerce Purchase History
Scenario: Online store with 500k users and 2M purchase interactions
Configuration:
- 60 components
- Top-10 recommendations
- No rating column (implicit feedback from purchases)
Why: Large dataset with implicit feedback, need fast recommendations
Content Platform Views
Scenario: Video platform with 1M users viewing 100k videos
Configuration:
- 80 components
- Top-20 recommendations
- Use view count as implicit rating weight
Why: Very large dataset with implicit feedback, need diversity
Mobile App Engagement
Scenario: Mobile app with 200k users and item click data
Configuration:
- 40 components
- Top-15 recommendations
- Binary interaction (clicked or not)
Why: Medium dataset, fast recommendations needed for mobile, implicit feedback