
Matrix Factorization (sklearn TruncatedSVD)

Matrix factorization using sklearn's TruncatedSVD for collaborative filtering. A simpler, faster alternative to the scipy SVD implementation, with sklearn's API.

When to use:

  • Need fast collaborative filtering
  • Have implicit feedback (clicks, views, purchases)
  • Want simpler configuration than full SVD
  • Prefer sklearn's API and ecosystem

Strengths: Very fast, simple to use, works well with implicit feedback, sklearn integration.

Weaknesses: Less sophisticated than full SVD, no built-in regularization, fewer tuning options.

How it Works

TruncatedSVD performs dimensionality reduction on the user-item interaction matrix using Singular Value Decomposition. It decomposes the sparse interaction matrix into the product of three matrices (U, Σ, and Vᵀ), keeping only the top k singular values and their corresponding vectors (latent factors).

Unlike the full SVD implementation, TruncatedSVD is optimized for sparse matrices and uses randomized algorithms for faster computation. It's particularly effective for implicit feedback where you have presence/absence of interactions rather than ratings.

Key Concept: Items and users that co-occur frequently in the interaction matrix will have similar latent factor representations, making them good candidates for recommendation.
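The decomposition above can be sketched on a toy matrix. This is an illustrative example, not the product's internal code; the interaction data and shapes are made up:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy implicit-feedback matrix: rows are users, columns are items,
# 1.0 marks an observed interaction (values here are illustrative).
rows = [0, 0, 1, 1, 2, 3, 3]
cols = [0, 2, 0, 1, 2, 1, 3]
interactions = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(interactions)  # shape (n_users, k)
item_factors = svd.components_.T                # shape (n_items, k)

# Reconstructed scores approximate the original interactions; a higher
# score means a stronger predicted affinity between user and item.
scores = user_factors @ item_factors.T
```

Users and items with similar interaction patterns end up close together in the k-dimensional factor space, which is what makes the reconstructed scores useful for recommendation.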

Parameters

Feature Configuration

Feature Columns (required) List of columns to use; must include user_id and item_id, and may optionally include rating.

User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.

Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.

Rating Column (optional) Name of the column containing ratings. If provided, interactions are weighted by their ratings. If not provided, all interactions are treated equally (implicit feedback).
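One plausible way these three columns map onto a sparse interaction matrix, sketched with a hypothetical interaction log (the DataFrame contents and the category-code encoding are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical interaction log using the configured column names.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["a", "b", "a", "c"],
    "rating":  [5.0, 3.0, 4.0, 2.0],
})

# Map string ids to contiguous row/column indices.
user_idx = df["user_id"].astype("category").cat.codes
item_idx = df["item_id"].astype("category").cat.codes

# With a rating column, entries are weighted by rating; without one,
# every interaction counts the same (implicit feedback).
values = df["rating"].to_numpy() if "rating" in df.columns else np.ones(len(df))
matrix = csr_matrix((values, (user_idx, item_idx)))
```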

Model-Specific Parameters

Number of Components (default: 50) Number of latent components (dimensions) to keep after decomposition. Controls model capacity.

  • 10-30: Minimal model, very fast, may underfit
  • 30-50: Good balance for most use cases (default)
  • 50-100: More detailed patterns, slower
  • 100+: For very large, complex datasets
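One way to gauge whether a component count is enough is to check how much of the matrix's variance the kept components capture. A sketch on a synthetic sparse matrix (the matrix and the component counts are placeholders):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a real interaction matrix.
X = sparse_random(200, 100, density=0.05, random_state=0, format="csr")

# More components capture more variance, at the cost of slower
# training and a larger model.
for k in (10, 30, 50):
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(X)
    print(k, f"{svd.explained_variance_ratio_.sum():.3f}")
```

If the captured variance plateaus as k grows, extra components are mostly fitting noise and a smaller model will generalize as well.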

Top-K Recommendations (default: 10) Number of items to recommend for each user.

  • 5-10: Focused recommendations
  • 10-20: Standard recommendation lists
  • 20-50: For exploration and discovery
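Producing a top-K list from the reconstructed scores typically also masks items the user has already interacted with. A minimal sketch (function name, array shapes, and data are illustrative, not the product's API):

```python
import numpy as np

def top_k_recommendations(scores, seen, k=10):
    """Rank unseen items per user by predicted score.

    scores, seen: dense (n_users, n_items) arrays; seen > 0 marks
    items the user already interacted with (excluded from results)."""
    masked = np.where(seen > 0, -np.inf, scores)
    return np.argsort(-masked, axis=1)[:, :k]

scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.3]])
seen = np.array([[1, 0, 0],
                 [0, 1, 0]])
recs = top_k_recommendations(scores, seen, k=2)
# user 0 already saw item 0, so its best unseen items are 2, then 1
```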

Configuration Tips

Dataset Size Considerations

  • Small (<10k interactions): Use 20-30 components, may not have enough data
  • Medium (10k-100k): Use 30-50 components, ideal range
  • Large (100k-1M): Use 50-80 components, good performance
  • Very Large (>1M): Use 80-100 components, excellent scaling

Parameter Tuning Guidance

  1. Start with defaults: 50 components works well for most cases
  2. Increase components: If recommendations seem too generic
  3. Decrease components: If training is slow or results are noisy
  4. Monitor metrics: Track Hit Rate@K, NDCG, and Precision@K
  5. Compare with baselines: Test against popularity-based recommendations
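Hit Rate@K, one of the metrics mentioned above, can be computed with a few lines. The dictionary format here is an assumption for illustration; a real evaluation would hold out one future interaction per user:

```python
def hit_rate_at_k(recommended, held_out, k=10):
    """Fraction of users whose held-out item appears in their top-k list.

    recommended: dict user -> ranked list of item ids
    held_out:    dict user -> the single item withheld for evaluation"""
    hits = sum(item in recommended[user][:k] for user, item in held_out.items())
    return hits / len(held_out)

recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
held_out = {"u1": "b", "u2": "x"}
rate = hit_rate_at_k(recommended, held_out, k=3)  # u1 is a hit, u2 is not
```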

When to Choose This Over Alternatives

  • vs. scipy SVD: Choose this for faster training and implicit feedback
  • vs. Item-Based KNN: Choose this for discovering latent patterns vs. direct similarity
  • vs. User-Based KNN: Choose this for better scalability
  • vs. Content-Based: Choose this when you have sufficient interaction data
  • vs. BERT4Rec: Choose this for simpler, faster, non-sequential recommendations

Common Issues and Solutions

Cold Start Problem

Issue: Cannot recommend to new users or new items.
Solution:

  • Use popularity-based recommendations for new users
  • Use content-based features for new items
  • Combine with Hybrid model
  • Collect quick feedback through initial questionnaire

Insufficient Interactions

Issue: Too few interactions lead to poor recommendations.
Solution:

  • Reduce number of components (try 20-30)
  • Combine multiple interaction types (views, clicks, purchases)
  • Use implicit feedback to increase data density
  • Consider switching to content-based approach

All Recommendations Similar

Issue: Model only recommends popular items or similar items.
Solution:

  • Increase number of components (try 70-100)
  • Apply diversity post-processing
  • Use hybrid approach combining multiple signals
  • Filter out already-interacted items

Poor Performance on Test Set

Issue: Low precision or hit rate metrics.
Solution:

  • Ensure proper temporal split (train on past, test on future)
  • Check data quality (duplicates, invalid interactions)
  • Increase number of components
  • Consider that implicit feedback is inherently noisy
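A temporal split, the first point above, can be done in a few lines. The log and the cutoff date here are hypothetical; the point is to split on time, never at random, so the model trains on the past and is evaluated on the future:

```python
import pandas as pd

# Hypothetical interaction log with a timestamp column.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "item_id": ["a", "b", "a", "c"],
    "ts": pd.to_datetime(["2024-01-01", "2024-02-01",
                          "2024-01-15", "2024-03-01"]),
})

cutoff = pd.Timestamp("2024-02-15")   # illustrative cutoff
train = df[df["ts"] < cutoff]         # everything before the cutoff
test = df[df["ts"] >= cutoff]         # only future interactions
```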

Slow Inference

Issue: Generating recommendations takes too long.
Solution:

  • Reduce number of components
  • Pre-compute item similarities
  • Cache user representations
  • Use approximate nearest neighbor search
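Precomputation, the second point above, can be sketched as a single up-front matrix multiply. The factor matrices here are random placeholders standing in for a fitted model:

```python
import numpy as np

# Hypothetical factors from a fitted TruncatedSVD model.
rng = np.random.default_rng(0)
user_factors = rng.normal(size=(1000, 50))   # (n_users, k)
item_factors = rng.normal(size=(500, 50))    # (n_items, k)

# One matrix multiply up front; afterwards each request is a cheap
# row slice plus a partial sort instead of a full dot product.
all_scores = user_factors @ item_factors.T   # (n_users, n_items)

def recommend(user_idx, k=10):
    row = all_scores[user_idx]
    top = np.argpartition(-row, k)[:k]       # top-k, unordered
    return top[np.argsort(-row[top])]        # sort just those k
```

The trade-off is memory: the dense score matrix grows as n_users × n_items, so for very large catalogs approximate nearest neighbor search over the item factors is the usual alternative.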

Example Use Cases

E-commerce Purchase History

Scenario: Online store with 500k users and 2M purchase interactions.
Configuration:

  • 60 components
  • Top-10 recommendations
  • No rating column (implicit feedback from purchases)

Why: Large dataset with implicit feedback, need fast recommendations.

Content Platform Views

Scenario: Video platform with 1M users viewing 100k videos.
Configuration:

  • 80 components
  • Top-20 recommendations
  • Use view count as implicit rating weight

Why: Very large dataset with implicit feedback, need diversity.

Mobile App Engagement

Scenario: Mobile app with 200k users and item click data.
Configuration:

  • 40 components
  • Top-15 recommendations
  • Binary interaction (clicked or not)

Why: Medium dataset, fast recommendations needed for mobile, implicit feedback.
