Embeddings Similarity Recommendation

Semantic similarity recommendation using sentence transformers. Creates embeddings for item descriptions and recommends similar items based on cosine similarity.

When to use:

  • Have rich textual item descriptions
  • Need semantic understanding beyond keyword matching
  • Want to capture meaning and context
  • Multi-language content

Strengths:

  • Understands semantics and context
  • Handles synonyms and paraphrasing
  • Works across languages
  • Dense representations

Weaknesses:

  • Requires good text data
  • Computationally intensive; needs sufficient GPU/CPU resources
  • Not personalized without interaction data

How it Works

Embeddings Similarity uses pre-trained transformer models (like BERT, Sentence-BERT) to convert item descriptions into dense vector representations (embeddings) that capture semantic meaning.

The process:

  1. Encode Items: Each item description is converted to a fixed-size embedding vector (e.g., 384 dimensions)
  2. User Profile: Create user embeddings by aggregating embeddings of items they've interacted with
  3. Similarity Search: Find items with embeddings most similar (cosine similarity) to user profile
  4. Ranking: Return top-K most similar items
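The four steps above can be sketched in plain Python. The tiny hand-written 3-dimensional vectors below stand in for real model output (a real pipeline would obtain, e.g., 384-dimensional embeddings from a sentence transformer); `item_vecs`, `user_profile`, and `recommend` are illustrative names, not part of any library API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Step 1: item embeddings (toy 3-d vectors in place of real model output)
item_vecs = {
    "phone":  [0.9, 0.1, 0.0],
    "mobile": [0.8, 0.2, 0.1],
    "hotel":  [0.1, 0.9, 0.3],
}

# Step 2: user profile = mean of embeddings of items the user interacted with
def user_profile(history):
    dims = len(next(iter(item_vecs.values())))
    return [sum(item_vecs[i][d] for i in history) / len(history)
            for d in range(dims)]

# Steps 3-4: rank unseen items by cosine similarity to the profile, take top-K
def recommend(history, k=2):
    profile = user_profile(history)
    scores = {i: cosine(profile, v)
              for i, v in item_vecs.items() if i not in history}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend(["phone"], k=2))  # "mobile" ranks above "hotel"
```

A user who interacted only with "phone" gets "mobile" first, because its vector points in nearly the same direction as the profile.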

Key Advantage: Unlike TF-IDF which only matches keywords, embeddings understand that "smartphone" and "mobile phone" are semantically similar, or that "luxury hotel" and "5-star accommodation" convey similar meaning.

Parameters

Feature Configuration

Feature Columns (required) List of columns to use; must include the user, item, and content columns described below.

User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.

Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.

Content Column (default: "description", required) Name of the column containing item descriptions or text content. This is encoded into embeddings.

  • Product descriptions, article text, movie plots, job descriptions
  • Longer, richer text generally produces better embeddings
  • Can concatenate multiple fields (title + description + metadata)
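Concatenating fields into the content column can be done with a small helper. This is a sketch; the field names (`title`, `description`, `tags`) are hypothetical, not part of any schema this component prescribes.

```python
# Sketch: build a richer content field by concatenating title, description,
# and metadata. Field names here are hypothetical examples.
def build_content(item):
    parts = [item.get("title", ""),
             item.get("description", ""),
             " ".join(item.get("tags", []))]
    return " ".join(p for p in parts if p).strip()

item = {"title": "Trail Runner X",
        "description": "Lightweight running shoe with grippy sole.",
        "tags": ["running", "outdoor"]}
print(build_content(item))
```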

Model-Specific Parameters

Embedding Model (default: "sentence-transformers/all-MiniLM-L6-v2") Name of the pre-trained sentence transformer model from HuggingFace.

Popular Models:

  • all-MiniLM-L6-v2: Fast, 384 dimensions, general purpose (default)
  • all-mpnet-base-v2: Better quality, 768 dimensions, slower
  • multi-qa-mpnet-base-dot-v1: Optimized for question-answering
  • all-MiniLM-L12-v2: Balanced speed and quality
  • paraphrase-multilingual-mpnet-base-v2: Multi-language support

Model Selection Guide:

  • Fast/Resource-constrained: all-MiniLM-L6-v2 (default)
  • Best Quality: all-mpnet-base-v2
  • Multi-language: paraphrase-multilingual-*
  • Domain-specific: Fine-tuned models for your domain

Top-K Recommendations (default: 10) Number of items to recommend for each user.

  • 5-10: Focused recommendations
  • 10-20: Standard recommendation lists
  • 20-50: For exploration and diversity

Configuration Tips

Dataset Size Considerations

  • Small (<10k items): Fast, any model works
  • Medium (10k-100k): Good performance, use default model
  • Large (100k-1M): Consider using smaller embedding model (MiniLM-L6)
  • Very Large (>1M): Use approximate nearest neighbor search (ANN)

Parameter Tuning Guidance

Choosing Embedding Model:

  1. Start with default: all-MiniLM-L6-v2 (good balance)
  2. If quality insufficient: Try all-mpnet-base-v2
  3. If multi-language: Use paraphrase-multilingual model
  4. If domain-specific: Search HuggingFace for domain models (medical, legal, scientific)
  5. If speed critical: Use all-MiniLM-L6-v2 or smaller

Optimization Strategies:

  • Pre-compute and cache item embeddings (they don't change)
  • Batch process for efficiency
  • Use GPU if available for faster encoding
  • Implement approximate nearest neighbor (ANN) for large catalogs
  • Consider quantization for memory efficiency

When to Choose This Over Alternatives

  • vs. TF-IDF: Choose this for semantic understanding and better quality
  • vs. Collaborative Filtering: Choose this for cold start and content-rich items
  • vs. Item-Based KNN: Choose this when content matters more than behavior
  • vs. Hybrid: Choose this when interaction data is very sparse
  • Best when: Rich textual content, semantic understanding needed, multi-language

Common Issues and Solutions

Cold Start Problem (New Users)

Issue: New users have no interaction history. Solution:

  • Collect initial preferences through questionnaire
  • Show popular or trending items initially
  • Use demographic or contextual signals
  • Build user profile from first few interactions
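A popularity fallback for brand-new users can be as simple as the sketch below; `popular_fallback` and the counts are illustrative.

```python
# Sketch: cold-start fallback - serve the most-interacted items to users
# with no history, until an embedding profile can be built.
def popular_fallback(interaction_counts, k=2):
    """Item ids sorted by interaction count, descending, top-k."""
    return sorted(interaction_counts,
                  key=interaction_counts.get, reverse=True)[:k]

counts = {"i1": 50, "i2": 120, "i3": 80}
print(popular_fallback(counts, k=2))  # ['i2', 'i3']
```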

Poor Content Quality

Issue: Item descriptions are too short, generic, or low-quality. Solution:

  • Enrich with additional metadata
  • Combine multiple text fields
  • Use user reviews or tags if available
  • Consider fine-tuning embeddings on your domain
  • Fall back to collaborative filtering when content is insufficient

Computational Cost

Issue: Encoding large text corpus is slow or expensive. Solution:

  • Use smaller/faster model (all-MiniLM-L6-v2)
  • Pre-compute and cache embeddings
  • Use GPU acceleration
  • Batch processing for efficiency
  • Update embeddings only for new/changed items

Memory Requirements

Issue: Storing embeddings for millions of items requires too much memory. Solution:

  • Use smaller embedding model (fewer dimensions)
  • Apply quantization (float16 or int8)
  • Use approximate nearest neighbor (FAISS, Annoy)
  • Stream processing instead of loading all at once
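Quantization can be illustrated with a minimal symmetric int8 scheme: values are scaled into [-127, 127] and rounded, cutting storage to a quarter of float32 at a small precision cost. This is a sketch of the idea, not the scheme any particular library uses.

```python
# Sketch: symmetric int8 quantization of an embedding vector.
def quantize_int8(vec):
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # avoid divide-by-zero
    q = [round(x / scale) for x in vec]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

vec = [0.12, -0.98, 0.45, 0.03]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(vec, approx))
print(q, round(max_err, 4))  # reconstruction error stays small
```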

Limited Diversity

Issue: All recommendations are semantically too similar to each other. Solution:

  • Apply diversity-aware ranking (MMR)
  • Combine with collaborative filtering (Hybrid)
  • Use category diversification
  • Add serendipity factor
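Diversity-aware ranking with MMR (Maximal Marginal Relevance) trades off relevance against redundancy with items already selected. A minimal sketch, assuming precomputed relevance scores and item vectors (`mmr`, `vecs`, and `relevance` are illustrative names):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

# MMR re-ranking: lambda_ near 1.0 favours relevance,
# near 0.0 favours diversity.
def mmr(candidates, relevance, vecs, k=3, lambda_=0.5):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((cosine(vecs[c], vecs[s]) for s in selected),
                             default=0.0)
            return lambda_ * relevance[c] - (1 - lambda_) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

vecs = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}
relevance = {"a": 0.95, "b": 0.94, "c": 0.60}
print(mmr(["a", "b", "c"], relevance, vecs, k=2))  # → ['a', 'c']
```

Pure relevance ranking would return "a" then "b", which are near-duplicates; MMR replaces "b" with the dissimilar "c".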

Multi-language Challenges

Issue: Content is in multiple languages, and a single-language embedding model performs poorly across them. Solution:

  • Use multi-language sentence transformer
  • Translate all content to single language
  • Train separate models per language
  • Use language-specific models

Example Use Cases

Academic Paper Recommendations

Scenario: Research platform with 500k papers, need to recommend relevant papers based on abstracts. Configuration:

  • Model: all-mpnet-base-v2 (high quality for scientific text)
  • Content: title + abstract + keywords
  • Top-20 recommendations

Why: Rich academic content, semantic understanding crucial, technical terminology

Job Matching Platform

Scenario: Job board with 200k job postings, matching candidates to jobs. Configuration:

  • Model: all-MiniLM-L6-v2 (balanced)
  • Content: job_title + description + requirements + skills
  • Top-15 recommendations
  • Combine with candidate's resume/profile

Why: Semantic matching of skills and requirements, understands job descriptions

Multi-language News Recommendations

Scenario: International news platform with content in 10 languages. Configuration:

  • Model: paraphrase-multilingual-mpnet-base-v2
  • Content: article_title + article_text + category
  • Top-20 recommendations

Why: Multi-language support, semantic understanding of news topics, cross-language recommendations
