Recommendation Systems
Systems that predict user preferences and surface relevant items
Recommendation systems predict what users might like based on past behavior, item attributes, or patterns from similar users. They power product suggestions, content feeds, and search rankings across most digital platforms.
The Problem
Unlike traditional supervised learning, recommendation systems deal with:
- Sparse data: Most users interact with only a tiny fraction of available items
- Implicit signals: Clicks, views, and time spent rather than explicit ratings
- Scale: Millions of users and items, but you need fast responses
- Ranking: The order matters more than exact scores
The goal isn't predicting a rating—it's surfacing the right items in the right order.
Types of Data
Explicit feedback: Users tell you directly—ratings, likes, thumbs up/down. Clear but rare.
Implicit feedback: Clicks, purchases, watch time, skips. Noisy but abundant. A click doesn't guarantee interest, but it's better than nothing.
Item features: Descriptions, categories, tags, embeddings from text or images.
User features: Demographics, location, device, past behavior.
Most systems run on implicit feedback because that's what users generate naturally.
Evaluation Metrics
Standard classification metrics don't work well here. Accuracy on predicted ratings misses the point—you care about ranking quality.
Precision@K and Recall@K
Precision@K: Of the K items you recommended, how many were relevant?
Recall@K: Of all relevant items, how many did you surface in the top K?
Both treat all positions equally—whether something is rank 1 or rank 10 doesn't matter. That's unrealistic.
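Both metrics are straightforward to compute; a minimal sketch (function names are illustrative, not from any library):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

# Swapping positions within the top k changes nothing -- both metrics ignore order.
recs = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
print(precision_at_k(recs, relevant, 3))  # 2 of the 3 recommended are relevant
print(recall_at_k(recs, relevant, 3))     # 2 of the 3 relevant were surfaced
```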
Mean Average Precision (MAP)
MAP rewards putting relevant items higher in the list. For a single user, average precision computes precision at each relevant position, then averages:

AP = (1/R) · Σᵢ P(i) · rel(i)

Where R is the total number of relevant items, P(i) is precision at position i, and rel(i) is 1 if the item at position i is relevant (0 otherwise). MAP is the mean of AP over all users.
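The definition translates directly into code (a sketch; MAP is this value averaged across users):

```python
def average_precision(recommended, relevant):
    """AP: precision at each relevant position, averaged over total relevant items."""
    hits, total = 0, 0.0
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision up to position i
    return total / len(relevant) if relevant else 0.0

# Relevant items land at ranks 1 and 3; one relevant item ("x") is missed,
# so AP = (1/1 + 2/3) / 3
print(average_precision(["a", "b", "c", "d"], {"a", "c", "x"}))
```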
Normalized Discounted Cumulative Gain (NDCG)
NDCG discounts lower positions with a logarithm, reflecting how users lose interest scrolling down:

DCG@K = Σᵢ₌₁ᴷ rel(i) / log₂(i + 1)

Normalize by dividing by the ideal DCG (the DCG of a perfect ranking):

NDCG@K = DCG@K / IDCG@K
NDCG values range from 0 to 1. Widely used because it captures position sensitivity and allows comparisons across users.
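A minimal sketch of the computation (graded relevance scores here are made up for illustration):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: later positions are log-discounted."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; putting the best item last scores lower.
print(ndcg_at_k([3, 2, 1], 3))  # 1.0
print(ndcg_at_k([1, 2, 3], 3))  # < 1.0
```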
Beyond Accuracy
Coverage: What fraction of the catalog gets recommended? Low coverage means you're showing the same items to everyone.
Diversity: How different are the recommended items from each other? High diversity prevents recommending 10 similar items.
Serendipity: How often do you recommend items the user likes but wouldn't find on their own?
Novelty: Are you surfacing items users haven't seen before? Measured by recommending less popular items.
Real systems balance all of these, and the tradeoffs are real: pure accuracy optimization recommends safe, similar items and produces boring, homogeneous lists, while pushing diversity surfaces varied items at some cost in relevance. Balancing the two is a deliberate design choice driven by business goals.
Popularity-Based
Recommend what most people interact with. Simple baseline.
Strengths:
- No personalization needed—works for cold-start users
- Stable and easy to compute
- Always has something to show
Weaknesses:
- Zero personalization—everyone sees the same list
- Reinforces popularity bias (rich get richer)
- Long-tail items never surface
Improvements:
- Segment by region, age, or category for local popularity
- Apply temporal decay—weight recent interactions more
- Use as a retrieval baseline, not the final ranker
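A toy popularity baseline with exponential recency decay (the half-life value and data are arbitrary assumptions for illustration):

```python
import math
from collections import defaultdict

def popularity_scores(interactions, now, half_life_days=7.0):
    """Rank items by decayed interaction count: recent events weigh more.

    interactions: list of (item_id, timestamp_in_days) pairs.
    """
    scores = defaultdict(float)
    for item, ts in interactions:
        age = now - ts
        # Each event's weight halves every half_life_days
        scores[item] += math.exp(-math.log(2) * age / half_life_days)
    return sorted(scores, key=scores.get, reverse=True)

# "b" has more raw interactions, but "a" is far more recent
events = [("a", 10.0), ("a", 9.5), ("b", 1.0), ("b", 1.5), ("b", 2.0)]
print(popularity_scores(events, now=10.0))  # ['a', 'b']
```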
Content-Based Filtering
Recommend items similar to what the user already liked, based on item features.
How it works:
- Represent items as feature vectors (TF-IDF, embeddings, metadata)
- Build a user profile by averaging feature vectors of items they liked
- Rank new items by similarity to the user profile (cosine similarity)
Strengths:
- Handles new items well (item cold-start)—as long as they have features
- Transparent—easy to explain why something was recommended
- Doesn't need data from other users
Weaknesses:
- Over-specialization—users get stuck in a filter bubble
- Ignores collaborative signals (what similar users like)
- Requires good item features
Example: User reads sci-fi books → profile leans toward sci-fi features → recommend more sci-fi. Works, but limits discovery.
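The three steps above can be sketched with plain vectors (the feature axes and values are fabricated for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def user_profile(liked_vectors):
    """Average the feature vectors of items the user liked."""
    n = len(liked_vectors)
    return [sum(col) / n for col in zip(*liked_vectors)]

# Hypothetical feature axes: [sci-fi, romance]
items = {"dune": [0.9, 0.1], "neuromancer": [0.8, 0.0], "notebook": [0.0, 0.9]}
profile = user_profile([items["dune"], items["neuromancer"]])  # leans sci-fi
ranked = sorted(items, key=lambda i: cosine(profile, items[i]), reverse=True)
print(ranked)  # the romance title ranks last
```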
Collaborative Filtering
Learn from collective behavior. If two users behave similarly, recommend items one liked to the other.
User-Based
Find users with similar interaction patterns using cosine similarity:

sim(u, v) = (rᵤ · rᵥ) / (‖rᵤ‖ ‖rᵥ‖)

Where rᵤ and rᵥ are the users' interaction vectors. Then recommend items that similar users liked.
Problem: User vectors are sparse. Finding reliable neighbors is hard with limited overlap.
Item-Based
Compare items instead of users. Two items are similar if many of the same users interacted with both:

sim(i, j) = (rᵢ · rⱼ) / (‖rᵢ‖ ‖rⱼ‖)

Where rᵢ and rⱼ are the items' interaction vectors (one entry per user). Score items for a user by a weighted sum of similarities to items they already interacted with:

score(u, i) = Σⱼ∈I(u) sim(i, j) · rᵤⱼ

Where I(u) is the set of items user u has interacted with.
Why better: Item vectors are denser (popular items have many interactions), so similarities are more stable. Amazon famously uses this.
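A minimal item-based sketch over a tiny interaction matrix (the data is fabricated; real systems pre-compute the similarities):

```python
import math

def item_cosine(ratings, i, j):
    """Cosine similarity between two item columns of a user->item dict."""
    dot = sum(ratings[u].get(i, 0) * ratings[u].get(j, 0) for u in ratings)
    ni = math.sqrt(sum(ratings[u].get(i, 0) ** 2 for u in ratings))
    nj = math.sqrt(sum(ratings[u].get(j, 0) ** 2 for u in ratings))
    return dot / (ni * nj) if ni and nj else 0.0

def score(ratings, user, candidate):
    """Weighted sum of similarities to items the user already interacted with."""
    return sum(item_cosine(ratings, candidate, j) * r
               for j, r in ratings[user].items())

# users -> {item: interaction strength}; u3 has only touched "a"
ratings = {
    "u1": {"a": 1, "b": 1},
    "u2": {"a": 1, "b": 1, "c": 1},
    "u3": {"a": 1},
}
# "b" co-occurs with "a" more often than "c" does, so it scores higher for u3
print(score(ratings, "u3", "b") > score(ratings, "u3", "c"))  # True
```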
Strengths:
- No item features needed
- Learns from collective intelligence
- Scales well (pre-compute item similarities)
Weaknesses:
- Cold-start for new items (no interactions yet)
- Cold-start for new users (too few interactions)
- Requires enough data to find patterns
Matrix Factorization
Assume user-item interactions are explained by latent factors. Represent users and items as low-dimensional vectors, then predict interactions as dot products.
Given a sparse user-item matrix R, find a user matrix U and an item matrix V such that:

R ≈ U Vᵀ,  i.e. r̂ᵤᵢ = uᵤ · vᵢ

Learn by minimizing the regularized squared error over observed entries:

min Σ₍ᵤ,ᵢ₎∈Ω (rᵤᵢ − uᵤ · vᵢ)² + λ (‖uᵤ‖² + ‖vᵢ‖²)

Where Ω is the set of observed interactions and λ controls regularization strength.
Strengths:
- Handles sparsity—learns even with minimal overlap
- Latent factors capture hidden preferences (genres, styles, themes)
- Scalable with efficient training (ALS, SGD)
Intuitively, matrix factorization embeds users and items in a shared latent preference space: recommendations emerge from geometric proximity rather than explicit rules, with nearby points indicating similar tastes or characteristics.
Weaknesses:
- Cold-start remains (new users/items have no learned vectors)
- Assumes preferences decompose into small number of factors
Training methods:
- Alternating Least Squares (ALS): Fix one matrix, solve for the other. Parallelizes well.
- Stochastic Gradient Descent (SGD): Update vectors incrementally based on gradients. Flexible for custom loss functions.
- SVD++: Extends basic matrix factorization with implicit feedback signals.
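A bare-bones SGD trainer for the factorization objective above (the dimensions, learning rate, regularization, and toy data are arbitrary choices, not tuned values):

```python
import random

def train_mf(observed, n_users, n_items, k=2, lr=0.05, reg=0.01,
             epochs=200, seed=0):
    """Fit U (users x k) and V (items x k) by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in observed:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on squared error plus L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# Observed interactions: (user, item, rating)
data = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
U, V = train_mf(data, n_users=2, n_items=2)
pred = sum(U[0][f] * V[0][f] for f in range(2))
print(round(pred, 1))  # close to the observed 5.0
```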
Choosing an Approach
No single method wins everywhere. Consider:
Data availability:
- Little data → Popularity or content-based
- Rich interaction history → Collaborative filtering or matrix factorization
Cold-start severity:
- Many new users → Popularity baseline + content-based
- Many new items → Content-based (needs item features)
Scale:
- Small catalog → Memory-based methods (user/item kNN)
- Large catalog → Embedding-based (matrix factorization, neural models)
Business goals:
- Maximize clicks → Optimize CTR, precision
- Discovery → Balance novelty, diversity, serendipity
- Revenue → Optimize conversion rate
In practice, most production systems use a two-stage architecture:
- Retrieval: Fast, broad filtering (popularity, embeddings, simple collaborative filtering). Get hundreds of candidates.
- Ranking: Expensive models (gradient boosting, neural rankers) on candidates. Order by relevance.
This balances latency and accuracy.
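The two stages can be sketched end to end (the candidate counts and the stand-in scoring functions are illustrative assumptions, not real models):

```python
def retrieve(catalog, popularity, n_candidates=100):
    """Stage 1: cheap, broad filter -- most popular items become candidates."""
    return sorted(catalog, key=lambda i: popularity.get(i, 0),
                  reverse=True)[:n_candidates]

def rank(candidates, model_score, k=10):
    """Stage 2: run the expensive model only on the shortlist."""
    return sorted(candidates, key=model_score, reverse=True)[:k]

catalog = [f"item{i}" for i in range(1000)]
popularity = {item: i for i, item in enumerate(catalog)}  # item999 most popular
shortlist = retrieve(catalog, popularity)                 # 100 items, not 1000
top = rank(shortlist, lambda item: -int(item[4:]) % 250)  # stand-in for a ranker
print(len(shortlist), len(top))  # 100 10
```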
Practical Considerations
Implicit feedback: Weight interactions differently (purchases > clicks > views). Use confidence weighting in loss functions.
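A sketch of both ideas (the event weights are hypothetical; the confidence formula c = 1 + α·r is a common scheme for implicit-feedback loss weighting):

```python
# Hypothetical per-event weights: stronger signals count more
EVENT_WEIGHT = {"view": 1.0, "click": 2.0, "purchase": 5.0}

def implicit_rating(events):
    """Aggregate a user's events on one item into a single implicit rating."""
    return sum(EVENT_WEIGHT.get(e, 0.0) for e in events)

def confidence(r, alpha=40.0):
    """Confidence weight for the loss term: c = 1 + alpha * r."""
    return 1.0 + alpha * r

print(implicit_rating(["view", "click", "purchase"]))  # 8.0
```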
Temporal effects: Recent behavior matters more. Use recency weighting or sequence models (RNNs, Transformers).
Context: Time of day, device, location affect preferences. Add context features to models.
Feedback loops: Recommending popular items makes them more popular. Break cycles with exploration (randomization, diversity constraints).
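Epsilon-greedy slot randomization is one simple way to inject exploration (the epsilon value is an arbitrary assumption):

```python
import random

def recommend_with_exploration(ranked_items, catalog, k=10, epsilon=0.1,
                               rng=random):
    """Fill each of k slots from the ranker, but with probability epsilon
    swap in a random catalog item to keep exploring beyond the feedback loop."""
    recs = []
    for item in ranked_items[:k]:
        if rng.random() < epsilon:
            recs.append(rng.choice(catalog))  # exploration slot
        else:
            recs.append(item)                 # exploitation slot
    return recs

# epsilon=0 reduces to the ranker's output unchanged
print(recommend_with_exploration(["a", "b", "c"], ["x", "y"], k=3, epsilon=0.0))
```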
A/B testing: Offline metrics don't predict online performance. Test changes with live traffic splits.