Recommendation Systems
Systems that predict user preferences and surface relevant items
Recommendation systems predict what users might like based on past behavior, item attributes, or patterns from similar users. They power product suggestions, content feeds, and search rankings across most digital platforms.
The Problem
Unlike traditional supervised learning, recommendation systems deal with:
- Sparse data: Most users interact with only a tiny fraction of available items
- Implicit signals: Clicks, views, and time spent rather than explicit ratings
- Scale: Millions of users and items, but you need fast responses
- Ranking: The order matters more than exact scores
The goal isn't predicting a rating—it's surfacing the right items in the right order.
Types of Data
Explicit feedback: Users tell you directly—ratings, likes, thumbs up/down. Clear but rare.
Implicit feedback: Clicks, purchases, watch time, skips. Noisy but abundant. A click doesn't guarantee interest, but it's better than nothing.
Item features: Descriptions, categories, tags, embeddings from text or images.
User features: Demographics, location, device, past behavior.
Most systems run on implicit feedback because that's what users generate naturally.
Evaluation Metrics
Standard classification metrics don't work well here. Accuracy on predicted ratings misses the point—you care about ranking quality.
Precision@K and Recall@K
Precision@K: Of the K items you recommended, how many were relevant?
Recall@K: Of all relevant items, how many did you surface in the top K?
Both treat all positions equally—whether something is rank 1 or rank 10 doesn't matter. That's unrealistic.
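Both metrics are straightforward to compute; a minimal sketch (function names are illustrative, not from any library):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

# Swapping positions within the top k changes nothing -- both metrics ignore order.
recs = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
print(precision_at_k(recs, relevant, 3))  # 2 of the 3 recommended are relevant
print(recall_at_k(recs, relevant, 3))     # 2 of the 3 relevant were surfaced
```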
Mean Average Precision (MAP)
MAP rewards putting relevant items higher in the list. For a single user, average precision computes precision at each relevant position, then averages:

AP = (1/R) · Σᵢ P(i) · rel(i)

Where R is the total number of relevant items, P(i) is precision at position i, and rel(i) is 1 if the item at position i is relevant (0 otherwise). MAP is the mean of AP over all users.
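The definition translates directly into code (a sketch; MAP is this value averaged across users):

```python
def average_precision(recommended, relevant):
    """AP: precision at each relevant position, averaged over total relevant items."""
    hits, total = 0, 0.0
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision up to position i
    return total / len(relevant) if relevant else 0.0

# Relevant items land at ranks 1 and 3; one relevant item ("x") is missed,
# so AP = (1/1 + 2/3) / 3
print(average_precision(["a", "b", "c", "d"], {"a", "c", "x"}))
```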
Normalized Discounted Cumulative Gain (NDCG)
NDCG discounts lower positions with a logarithm, reflecting how users lose interest scrolling down:

DCG@K = Σᵢ₌₁ᴷ rel(i) / log₂(i + 1)

Normalize by dividing by the ideal DCG (the DCG of a perfect ranking):

NDCG@K = DCG@K / IDCG@K
NDCG values range from 0 to 1. Widely used because it captures position sensitivity and allows comparisons across users.
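A minimal sketch of the computation (graded relevance scores here are made up for illustration):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: later positions are log-discounted."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; putting the best item last scores lower.
print(ndcg_at_k([3, 2, 1], 3))  # 1.0
print(ndcg_at_k([1, 2, 3], 3))  # < 1.0
```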
Beyond Accuracy
Coverage: What fraction of the catalog gets recommended? Low coverage means you're showing the same items to everyone.
Diversity: How different are the recommended items from each other? High diversity prevents recommending 10 similar items.
Serendipity: How often do you recommend items the user likes but wouldn't find on their own?
Novelty: Are you surfacing items users haven't seen before? Measured by recommending less popular items.
Real systems balance all of these, and the tradeoffs are real: pure accuracy optimization recommends safe, similar items and produces boring, homogeneous lists, while pushing diversity surfaces varied items at some cost in relevance. Balancing the two is a deliberate design choice driven by business goals.
Popularity-Based
Recommend what most people interact with. Simple baseline.
Strengths:
- No personalization needed—works for cold-start users
- Stable and easy to compute
- Always has something to show
Weaknesses:
- Zero personalization—everyone sees the same list
- Reinforces popularity bias (rich get richer)
- Long-tail items never surface
Improvements:
- Segment by region, age, or category for local popularity
- Apply temporal decay—weight recent interactions more
- Use as a retrieval baseline, not the final ranker
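A toy popularity baseline with exponential recency decay (the half-life value and data are arbitrary assumptions for illustration):

```python
import math
from collections import defaultdict

def popularity_scores(interactions, now, half_life_days=7.0):
    """Rank items by decayed interaction count: recent events weigh more.

    interactions: list of (item_id, timestamp_in_days) pairs.
    """
    scores = defaultdict(float)
    for item, ts in interactions:
        age = now - ts
        # Each event's weight halves every half_life_days
        scores[item] += math.exp(-math.log(2) * age / half_life_days)
    return sorted(scores, key=scores.get, reverse=True)

# "b" has more raw interactions, but "a" is far more recent
events = [("a", 10.0), ("a", 9.5), ("b", 1.0), ("b", 1.5), ("b", 2.0)]
print(popularity_scores(events, now=10.0))  # ['a', 'b']
```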
Content-Based Filtering
Recommend items similar to what the user already liked, based on item features.
How it works:
- Represent items as feature vectors (TF-IDF, embeddings, metadata)
- Build a user profile by averaging feature vectors of items they liked
- Rank new items by similarity to the user profile (cosine similarity)
Strengths:
- Handles new items well (item cold-start)—as long as they have features
- Transparent—easy to explain why something was recommended
- Doesn't need data from other users
Weaknesses:
- Over-specialization—users get stuck in a filter bubble
- Ignores collaborative signals (what similar users like)
- Requires good item features
Example: User reads sci-fi books → profile leans toward sci-fi features → recommend more sci-fi. Works, but limits discovery.
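The three steps above can be sketched with plain vectors (the feature axes and values are fabricated for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def user_profile(liked_vectors):
    """Average the feature vectors of items the user liked."""
    n = len(liked_vectors)
    return [sum(col) / n for col in zip(*liked_vectors)]

# Hypothetical feature axes: [sci-fi, romance]
items = {"dune": [0.9, 0.1], "neuromancer": [0.8, 0.0], "notebook": [0.0, 0.9]}
profile = user_profile([items["dune"], items["neuromancer"]])  # leans sci-fi
ranked = sorted(items, key=lambda i: cosine(profile, items[i]), reverse=True)
print(ranked)  # the romance title ranks last
```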
Collaborative Filtering
Learn from collective behavior. If two users behave similarly, recommend items one liked to the other.
User-Based
Find users with similar interaction patterns using cosine similarity:

sim(u, v) = (rᵤ · rᵥ) / (‖rᵤ‖ ‖rᵥ‖)

Where rᵤ and rᵥ are the users' interaction vectors. Then recommend items that similar users liked.
Problem: User vectors are sparse. Finding reliable neighbors is hard with limited overlap.
Item-Based
Compare items instead of users. Two items are similar if many of the same users interacted with both:

sim(i, j) = (rᵢ · rⱼ) / (‖rᵢ‖ ‖rⱼ‖)

Where rᵢ and rⱼ are the items' interaction vectors (one entry per user). Score items for a user by a weighted sum of similarities to items they already interacted with:

score(u, i) = Σⱼ∈I(u) sim(i, j) · rᵤⱼ

Where I(u) is the set of items user u has interacted with.
Why better: Item vectors are denser (popular items have many interactions), so similarities are more stable. Amazon famously uses this.
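A minimal item-based sketch over a tiny interaction matrix (the data is fabricated; real systems pre-compute the similarities):

```python
import math

def item_cosine(ratings, i, j):
    """Cosine similarity between two item columns of a user->item dict."""
    dot = sum(ratings[u].get(i, 0) * ratings[u].get(j, 0) for u in ratings)
    ni = math.sqrt(sum(ratings[u].get(i, 0) ** 2 for u in ratings))
    nj = math.sqrt(sum(ratings[u].get(j, 0) ** 2 for u in ratings))
    return dot / (ni * nj) if ni and nj else 0.0

def score(ratings, user, candidate):
    """Weighted sum of similarities to items the user already interacted with."""
    return sum(item_cosine(ratings, candidate, j) * r
               for j, r in ratings[user].items())

# users -> {item: interaction strength}; u3 has only touched "a"
ratings = {
    "u1": {"a": 1, "b": 1},
    "u2": {"a": 1, "b": 1, "c": 1},
    "u3": {"a": 1},
}
# "b" co-occurs with "a" more often than "c" does, so it scores higher for u3
print(score(ratings, "u3", "b") > score(ratings, "u3", "c"))  # True
```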
Strengths:
- No item features needed
- Learns from collective intelligence
- Scales well (pre-compute item similarities)
Weaknesses:
- Cold-start for new items (no interactions yet)
- Cold-start for new users (too few interactions)
- Requires enough data to find patterns
Matrix Factorization
Assume user-item interactions are explained by latent factors. Represent users and items as low-dimensional vectors, then predict interactions as dot products.
Given a sparse user-item matrix R, find a user matrix U and an item matrix V such that:

R ≈ U Vᵀ,  i.e. r̂ᵤᵢ = uᵤ · vᵢ

Learn by minimizing the regularized squared error over observed entries:

min Σ₍ᵤ,ᵢ₎∈Ω (rᵤᵢ − uᵤ · vᵢ)² + λ (‖uᵤ‖² + ‖vᵢ‖²)

Where Ω is the set of observed interactions and λ controls regularization strength.
Strengths:
- Handles sparsity—learns even with minimal overlap
- Latent factors capture hidden preferences (genres, styles, themes)
- Scalable with efficient training (ALS, SGD)
Intuitively, matrix factorization embeds users and items in a shared latent preference space: recommendations emerge from geometric proximity rather than explicit rules, with nearby points indicating similar tastes or characteristics.
Weaknesses:
- Cold-start remains (new users/items have no learned vectors)
- Assumes preferences decompose into small number of factors
Training methods:
- Alternating Least Squares (ALS): Fix one matrix, solve for the other. Parallelizes well.
- Stochastic Gradient Descent (SGD): Update vectors incrementally based on gradients. Flexible for custom loss functions.
- SVD++: Extends basic matrix factorization with implicit feedback signals.
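A bare-bones SGD trainer for the factorization objective above (the dimensions, learning rate, regularization, and toy data are arbitrary choices, not tuned values):

```python
import random

def train_mf(observed, n_users, n_items, k=2, lr=0.05, reg=0.01,
             epochs=200, seed=0):
    """Fit U (users x k) and V (items x k) by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in observed:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # Gradient step on squared error plus L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# Observed interactions: (user, item, rating)
data = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
U, V = train_mf(data, n_users=2, n_items=2)
pred = sum(U[0][f] * V[0][f] for f in range(2))
print(round(pred, 1))  # close to the observed 5.0
```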
Choosing an Approach
No single method wins everywhere. Consider:
Data availability:
- Little data → Popularity or content-based
- Rich interaction history → Collaborative filtering or matrix factorization
Cold-start severity:
- Many new users → Popularity baseline + content-based
- Many new items → Content-based (needs item features)
Scale:
- Small catalog → Memory-based methods (user/item kNN)
- Large catalog → Embedding-based (matrix factorization, neural models)
Business goals:
- Maximize clicks → Optimize CTR, precision
- Discovery → Balance novelty, diversity, serendipity
- Revenue → Optimize conversion rate
In practice, most production systems use a two-stage architecture:
- Retrieval: Fast, broad filtering (popularity, embeddings, simple collaborative filtering). Get hundreds of candidates.
- Ranking: Expensive models (gradient boosting, neural rankers) on candidates. Order by relevance.
This balances latency and accuracy.
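The two stages can be sketched end to end (the candidate counts and the stand-in scoring functions are illustrative assumptions, not real models):

```python
def retrieve(catalog, popularity, n_candidates=100):
    """Stage 1: cheap, broad filter -- most popular items become candidates."""
    return sorted(catalog, key=lambda i: popularity.get(i, 0),
                  reverse=True)[:n_candidates]

def rank(candidates, model_score, k=10):
    """Stage 2: run the expensive model only on the shortlist."""
    return sorted(candidates, key=model_score, reverse=True)[:k]

catalog = [f"item{i}" for i in range(1000)]
popularity = {item: i for i, item in enumerate(catalog)}  # item999 most popular
shortlist = retrieve(catalog, popularity)                 # 100 items, not 1000
top = rank(shortlist, lambda item: -int(item[4:]) % 250)  # stand-in for a ranker
print(len(shortlist), len(top))  # 100 10
```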
Practical Considerations
Implicit feedback: Weight interactions differently (purchases > clicks > views). Use confidence weighting in loss functions.
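A sketch of both ideas (the event weights are hypothetical; the confidence formula c = 1 + α·r is a common scheme for implicit-feedback loss weighting):

```python
# Hypothetical per-event weights: stronger signals count more
EVENT_WEIGHT = {"view": 1.0, "click": 2.0, "purchase": 5.0}

def implicit_rating(events):
    """Aggregate a user's events on one item into a single implicit rating."""
    return sum(EVENT_WEIGHT.get(e, 0.0) for e in events)

def confidence(r, alpha=40.0):
    """Confidence weight for the loss term: c = 1 + alpha * r."""
    return 1.0 + alpha * r

print(implicit_rating(["view", "click", "purchase"]))  # 8.0
```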
Temporal effects: Recent behavior matters more. Use recency weighting or sequence models (RNNs, Transformers).
Context: Time of day, device, location affect preferences. Add context features to models.
Feedback loops: Recommending popular items makes them more popular. Break cycles with exploration (randomization, diversity constraints).
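Epsilon-greedy slot randomization is one simple way to inject exploration (the epsilon value is an arbitrary assumption):

```python
import random

def recommend_with_exploration(ranked_items, catalog, k=10, epsilon=0.1,
                               rng=random):
    """Fill each of k slots from the ranker, but with probability epsilon
    swap in a random catalog item to keep exploring beyond the feedback loop."""
    recs = []
    for item in ranked_items[:k]:
        if rng.random() < epsilon:
            recs.append(rng.choice(catalog))  # exploration slot
        else:
            recs.append(item)                 # exploitation slot
    return recs

# epsilon=0 reduces to the ranker's output unchanged
print(recommend_with_exploration(["a", "b", "c"], ["x", "y"], k=3, epsilon=0.0))
```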
A/B testing: Offline metrics don't predict online performance. Test changes with live traffic splits.