Collaborative Filtering

Collaborative filtering is a common technique used in recommender systems that relies on past user interactions to predict future preferences. Unlike content-based filtering, which uses item attributes (like genre or duration), collaborative filtering operates on the principle that “past similar preferences can inform future preferences”.

The Core Concept

The fundamental intuition behind collaborative filtering is that if users agreed on items in the past, they will likely agree on items in the future. For example, in a hypothetical streaming service called “Statflix,” if User A and User B have rated several shows similarly, the system considers them “similar.” Consequently, if User B likes a new show that User A has not yet seen, the system will recommend that show to User A.

This process typically involves constructing a matrix where rows represent users and columns represent items (or vice versa). The values in the cells represent ratings or interactions.
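As a concrete sketch, such a matrix can be represented in plain Python as a list of lists. The shows and ratings below are invented for the "Statflix" example; a 0 marks an item the user has not rated.

```python
# Rows are users, columns are items (shows).
# A value of 0 means "no rating yet" for that user/item pair.
ratings_matrix = [
    [5, 4, 0, 1],   # User A
    [5, 5, 0, 1],   # User B
    [1, 0, 5, 4],   # User C
]

num_users = len(ratings_matrix)
num_items = len(ratings_matrix[0])
```

In practice, production systems store this structure in a sparse format rather than a dense list, for the sparsity reasons discussed below.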

Calculating Similarity: Cosine Similarity

To determine which users are similar to one another, the system needs a mathematical similarity metric. A standard choice in collaborative filtering is cosine similarity.

Cosine similarity treats the preferences of each user as a vector in a high-dimensional space. The similarity between two users is calculated as the cosine of the angle between their respective vectors.

  • Geometric Interpretation: Imagine two vectors originating from the same point. If the vectors point in roughly the same direction, the angle between them is small. As the angle decreases, the cosine of that angle approaches 1. Therefore, vectors that are close together have a cosine similarity near 1, indicating high similarity between users.
  • Orthogonality: If the vectors are at a 90-degree angle (orthogonal), the cosine is 0. This indicates that the users’ preferences are unrelated.
  • Opposite Preferences: If vectors point in opposite directions (180 degrees), the cosine is -1, indicating directly opposed preferences.

Mathematically, the cosine similarity between two vectors x and y is defined as the inner product (dot product) of the vectors divided by the product of their lengths (norms):

\cos(\omega) = \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert}

This formula normalizes the vectors, meaning the metric focuses on the orientation (the pattern of preferences) rather than the magnitude (e.g., how strictly a user rates on a scale).
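The formula translates directly into a few lines of Python. The rating vectors below are invented for illustration; they cover only the items the users have rated in common.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two preference vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Illustrative rating vectors over three shared items.
user_a = [5, 4, 1]
user_b = [5, 5, 1]
user_c = [1, 1, 5]

sim_ab = cosine_similarity(user_a, user_b)  # near 1: similar tastes
sim_ac = cosine_similarity(user_a, user_c)  # much lower: divergent tastes
```

Because the dot product is divided by both norms, doubling every rating in a vector leaves its similarity to other users unchanged, which is exactly the orientation-over-magnitude property described above.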

Making Predictions

Once similarity scores are calculated, the system predicts a user’s rating for an unseen item using a weighted average.

  1. Identify Neighbors: The system identifies other users who have rated the target item.
  2. Weight Ratings: The ratings given by these other users are weighted by their similarity score to the active user. For example, if User A is very similar to User B (e.g., similarity 0.99) and less similar to User C (e.g., similarity 0.57), User B’s rating will have a much stronger influence on the prediction than User C’s.
  3. Normalize: The sum of these weighted ratings is divided by the sum of the similarity scores to normalize the result, ensuring the predicted rating stays within the valid range (e.g., 1 to 5 stars).
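The three steps above can be sketched as a weighted average. The similarity scores echo the 0.99 and 0.57 figures from step 2; the neighbours' ratings are invented for illustration.

```python
# Hypothetical neighbours who have rated the target item.
similarities = [0.99, 0.57]   # similarity of Users B and C to the active user
neighbour_ratings = [5, 3]    # their ratings for the target item

# Steps 2 and 3: weight each rating by similarity, then normalize
# by the total similarity so the result stays on the 1-5 scale.
weighted_sum = sum(s * r for s, r in zip(similarities, neighbour_ratings))
prediction = weighted_sum / sum(similarities)
```

Because User B's similarity (0.99) outweighs User C's (0.57), the prediction lands much closer to User B's rating of 5 than to User C's rating of 3.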

Challenges

While intuitive, collaborative filtering faces specific implementation barriers:

  • Sparsity: In real-world systems with millions of users and items, most users have not rated most items. This results in a “sparse” matrix with many empty cells, making it difficult to find users with enough overlapping ratings to calculate reliable similarities.
  • Scalability: Calculating similarity scores for every pair of users or items can be computationally expensive as the number of users grows.
  • Cold Start: It is difficult to provide recommendations for new users who have no interaction history, as there is no data to compare with others.