IR Lab - Guy Lebanon

C/Matlab Toolkit for Collaborative Filtering

Guy Lebanon
Language Technologies Institute
Carnegie Mellon University
August 2003

Information Retrieval Lab, 11-742
Lab instructor: Yiming Yang

Abstract
Introduction
System Description and Code
Tutorial
Experiments
Bibliography

Abstract

This toolkit is a set of C and Matlab functions implementing several methods of collaborative filtering (CF). In addition to implementing several algorithm proposed in the recent literature, we also supply functions for loading, handling and evaluating collaborative filtering methods. The main computational modules are implemented in C as mex files that are called from a Matlab environment thus achieving efficient computation of the bottlenck routines.

Introduction

Collaborative filtering (CF) is the task of predicting the preferences of a user (called the active user) for items unobserved by him. The preferences are predicted based on the active user preference of a set of observed items and preference of other users. Note that the item content does not play a role in the prediction. The prediction are made on the basis of preference information. Adding content information as features to the prediction algorithm may improve the result, but in this toolkit we decided to focus on the traditional collaborative filtering methods that does not employ content features.

Breese et al. (1998) divides the approaches to collaborative filtering to the following two classes of algorithms

Memory based algorithms. Popular examples are Pearson correlation coefficient, vector similarity and various extensions of them. These algorithms are usually non-probabilistic and may be seen as the ranking-analogue of nearest neighbor classifier. These method go back to the original setup of the collaborative filtering task.
Model based algorithms. Popular examples are Bayesian networks, dependency networks, personality diagnosis etc. These algorithms build a statistical model that predicts the preference of the user.

In this toolkit we implement the memory based algorithms of Pearson correlation coefficient, vector similarity, default voting and case amplification (see Breese et al, 1998 for a description). We also implement the model based method of personality diagnosis proposed by Pennock et al (2000). In addition, we implement routines for testing the method and evaluating them using different metrics and routines for handling the eachmovie data which is the canonical dataset for ranked collaborative filtering experiments.

A more detailed description of the collaborative filtering task may be found here.

System Description and Code

The toolkit is intended to run from a Matlab environment. It is platform independent and should work on any of the recent Matlab versions (6-6.5). The computationally intensive routines were written in C as mex files and called from Matlab using the Mathworks application program interface. The mex files are available in a precompiled form for PC (.dll files). If the toolkit is used on a different platform, the mex files have to be compiled using Matlab's mex command.

Preparing the software:

Download all the available files (available as a zip file) into a single directory.
Recompile the mex files if needed using the command "mex -O computePosteriorPD.c" and "mex -O memoryBasedModels.c". The -O flag indicates the optimization level.
You are now ready to start using the software.

Software functionality

eachMovieReader.m - loads EachMovie data from the file Vote.txt into a Matlab sparse array.
splitUsers.m - splits (randomly) the sparse array output by eachMovieReader.m into two sparse arrays representing the votes of the active users and the other users.
sparseMat2CellVec.m converts a sparse array representing votes to a cell array.
memoryBasedModels.c computes the similarity matrix between the active users and the other users. The function supports the following similarity measures: Pearson correlation, vector similarity, default voting, case amplification (see Breese et al., 1998). The precompiled version for PC is memoryBasedModels.dll.
predictPreferenceMemBased.m computes the predicted preference of the items for an active user using the similarity matrix output by memoryBasedModels.c.
computePosteriorPD.c computes the posterior and mean posterior for the predicted value of a specific item for the active user using the personality diagnosis method (see Pennock et al., 2000). Precompiled version for windows is computePosteriorPD.dll.
EvalAvgEachMovie.m evaluates the prediction method that predicts the preference for all items as the average preference of the user (based on the training data).
EvalConstEachMovie.m evaluates the prediction method that predicts the mean preference for all items for all users.
EvalMemBasedtEachMovie.m evaluates the memory based prediction methods.
EvalPDEachMovie.m evaluates the memory based prediction methods.
rankedEvalCF.m a ranked evaluation method that generalizes the evaluation method proposed by in Heckerman et al., 2000 to multiple preference values.
eachMovieComparison.m manages an experiment that compares the performance of different CF methods on the EachMovie dataset (inner function).
eachMovieComparisonScript.m manages an experiment that compares the performance of different CF methods on the EachMovie dataset (outer script).

For more information on using the toolkit read the Tutorial.

Experiments

Preference are more often than not unreported

Using the spy command and the sparse matrix output from eachMovieFileReader.m it is possible to visualize the missing information pattern across the users. The following image shows a blue dote when the appropriate user (row) reported a preference on the appropriate item (column). The rows represent the first 600 users and the column the items that the first 600 users reported their preferences.

Power law behavior of the number of users reporting preference of each item and the number of items each user reports

The histogram of the number of users reporting the preference of the different items exhibit a linear trend in the log-log scale. The x axis represent how times items were reported and the y axis represent number of items that fall in this category. 10 bins were used in generating the histogram.

The histogram of the number of items the different users report exhibit a linear trend in the log-log scale. The x axis represents number of users reporting the same items and the y axis representing number of occurrences that fall in this category. 10 bins were used in generating the histogram.

Rank-Value plots

The following two graphs show rank-value graphs for the number of votes the different items have (first figure) and the number of items different users report.

statistics for the graphs above were obtained from analyzing the first 1000 users.

Performance comparison of different collaborative filtering strategies

We compared the performance of three different CF systems: Pearson correlation coefficient, vector similarity and personality diagnosis (using standard deviation parameter of 0.7). An additional baseline of predicting a average performance for all items was also considered. The evaluation measures are mean absolute deviation, mean square error and ranked evaluation measure which measures the expected true preference of the chosen item when the probability to choose a recommended item decays exponentially with its location in a sorted list of recommendations. This is a generalization of the measure in Heckerman et al., 2000 for multiple preference values.

	Pearson correlation	vector similarity	personality diagnosis	constant mean value recommendation
mean absolute deviation (lower=better)	1.0243	1.0090	1.1863	1.4738
mean square error (lower=better)	1.7664	1.7219	2.0185	2.7849
ranked evaluation (higher=better)	4.8426	4.8254	4.7676	4.3562

The experiments performed showed that Pearson correlation performed almost as well as vector similarity but personality diagnosis lags somewhat behind (yet still outperforms the trivial recommender of mean preference). Personality diagnosis has a free parameter - the standard deviation of the Gaussian model - whose optimization may improve the model's performance.

Bibliography

[1] J. S. Breese, D. Heckerman and C. Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, Uncertainty in Artificial Intelligence, 1998.

[2] D. M. Pennock, E. Horvitz, S. Lawrence and C. L. Giles. Collaborative Filtering by Personality Diagnosis: A Hybrid Memory and Model-Based approach. Uncertainty in Artificial Intelligence, 2000.

[3] C. M. Kadie, C. Meek and D. Heckerman. CFW: A Collaborative Filtering System Using Posteriors Over Weights of Evidence. Uncertainty in Artificial Intelligence, 2002.

[4] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, C. Kadie. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research 1, 2000.

Guy Lebanon, Last modified: August, 2003.