How to get an RMSE of 0.8937 in the NetFlix Challenge


The current NetFlix leader has an RMSE of 0.8847, which my approach won't be able to beat no matter how much computing power I throw at it. There is therefore no longer a reason for me to keep my approach secret.

I did my calculations using Python and a small amount of inline C++, using the Numpy and Weave libraries.

My approach is based on simonfunk's SVD method. That is, we predict a user's rating of a movie by taking the dot product of two vectors, one corresponding to the movie and one corresponding to the user.

My twist on this approach was to use a Bayesian sampler. Instead of seeking a single best model (set of vectors for movies and users), we sample from the posterior distribution of all possible models. For each sample, rating predictions are made, and the final prediction is the average of these sample predictions.

The sampler used was a Gibbs sampler. Initial movie and user vectors take random values. To produce a new sample one iterates:

The first few iterations need to be discarded.

An Empirical Bayesian conjugate prior was used. That is, when sampling each movie vector we have an apriori expectation that the vector will have a multivariate Gaussian distribution estimated using all current movie vectors. We then examine each known rating for that movie in turn, updating our beliefs appropriately (thus shrinking the multivariate Gaussian distribution). A sample is then drawn from the distribution obtained after examining all known ratings for that movie. The same process is used for user vectors.

Note that unlike simonfunk, who builds up his vectors one element at a time, my approach produces every element of the vector in one step (using some fairly simple linear algebra). The book "Bayesian Data Analysis" describes the necessary equations to do this.

The RMSE of 0.8937 was obtained using 256 element vectors, and took several weeks to compute on a 3GHz PC.