Recommendation Systems

Slides

 * The slides of the presentation about Recommendation systems: Recommendation-systems.pdf


 * The slides of the presentation about our project: Project RecommendationSystems-0.pdf

Source code
https://www.dropbox.com/sh/c7p52pe5jl5645t/AABo6ZupnTJSl4gWJAOssUxSa?dl=0

You may download the source code of our project here. The sparse matrices used are in the zipped file "matrices".

Introduction
In the first part of this project, we develop some useful methods to perform simple recommendation algorithms. As explained in the enclosed file Recommendation-systems.pdf, there are two broad groups:


 * Content-Based Recommendation Systems
 * Collaborative Filtering

Another group of algorithms was presented as well. Based on latent factors, these methods are quite recent and strongly related to matrix theory; examples include the SVD and the CUR decomposition. The next step in this project will focus on the application of such methods as part of the Netflix Challenge. A significant boost to research into recommendation systems came when Netflix offered a prize of $1,000,000 to the first person or team that could beat their own recommendation algorithm, CineMatch, by 10%. After over three years of work, the prize was awarded in September 2009. In the following sections, we will consider ourselves a team participating in the Netflix Challenge and suggest some basic algorithms to compete with CineMatch.

We divided the project into two distinct parts: the first goal is to provide movie recommendations to a particular user, while in the second part the focus is on estimating each entry of the utility matrix, i.e. the rating of each user for each movie. But before that, we had to preprocess the Netflix database in order to read it easily.

Database : Netflix Dataset
First and foremost, let’s take a look at the data available to us. The whole dataset can be downloaded from here, or from the Dropbox link. Here is a small description of it:

SUMMARY ================================================================================

This dataset was constructed to support participants in the Netflix Prize. See http://www.netflixprize.com for details about the prize. The movie rating files contain over 100 million ratings from 480 thousand randomly-chosen, anonymous Netflix customers over 17 thousand movie titles. The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received during this period. The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, each customer id has been replaced with a randomly-assigned id. The date of each rating and the title and year of release for each movie id are also provided.

USAGE LICENSE ================================================================================

Netflix can not guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:
 * The user may not state or imply any endorsement from Netflix.
 * The user must acknowledge the use of the data set in publications resulting from the use of the data set, and must send us an electronic or paper copy of those publications.
 * The user may not redistribute the data without separate permission.
 * The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from Netflix.

If you have any further questions or comments, please contact the Prize administrator.

TRAINING DATASET FILE DESCRIPTION ================================================================================

The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format: CustomerID,Rating,Date
 - MovieIDs range from 1 to 17770 sequentially.
 - CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
 - Ratings are on a five star (integral) scale from 1 to 5.
 - Dates have the format YYYY-MM-DD.

MOVIES FILE DESCRIPTION ================================================================================

Movie information in "movie_titles.txt" is in the following format: MovieID,YearOfRelease,Title
 - MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
 - YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
 - Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

To sum up, we have a file for each movie containing a list of user ids associated with their ratings for that movie. Before going any further, we must turn these data into something we can work with easily. As explained in the attached document, a convenient representation is the Utility Matrix, which gives, for each user-item pair, the degree of preference of that user for that item. We assume that the matrix is sparse, meaning that most entries are ‘unknown’. An unknown rating means that we have no explicit information about the user’s preference for the item.

The utility matrix has a size of (#users, #movies), in other words (480189, 17770). And the problems have already begun: the complete storage of the matrix requires a lot of memory. If the ratings are stored as 8-bit integers, the necessary memory would be:

480189 * 17770 / 10^9 ≈ 8.5 GB

Quite huge, isn’t it? However, this assumed that all entries are stored in the matrix. To avoid filling the memory with useless information, we can use a sparse matrix representation containing only the non-zero entries. By doing so, we can decrease the memory consumption to 1.779 GB.
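As a sanity check, the dense-storage figure can be reproduced in a few lines (the user and movie counts come from the dataset README):

```python
# Rough memory estimate for a dense Netflix utility matrix,
# using the user and movie counts from the dataset README.
n_users, n_movies = 480189, 17770

# One uint8 rating (1 byte) per cell:
dense_gb = n_users * n_movies / 1e9
print(round(dense_gb, 1))  # prints 8.5
```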

We developed the program ‘setup.py’, whose goal is to build the sparse matrix. The scipy.sparse package is used to represent the matrix as a Compressed Sparse Row matrix. The output should look like this:

Laterres-MacBook-Pro:python Alex$ python setup.py -i Dataset/ -o matrix.pickle
Import data -> 100% |##############################################################|Elapsed Time: 0:15:05 17770
Merge files -> 100% |#################################################################|Elapsed Time: 0:05:01 36
Mapping -> 100% |############################################################|Elapsed Time: 13:32:44 480189
Build the matrix ... DONE
Export the matrix ... DONE

Let me explain the different steps performed by the program. First of all, the file for each movie is read and appended into two vectors: one for the ratings and one storing the ids of the users who rated the movie. However, to avoid filling the memory with vectors growing at this rate, we stop the reading after every 500 movies and create a temporary file, which saves the vectors and frees the RAM. In this way, every 500 files read, we save the intermediate result to disk.

The second step is to merge those temporary files into a bigger one in order to perform the last step, the mapping. The README of the Netflix database quoted above reports that: “CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users”. There are gaps between UserIds, but we don’t want to create a matrix with 2649429 rows. Therefore, we must map the UserIds onto a smaller index range (between 0 and 480188). These two lines of code are actually the most time-consuming part, because the mapping must be done for all the ratings (~10^8 ratings).
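A minimal sketch of the mapping and matrix-building steps (the variable names are illustrative, not the ones used in setup.py):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for the merged vectors read from the movie files.
raw_user_ids = np.array([6, 2649429, 6, 305344])   # CustomerIDs, with gaps
movie_ids    = np.array([0, 0, 1, 2])              # 0-based movie indices
ratings      = np.array([5, 3, 4, 1], dtype=np.uint8)

# Map each gapped CustomerID onto a dense row index 0..#users-1.
unique_ids = np.unique(raw_user_ids)               # sorted, no gaps
id_map = {uid: row for row, uid in enumerate(unique_ids)}
rows = np.array([id_map[uid] for uid in raw_user_ids])

# Build the sparse utility matrix in CSR format.
matrix = csr_matrix((ratings, (rows, movie_ids)),
                    shape=(len(unique_ids), 3), dtype=np.uint8)
```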



After quite a long time, we can finally admire the utility matrix. It contains over 100 million ratings from 480 thousand users on 17770 movies.

>>> import cPickle as pickle
>>> import numpy as np
>>> import_name = 'matrix.pickle'
>>> with open(import_name, 'r') as file:
...     matrix = pickle.load(file)
>>> matrix.shape
(480189, 17770)
>>> matrix.nnz
100480507
>>> matrix.dtype
dtype('uint8')

After this pre-processing, we can finally begin talking about the recommendation systems. In the next sections, we will present a system based on collaborative filtering and another one based on latent factors. We chose not to cover Content-Based systems because that would have required an item profile for each movie (see slides.pdf).

Collaborative filtering
In this part, our goal is to propose movie recommendations to a particular user (called U) based on his previous ratings of other movies. To do such a thing, we based our implementation on the collaborative filtering idea: we first compute the similarity between U and every other user in order to select a group of the most similar users. Once we have this group, we check each movie seen by one of these users and keep those that are most liked by these similar users, in order to recommend them to U. Note that we could also use item-based collaborative filtering (searching for similar movies), but the process of computing the most similar items would have to be done for each movie, whereas here it only has to be done once, for the user to whom we want to give recommendations.

To confirm the importance of using sparse matrices from scipy, our first idea was to implement the algorithm with dictionaries that would store only the necessary information (see the code in ./Collaborative Filtering/Dictionaries). But the results were dreadfully slow! This is because Python dictionaries take up a lot of memory.

We then had the idea of using scipy’s sparse matrices as follows: to compute the cosine similarity between user U and all other users, we just compute the dot product $$ UM \cdot UM[U,:]^T $$, and we then divide each element i of the resulting vector by $$ \| UM[U,:] \| \| UM[i,:] \| $$. Since scipy offers different sparse matrix formats, we chose the CSR format (Compressed Sparse Row), which has two main advantages for us: fast matrix-vector products and efficient row slicing. Once the similarities are computed, we keep the n most similar users and look at the movies those users rated highly, while making sure that U has not seen these movies yet.
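On a toy matrix, the similarity computation described above can be sketched as follows (in the project, UM is the full 480189 x 17770 CSR utility matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy utility matrix UM (rows = users, columns = movies).
UM = csr_matrix(np.array([[5, 0, 3],
                          [4, 0, 3],
                          [0, 2, 0]], dtype=np.float64))
U = 0  # the user we want recommendations for

# One sparse matrix-vector product gives the dot product with every user.
dots = np.asarray(UM.dot(UM[U].T).todense()).ravel()

# Divide by the norms to obtain cosine similarities.
norms = np.sqrt(np.asarray(UM.multiply(UM).sum(axis=1)).ravel())
sims = dots / (norms[U] * norms)

order = np.argsort(sims)[::-1]  # most similar users first (includes U itself)
```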

The next improvement was to weight the contribution of each user for the movies he rated. This idea was not present in the reference book but nevertheless seems logical: the more similar a user is to U, the more his opinion matters. In fact, we simply used the cosine similarities as weights; we then just divide the ratings by the sum of the weights to normalize.

Speaking of normalizing, we added another option to work on the normalized matrix (we removed the average rating of each user; see the code in ./Collaborative Filtering/Sparse matrix/normalMatrix.py). Such a matrix is often used when working with the cosine similarity, as 0 (or unknown) values are treated as neutral in a normalized matrix, while they are treated as bad ratings in the basic utility matrix. In order to work with the normalized utility matrix, we first had to compute the rating average of each user and store it in a file.
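The normalization can be applied directly to the CSR data array, so that only the known ratings are touched (a sketch on a toy matrix; normalMatrix.py may differ in its details):

```python
import numpy as np
from scipy.sparse import csr_matrix

UM = csr_matrix(np.array([[5, 0, 3],
                          [4, 0, 1],
                          [0, 2, 0]], dtype=np.float64))

# Average over *rated* movies only: zeros are unknown, not bad ratings.
counts = np.diff(UM.indptr)                       # number of ratings per user
avg = np.asarray(UM.sum(axis=1)).ravel() / np.maximum(counts, 1)

# Subtract each user's average from his known ratings only.
norm = UM.copy()
norm.data -= np.repeat(avg, counts)
norm.eliminate_zeros()  # ratings equal to the average become 0 -> even sparser
```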

See the final code on ./Collaborative Filtering/Sparse matrix/collabFiltering.py for the algorithm, and the test script on ./Collaborative Filtering/Sparse matrix/recommender.py, where the parameters are available.

For example, running recommender.py gives the following output:

Top 5 movies user 6 may like
Nip/Tuck: Season 1: 5.921875
Yossi & Jagger: 5.921875
The Sopranos: Season 4: 5.11764818133
Raging Bull: 5.11564903456
The Sopranos: Season 2: 5.06049950683

Some computation times
Here n is the number of most similar neighbours to visit and m is the number of movies to recommend. For the dictionaries, we tested the algorithm with a smaller number of movies because of the slowness of the method ("all movies" refers to the 17770 movies of the dataset). We see that the use of dictionaries is much less efficient, especially the loading time. This is because dictionaries take a huge amount of memory compared to arrays, and even more compared to sparse matrices. The reason why loading the normalized matrix is faster is that it is sparser than the basic utility matrix: all values equal to the average of their respective user are set to 0. Finally, we observe that n has little influence on the computation time. As for m, we can predict in advance that it will not matter, because the algorithm collects all the movies seen by the n most similar users, then sorts them and keeps the m with the highest ranks.

Matrix Factorization
In this section, we will talk about a matrix decomposition called UV decomposition, a variant of the Singular Value Decomposition. We would like to find a matrix U with n rows and d columns, and a matrix V with d rows and m columns, such that UV closely approximates M in those entries where M is non-blank (using the 2-norm). But the UV decomposition, and more generally the SVD, is not defined when entries are missing. So we use an incremental algorithm, in which the matrices U and V are repeatedly adjusted to reduce the RMSE.

Update $$ u_{rs} $$:

$$ u_{rs} = \frac{\sum_j v_{sj} \left( m_{rj} - \sum_{k \neq s} u_{rk}v_{kj}\right)}{\sum_j v^2_{sj}} $$

Update $$ v_{rs} $$:

$$ v_{rs} = \frac{\sum_i u_{ir} \left( m_{is} - \sum_{k \neq r} u_{ik}v_{ks}\right)}{\sum_i u^2_{ir}} $$
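The update of $$ u_{rs} $$ translates almost literally into code once the sums over j are restricted to the non-blank entries of row r. Below is a minimal dense sketch (our UV_decomposition.py works on the sparse matrix rather than a boolean mask, and the update of V is symmetric):

```python
import numpy as np

def update_U(M, U, V, mask):
    """One pass of the u_rs updates; mask[i, j] is True where M is non-blank."""
    n, d = U.shape
    for r in range(n):
        cols = np.flatnonzero(mask[r])       # non-blank entries of row r
        if cols.size == 0:
            continue
        for s in range(d):
            v_s = V[s, cols]
            # prediction of row r without the contribution of u_rs
            pred = U[r] @ V[:, cols] - U[r, s] * v_s
            denom = v_s @ v_s
            if denom > 0:
                # exact minimizer of the squared error w.r.t. u_rs
                U[r, s] = v_s @ (M[r, cols] - pred) / denom
```

Since each element update exactly minimizes the squared error with respect to that element, one full pass can never increase the RMSE.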

There are three areas where we shall discuss the options:
 * Initializing U and V
 * Ordering the optimisation of the elements of U and V
 * Ending the attempt at optimisation

Initialization
Because of the existence of many local minima, we can run many different optimisations in the hope of reaching the global minimum on at least one run, for example by varying the initial values of U and V. A simple starting point is to give each element the same value. A good choice is the value which gives the elements of the product UV the average of the non-blank elements of M. If we have chosen d as the length of the short sides of U and V, and a is the average non-blank element of M, then the elements of U and V should be $$ \sqrt{a/d} $$. This simple method has been implemented in our algorithm. In practice, we perform the decomposition for several starting points (a number chosen by the user): we add to each element $$ \sqrt{a/d} $$ a normally distributed value with mean 0 and standard deviation 1. At the end, we keep the best approximation UV (the one with the smallest RMSE).
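The initialization described above, as a short sketch (the toy matrix M and all names are illustrative):

```python
import numpy as np

n, m, d = 100, 80, 10
rng = np.random.default_rng(0)

# Toy utility matrix: ~10% known ratings, the rest blank.
M = rng.integers(1, 6, size=(n, m)).astype(float)
mask = rng.random((n, m)) < 0.1

a = M[mask].mean()      # average of the non-blank elements of M
base = np.sqrt(a / d)   # so that (U V)_{ij} = d * base**2 = a initially

# Perturbed starting point for each restart: N(0, 1) noise around base.
U = base + rng.normal(0.0, 1.0, size=(n, d))
V = base + rng.normal(0.0, 1.0, size=(d, m))
```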

Performing the Optimization
In order to reach a local minimum from a given starting value of U and V, we need to pick an order in which we visit the elements of U and V. The simplest thing to do is to pick an order (in our case, row by row for U and column by column for V) and visit the elements in round-robin fashion.

Converging to a minimum
Ideally, at some point the RMSE becomes 0, and we know we can’t do better. In practice, however, since there are normally more nonblank elements in M than there are elements in U and V together, we have no right to expect to reduce the RMSE to 0. Thus we have to detect when there is little benefit in continuing. We track the improvement in the RMSE obtained in one round of optimisation (one update of U and V), and stop when that improvement falls below a threshold (by default 0.01, but it can be overridden by the user).

Execution
A complete run of the algorithm is shown below. It has been performed on a 1000x1000 matrix (a block of the real utility matrix), with a tolerance of 0.001, a bound of 5 on the number of iterations per run, 3 random starting points, and a reduction to dimension 50. (To run the program as we did, you may need to install the progress bar library for Python.)

Laterres-MacBook-Pro:python Alex$ python UV_decomposition.py -i matrix_001000.pickle -d 50 -n 5 -e 0.001 -m 3
******** IMPORT  DATA  ********
file imported -> matrix_001000.pickle
Number of nnz -> 9884
******** INITIALISATION ********
matrix U --> (m,n) (1000, 50), type :float32
matrix V --> (m,n) (50, 1000), type :float32
******* UV DECOMPOSITION *******
RMSE -> 7.880767
Update U -> 100% |########################################################################|Elapsed Time: 0:01:02 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:57 1000
RMSE -> 1.36777 (decrease 6.51299)
Update U -> 100% |########################################################################|Elapsed Time: 0:00:59 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:59 1000
RMSE -> 0.58260 (decrease 0.78517)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:01 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:56 1000
RMSE -> 0.37404 (decrease 0.20856)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:01 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:58 1000
RMSE -> 0.27825 (decrease 0.09580)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:01 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:56 1000
RMSE -> 0.22268 (decrease 0.05556)
******** EXPORT RESULTS ********
Export U ... DONE
Export U ... DONE
******** INITIALISATION ********
matrix U --> (m,n) (1000, 50), type :float32
matrix V --> (m,n) (50, 1000), type :float32
******* UV DECOMPOSITION *******
RMSE -> 8.032293
Update U -> 100% |########################################################################|Elapsed Time: 0:01:03 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:01:02 1000
RMSE -> 1.46150 (decrease 6.57079)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:03 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:01:00 1000
RMSE -> 0.60529 (decrease 0.85621)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:06 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:01:05 1000
RMSE -> 0.37446 (decrease 0.23082)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:03 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:59 1000
RMSE -> 0.27061 (decrease 0.10385)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:03 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:58 1000
RMSE -> 0.21073 (decrease 0.05988)
******** EXPORT RESULTS ********
Export U ... DONE
Export U ... DONE
******** INITIALISATION ********
matrix U --> (m,n) (1000, 50), type :float32
matrix V --> (m,n) (50, 1000), type :float32
******* UV DECOMPOSITION *******
RMSE -> 7.957307
Update U -> 100% |########################################################################|Elapsed Time: 0:01:02 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:59 1000
RMSE -> 1.40083 (decrease 6.55648)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:03 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:01:07 1000
RMSE -> 0.59019 (decrease 0.81064)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:05 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:59 1000
RMSE -> 0.37058 (decrease 0.21961)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:02 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:58 1000
RMSE -> 0.26892 (decrease 0.10165)
Update U -> 100% |########################################################################|Elapsed Time: 0:01:00 1000
Update V -> 100% |########################################################################|Elapsed Time: 0:00:58 1000
RMSE -> 0.21048 (decrease 0.05844)
******** EXPORT RESULTS ********
Export U ... DONE
Export U ... DONE

Recommendations
Once the approximation by the matrices U and V has been computed, we can very easily give recommendations to a user u by taking the product of row u of U with the matrix V. In that way we obtain all the predicted ratings, and the best of them (highest ratings) are recommended to user u.
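This recommendation step reduces to a single matrix-vector product. A sketch with hypothetical names (our recommendationSystem.py additionally maps movie indices back to titles):

```python
import numpy as np

def get_recommendations(U, V, user, top_k, already_rated):
    """Predict ratings for `user` as U[user] @ V and return the top_k
    movie indices among the movies the user has not rated yet."""
    scores = U[user] @ V
    scores[already_rated] = -np.inf        # never recommend a seen movie
    best = np.argsort(scores)[::-1][:top_k]
    return best, scores[best]
```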

To do so, import the file recommendationSystem.py. The matrices U and V will be brought into memory, and we can get the recommendations we want by calling getRecommendation(userId, topK). Here is an example:

>>> import recommendationSystem as rs
******** IMPORT  DATA  ********
file imported ---> U.pickle and V.pickle
Size Utility Matrix -> (1000,1000)
Dim Approximation --> 100
>>> userId = 43
>>> nbrRecommendations = 15
>>> rs.getRecommendation(userId,nbrRecommendations)
predicted rating : 5.66 -> movie : The Bedford Incident
predicted rating : 5.66 -> movie : Magnolia: Bonus Material
predicted rating : 5.65 -> movie : Gloria Estefan: Don't Stop!
predicted rating : 5.34 -> movie : Agatha Christie's Poirot: Sad Cypress
predicted rating : 5.32 -> movie : Popular: Season 2
predicted rating : 5.19 -> movie : Monarch of the Glen: Series 2
predicted rating : 5.09 -> movie : Arliss: The Best of Arliss
predicted rating : 5.03 -> movie : Dario Argento Collection: Vol. 2: Demons 2
predicted rating : 4.97 -> movie : Nightwalker
predicted rating : 4.93 -> movie : Simple Men
predicted rating : 4.77 -> movie : Foyle's War: Set 2
predicted rating : 4.75 -> movie : Crunch: Pick Your Spot Pilates
predicted rating : 4.72 -> movie : The Trouble with Angels
predicted rating : 4.71 -> movie : Oasis
predicted rating : 4.70 -> movie : Cinderfella

Root Mean Square Error


As we see with some relief on the plot, the RMSE decreases at each iteration (each update of U or V). In practice, since there are normally more nonblank elements in M than there are elements in U and V together, we have no right to expect to reduce the RMSE to 0. However, we observe that if we increase the dimensionality of the reduction (parameter d), the RMSE becomes smaller, as expected.

Drawback
That sounds good! But how long does it take to reach such precision? The main drawback of this algorithm is its computation time. As we can see on the plot below, the bigger the matrix, the longer one update of U and V takes. Two interesting facts can be deduced from the plot. First, the complexity is linear in the dimension d used. Secondly, the complexity in the size of the matrix is quadratic. Although the theoretical global complexity is $$ O(mnd^2) $$, in practice we obtain a complexity closer to $$ O(mnd) $$, most likely because we do not perform every computation, only those for which values are needed, i.e. the non-blank entries of the utility matrix.



Since we would like to use this algorithm for Big Data applications, we have to find a viable way around this inconvenience. The first idea is to use parallelism. Since even laptops nowadays have 4 or more cores, we would like to use them all for our decomposition. The updates of the rows of U are totally independent, and in the same way the updates of the columns of V are independent. Therefore, we built a multi-threaded version of the algorithm to exploit this feature. The elapsed time for one update of U and V, with d equal to 50 and a utility matrix of size [1000 x 1000], is shown below.

As we can see on the plot, this wasn’t really an efficient idea: the multi-threaded implementation is worse than our first one. Why? Simply because the bottleneck of the algorithm isn’t the complexity of the method but the time required to access memory. Consequently, we gave up this approach. Another way to reduce the time consumption would be a kind of MapReduce: we could work on a piece of the utility matrix and, once the computation is done, put it back and continue with another piece. Although it is a promising way to put the UV decomposition into practice for large databases, we did not implement it.

Overfitting
When there is too much freedom (too many free parameters), the model starts fitting noise. It fits the training data too well but does not generalize to unseen test data: the model loses its generalization ability. We notice this phenomenon in the next example. In the first run of the program, the tolerance (-e) is quite large, so the iteration stops quickly. In the second run, the tolerance is much smaller. Although the RMSE in the second case is much smaller at the end, the results obtained are not necessarily better: the returned values are clearly out of the bounds 1-5. This phenomenon is reminiscent of the Runge phenomenon, which occurs when we try to approximate a function with a high-degree polynomial.

Laterres-MacBook-Pro:UVdecomposition Alex$ python UV_decomposition.py -i matrix_000100.pickle -n 5 -m 1 -e 0.01 -d 10
******** IMPORT  DATA  ********
file imported -> matrix_000100.pickle
Number of nnz -> 62
******** INITIALISATION ********
matrix U --> (m,n) (100, 10), type :float32
matrix V --> (m,n) (10, 100), type :float32
******* UV DECOMPOSITION *******
RMSE -> 4.854078
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.24624 (decrease : 4.60784)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.04121 (decrease : 0.20503)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00976 (decrease : 0.03145)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00254 (decrease : 0.00722)
******** EXPORT RESULTS ********
Export U ... DONE
Export U ... DONE

>>> import recommendationSystem as rs
******** IMPORT  DATA  ********
file imported ---> U.pickle and V.pickle
Size Utility Matrix -> (100,100)
Dim Approximation --> 10
>>> rs.getRecommendation(10,10)
predicted rating : 7.66 -> movie : 6ixtynin9
predicted rating : 3.71 -> movie : G3: Live in Concert
predicted rating : 3.35 -> movie : The Killing
predicted rating : 3.35 -> movie : Outside the Law
predicted rating : 3.30 -> movie : The Libertine
predicted rating : 3.18 -> movie : Pitcher and the Pin-Up
predicted rating : 3.17 -> movie : Richard III
predicted rating : 2.89 -> movie : A Yank in the R.A.F.
predicted rating : 2.86 -> movie : Jonah: A VeggieTales Movie: Bonus Material
predicted rating : 2.74 -> movie : Horror Vision

Laterres-MacBook-Pro:UVdecomposition Alex$ python UV_decomposition.py -i matrix_000100.pickle -n 10 -m 1 -e 0.0000001 -d 10
******** IMPORT  DATA  ********
file imported -> matrix_000100.pickle
Number of nnz -> 62
******** INITIALISATION ********
matrix U --> (m,n) (100, 10), type :float32
matrix V --> (m,n) (10, 100), type :float32
******* UV DECOMPOSITION *******
RMSE -> 5.482015
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.18549 (decrease : 5.29652)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.01613 (decrease : 0.16936)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00155 (decrease : 0.01459)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00017 (decrease : 0.00138)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00002 (decrease : 0.00015)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00000 (decrease : 0.00002)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00000 (decrease : 0.00000)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00000 (decrease : 0.00000)
Update U -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
Update V -> 100% |###############################################################################|Elapsed Time: 0:00:00 100
RMSE -> 0.00000 (decrease : 0.00000)
******** EXPORT RESULTS ********
Export U ... DONE
Export U ... DONE

>>> import recommendationSystem as rs
******** IMPORT  DATA  ********
file imported ---> U.pickle and V.pickle
Size Utility Matrix -> (100,100)
Dim Approximation --> 10
>>> rs.getRecommendation(10,10)
predicted rating : 21.15 -> movie : Louder Than Bombs
predicted rating : 20.35 -> movie : Classic Albums: Meat Loaf: Bat Out of Hell
predicted rating : 18.96 -> movie : Paula Abdul's Get Up & Dance
predicted rating : 18.52 -> movie : What the
predicted rating : 16.04 -> movie : Chump Change
predicted rating : 15.12 -> movie : Jonah: A VeggieTales Movie: Bonus Material
predicted rating : 14.43 -> movie : The Powerpuff Girls Movie
predicted rating : 13.50 -> movie : The Battle of Algiers: Bonus Material
predicted rating : 13.04 -> movie : G3: Live in Concert
predicted rating : 11.58 -> movie : Rudolph the Red-Nosed Reindeer

To combat overfitting, we can introduce a regularisation term, as explained during the presentation, of the form:

$$ \min\limits_{U,V} \sum_{(i,j) \in \mathcal{R}} \left( m_{ij} - u_i v_j^T \right)^2 + \lambda \left( \sum_i \| u_i \| ^2 + \sum_j \| v_j \|^2 \right) $$

This approach is left as a further improvement of the method. To solve this kind of optimization problem, we can use gradient descent (or stochastic gradient descent, as proposed in the reference).
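Not implemented in our project, but a sketch of how stochastic gradient descent on the regularised objective could look (the learning rate and all names are illustrative):

```python
import numpy as np

def sgd_epoch(entries, U, V, lam=0.05, lr=0.01):
    """One stochastic-gradient pass over the known ratings.

    entries: iterable of (i, j, rating) triples, i.e. the set of
    non-blank cells of the utility matrix. Updates U (n x d) and
    V (d x m) in place.
    """
    for i, j, r in entries:
        err = r - U[i] @ V[:, j]
        u_old = U[i].copy()
        # gradient step on (m_ij - u_i v_j)^2 + lambda(|u_i|^2 + |v_j|^2)
        U[i]    += lr * (err * V[:, j] - lam * U[i])
        V[:, j] += lr * (err * u_old   - lam * V[:, j])
```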

Conclusion
In conclusion, the first part of our work dealt with the representation of the Netflix dataset. With sparse matrices from scipy, we were able to represent the utility matrix in about 1.7 GB. In this preprocessing step, we also created a file mapping the real user ids (from 1 to 2649429, with gaps) to ids without gaps (from 0 to 480188), as well as a file linking movie ids to their titles.

Regarding the collaborative filtering algorithm, we were able to significantly speed up the computation through the benefits of CSR (Compressed Sparse Row) matrices, which allow efficient matrix-vector products, useful for computing the cosine similarity between a user and all the others. From there, we got a quick and efficient method to recommend movies to a particular user. We also implemented some improvements, such as weighting the recommended ratings (with weights equal to the similarities of the other users), and we worked on a normalized matrix that treats unknown values as neutral and also makes the matrix even sparser, which reduces its memory footprint and loading time.

Regarding matrix factorization, we implemented the basic algorithm that iteratively updates the components of the matrices U and V. There are still possible improvements: as implemented, this algorithm does not work on very large matrices (large computation time and memory usage). Moreover, the overfitting phenomenon can be an inconvenience, and including some regularization would be a good idea for future improvements. However, our implementation already works on matrices of reasonable size, provided we do not make too many iterations. The advantage of this algorithm is that it can estimate the ratings of the entire utility matrix (unlike the collaborative filtering algorithm, which recommends movies to a single user).