Having understood the basics of recommendation systems, you will now implement a recommendation algorithm in Python. You can use the MovieLens dataset to run this algorithm. MovieLens is a non-commercial, personalised movie recommendation platform that helps you find movies you will like. When you log into the platform, it asks you to rate movies to build a custom taste profile, after which MovieLens recommends other movies for you to watch. Thus, as you may have guessed, it uses collaborative filtering to generate recommendations. In this notebook, you will implement both user-based and item-based recommendation.
You can download the Python code for this lab from the link below.
As you saw, we first convert the 'ratings' data into tabular form using a pivot table, with 'userId' as rows and 'movieId' as columns. We will use this table to produce both user-based and item-based recommendations by computing correlations between its rows (users) or its columns (movies). Next, we will form a dummy train and a dummy test, which will be used during the prediction phase. Let's see the values in dummy train and dummy test.
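The pivot step can be sketched on a tiny made-up table. The rating values below are illustrative only, and filling unrated cells with 0 is an assumption made for this sketch rather than something confirmed by the lab code.

```python
import pandas as pd

# Toy stand-in for the 'ratings' data (values are made up for illustration).
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Rows are users and columns are movies. Pairs with no rating come out as NaN;
# replacing them with 0 is an assumption of this sketch.
df_movie_features = ratings.pivot(
    index="userId", columns="movieId", values="rating"
).fillna(0)

print(df_movie_features)
```

Each row of the resulting table is one user's rating vector, which is the form needed for computing user-user similarity; transposing it gives one vector per movie for the item-based case.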
As you saw, the dummy train contains '0' wherever the user has already rated a movie and '1' everywhere else. This is done because you do not want to recommend movies that the user has already rated. Marking those values '0' retains only the movies that the user has not rated in the 'df_movie_features' table. You will see how 'dummy_train' is used later. Similarly, we want to evaluate our model only on the movies rated by the user, so 'dummy_test' is the opposite of 'dummy_train' in terms of 0's and 1's: it contains '1' for rated movies and '0' otherwise.
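A minimal sketch of the two masks follows, assuming that a 0 in the pivot table means "not rated". For brevity, both masks are built from one small made-up table here; in practice each would be derived from its own split of the data.

```python
import pandas as pd

# Toy user x movie table; 0 is assumed to mean "not rated".
df_movie_features = pd.DataFrame(
    [[4.0, 3.0, 0.0],
     [5.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

rated = df_movie_features > 0

# dummy_train: 0 where the user has already rated the movie, 1 otherwise.
dummy_train = (~rated).astype(int)

# dummy_test: the opposite coding, 1 where the user has rated the movie.
dummy_test = rated.astype(int)

print(dummy_train)
print(dummy_test)
```

Multiplying a matrix of predicted ratings element-wise by a mask of this kind zeroes out every cell the mask marks with 0 and leaves the rest unchanged.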
You also saw that adjusted cosine similarity is used for user-user similarity, because different users rate on different scales. One user may rate the worst to best movies in the range of 1-3 stars, whereas another user rates the same set of movies in the range of 1-5 stars. To bring both users on par, adjusted cosine similarity normalises the ratings by subtracting each user's mean rating before the cosine similarity is computed. Now, let's find the correlation for user-based similarity.
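The effect of the normalisation can be illustrated with a small sketch. The users, movies and ratings below are made up, and the variable names are hypothetical; unrated movies are stored as NaN so that they do not distort each user's mean.

```python
import numpy as np
import pandas as pd

# u1 rates on a 1-3 scale and u2 on a 1-5 scale, but both rank the movies alike.
ratings = pd.DataFrame(
    [[1.0, 2.0, 3.0, np.nan],
     [1.0, 3.0, 5.0, np.nan],
     [5.0, np.nan, 1.0, 4.0]],
    index=["u1", "u2", "u3"],
    columns=["m1", "m2", "m3", "m4"],
)

# Subtract each user's mean rating (NaN entries are skipped by mean()),
# then treat unrated movies as 0, i.e. "no deviation from the mean".
user_mean = ratings.mean(axis=1)
centred = ratings.sub(user_mean, axis=0).fillna(0)

# Cosine similarity between the mean-centred rows.
# Note: a user whose ratings are all identical would have a zero norm here.
norms = np.sqrt((centred ** 2).sum(axis=1)).to_numpy()
similarity = (centred.to_numpy() @ centred.to_numpy().T) / np.outer(norms, norms)
user_correlation = pd.DataFrame(
    similarity, index=ratings.index, columns=ratings.index
)

print(user_correlation.round(3))
```

After mean-centring, u1 and u2 come out as perfectly similar even though their raw scales differ, while u3, whose preferences run the other way, gets a negative similarity with both.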
As you can see, the training data set has 862 users and 2,500 movies, so the correlation matrix for user-based similarity has the shape 862 x 862. This matrix indicates how closely each user is related to every other user. Multiplying the correlation matrix by 'movie_feature' (the ratings given by the users), which has the shape 862 x 2500, gives the user-based recommender matrix of shape 862 x 2500. Each entry of this matrix is a similarity-weighted sum of the ratings that all users gave to a movie, so it contains a user-based score for every movie, including the movies the user has not rated. Thus, you can find the recommendations for a user by filtering out the movies that the user has already rated (using 'dummy_train').
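The shape arithmetic and the filtering step can be sketched with 3 users and 4 movies standing in for 862 and 2,500. All values and variable names below are made up for illustration, and applying 'dummy_train' by element-wise multiplication is an assumption of this sketch.

```python
import numpy as np

# Toy 3 x 3 user-user correlation matrix (stand-in for the 862 x 862 one).
user_correlation = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Toy 3 x 4 user x movie ratings (stand-in for 862 x 2500); 0 means "not rated".
movie_features = np.array([
    [4.0, 0.0, 0.0, 5.0],
    [0.0, 3.0, 0.0, 4.0],
    [2.0, 0.0, 5.0, 0.0],
])

# 0 where the user has already rated the movie, 1 otherwise.
dummy_train = (movie_features == 0).astype(int)

# (3 x 3) @ (3 x 4) -> (3 x 4): similarity-weighted sum of ratings per movie.
user_predicted_ratings = user_correlation @ movie_features

# Zero out the movies each user has already rated.
user_final_rating = user_predicted_ratings * dummy_train

# Index of the highest-scoring unseen movie for the first user.
top_movie_for_user0 = int(np.argmax(user_final_rating[0]))
print(user_final_rating)
print(top_movie_for_user0)
```

Sorting each row of the filtered matrix in descending order would give a ranked list of unseen movies for that user.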