Recommendation systems are a must for most businesses today. Not only can they surface upsell opportunities, but they can also identify hot prospects based on characteristics of an existing customer base. It’s an exciting tool for businesses everywhere to deploy. Below is my journey through working with a team to build our first movie recommendation model.
Loading and Cleaning Data
First things first: loading and cleaning the data. We started with five different sets: three from the MovieLens dataset published by the GroupLens research lab at the University of Minnesota, and two that we pulled down from the IMDB website. We selected the small MovieLens dataset, which contains over 100,000 ratings and 3,600 tag applications applied to over 9,000 movies by 600 users; it was last updated in September 2018. Below is a sample of the combined MovieLens dataset:
Next, we looked at the IMDB data. Reading these files was a little tricky because they were saved as .tsv.gz files, so we had to specify both the compression used (compression='gzip') and the field separator (sep='\t'). These files each had over 7 million rows, but we only needed the rows matching the 100,000-rating MovieLens dataset. Before joining them, we had to clean up a couple of things on the IMDB side, starting with the column used to match up the data. After renaming the columns on the IMDB data, reformatting the values in the join column, joining to the MovieLens dataset, and keeping only the columns used in our analysis, here is the updated view:
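A minimal sketch of this read-and-reformat step, using a tiny synthetic file in place of the real IMDB dumps (the file name, column names, and the 'tt'-prefix stripping shown here are illustrative assumptions, not the project's exact code):

```python
import gzip
import pandas as pd

# Write a tiny synthetic .tsv.gz so the read options can be demonstrated;
# the real IMDB dumps (e.g. title.basics.tsv.gz) are read the same way.
with gzip.open('sample.tsv.gz', 'wt') as f:
    f.write('tconst\tprimaryTitle\truntimeMinutes\n')
    f.write('tt0111161\tThe Shawshank Redemption\t142\n')

# The two options called out above: gzip compression and tab separation
imdb = pd.read_csv('sample.tsv.gz', compression='gzip', sep='\t')

# Rename the join key and strip the 'tt' prefix so it matches the integer
# imdbId column in the MovieLens links table (column names assumed).
imdb = imdb.rename(columns={'tconst': 'imdbId'})
imdb['imdbId'] = imdb['imdbId'].str.lstrip('t').astype(int)

# The join to MovieLens would then be something like:
# combined = movielens.merge(imdb, on='imdbId', how='inner')
```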
Next up, EDA! To better understand the data, we looked closely at the data set to figure out what we needed to do to prepare for modeling. First, we created dummy variables for the genres to make it easier to generate visualizations. The first question we wanted to answer was about rating bias: were some genres rated more highly than others? So we created a bar chart showing the median rating per genre. Take a look:
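A sketch of how the dummy variables and the per-genre medians behind such a chart could be produced, using a toy frame in place of the combined dataset (the column names and pipe-separated genre format are assumptions based on the MovieLens layout):

```python
import pandas as pd

# Toy stand-in for the combined dataset; column names are assumptions
df = pd.DataFrame({
    'rating': [4.0, 3.5, 5.0, 2.0],
    'genres': ['Comedy|Drama', 'Drama', 'Action|Drama', 'Action'],
})

# One dummy column per genre from the pipe-separated genres field
dummies = df['genres'].str.get_dummies(sep='|')
df = pd.concat([df, dummies], axis=1)

# Median rating per genre: select each genre's rows via its dummy column
medians = {g: df.loc[df[g] == 1, 'rating'].median() for g in dummies.columns}
# pd.Series(medians).plot(kind='bar') would then render the bar chart
```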
As you can see, there doesn’t seem to be much variation in the median rating per genre, with all genres having a median rating of around 3.5–4. That is good news, as we will not have to deal with genre bias in the data.
It then made sense to check whether any movie genres dominated the higher ratings, but we found that to be a non-issue as well.
Another question we wanted to answer: is there a pattern in runtime for high- or low-rated movies? So we created a scatter plot to find out.
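One way to sketch this check, with toy data standing in for the merged runtimeMinutes and average-rating columns (the column names are assumptions); a correlation coefficient gives a quick numeric read alongside the scatter plot:

```python
import pandas as pd

# Toy data standing in for the merged dataset's runtime and rating columns
df = pd.DataFrame({
    'runtimeMinutes': [90, 120, 150, 85, 110],
    'avg_rating': [3.5, 4.0, 3.8, 3.6, 3.9],
})

# df.plot.scatter(x='runtimeMinutes', y='avg_rating') draws the plot;
# the Pearson correlation summarizes the same relationship numerically
corr = df['runtimeMinutes'].corr(df['avg_rating'])
```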
There does not appear to be any distinct trend for ratings based on runtime minutes.
We also explored the distribution of movie release years in the data.
Movies made between 1990 and 2010 make up the majority of the movies in this dataset. Here are the top 20:
['Shawshank Redemption, The (1994)',
'Pulp Fiction (1994)',
'Forrest Gump (1994)',
'Matrix, The (1999)',
'Star Wars: Episode IV - A New Hope (1977)',
'Silence of the Lambs, The (1991)',
"Schindler's List (1993)",
'Godfather, The (1972)',
'Fight Club (1999)',
'Star Wars: Episode V - The Empire Strikes Back (1980)',
'Usual Suspects, The (1995)',
'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
'American Beauty (1999)',
'Star Wars: Episode VI - Return of the Jedi (1983)',
'Terminator 2: Judgment Day (1991)',
'Lord of the Rings: The Fellowship of the Ring, The (2001)',
'Saving Private Ryan (1998)',
'Princess Bride, The (1987)']
We did discover that over 6,000 movies had fewer than 5 ratings, which makes their average ratings less reliable. Since removing these would shrink the dataset tremendously, we decided to leave them in. We then looked at the relationship between the number of ratings a movie received and its average rating.
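The sparse-ratings check amounts to a groupby count per movie; a minimal sketch with a toy ratings frame (column names follow the MovieLens convention, the data is illustrative):

```python
import pandas as pd

# Toy ratings frame; the real check counted ratings per movieId
ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 1, 2, 2, 3],
    'rating':  [4, 5, 3, 4, 4, 2, 3, 5],
})

# Number of ratings per movie, then filter to the sparse ones
counts = ratings.groupby('movieId')['rating'].count()
sparse = counts[counts < 5]  # movies with fewer than 5 ratings
```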
Of the 15 most-rated movies, nearly all outperformed the average rating of the dataset as a whole. As a result, the model may recommend them more often simply because more people enjoyed them.
Modeling and Function
Next up, modeling! Using the Surprise library, we started with a memory-based baseline (KNNBasic) and then implemented an SVD model for the recommendation system. As usual, we began by splitting the data into training and testing sets. We then fit a KNNBasic model to get our first RMSE. The baseline RMSE was 0.98506, meaning that on average the model’s rating predictions are approximately 1 point off (on a scale of 0.5–5). We then tried to improve on that with an SVD model: using the default SVD parameters, we landed on an RMSE of 0.86982. That beat the baseline, so the next step was a grid search to identify optimal SVD parameters in hopes of improving the RMSE further. The new model, using the grid-search parameters, nudged the RMSE down to 0.85677, so we used it as our final model.
Lastly, and most fun of all, our recommendation function. This is the fun part of the project: it asks a new user to rate 5 movies from specific genres and, based on those ratings, presents 5 recommendations for that user.
Conclusion and Future Work
In conclusion, the final model we selected is not a perfect fit given its RMSE of 0.87, but that error is smaller than the standard deviation of the original ratings (1.04). Recommendations appear to be more accurate when a genre is specified, so genres should be included when the model is put into use.
Future work that could improve the RMSE includes:
- Obtaining more ratings for movies that lack a sufficient number, or considering removing those movies from the dataset
- Investigating approaches to deal with popularity bias to increase the representation of less popular movies (consider including weights in the model)
- Creating a more robust model with LightFM by incorporating movie features into weighting
- Calculating similarity metrics between recommended movies and highest rated movies to better validate recommendations