1 billion dollars per year. That’s how much Netflix’s Chief Product Officer Neil Hunt estimates the company saves per year thanks to their global recommendation system. No wonder you’ve found yourself searching for how to build a recommendation engine in R! They’re valuable commodities!
From tech giants like Netflix to Amazon to YouTube, enterprises all over the world are recognizing the importance of recommendation engines in order to keep their customer base engaged and their conversions high. And they’re looking for data professionals like you to build them.
Here at Data Mania, we help give data pros a leg up by helping them make (and save) money for the corporations they serve, so they can advance their data career and get the promotion and raise they deserve.
Let’s get started today with helping you build a recommendation engine in R that’s sure to help your company’s customer retention and profitability. First I’m going to provide you a conceptual overview of the topic, and then I’ll show you the exact steps on how to build a recommendation engine in R.
Make sure to stick around to the end for another show-stopper of a recommendation on how to get ahead FAST in your data science career.
Ready? Go.
KEY CONCEPTS RELATED TO RECOMMENDATION SYSTEMS
Before showing you how to build a recommendation engine in R, let’s get you up-to-speed on the concepts behind how recommendation engines work.
What’s a recommendation engine?
If you’re a developer who’s here just to see how the code works… I understand! Skip the intro and go straight to the section “HOW TO BUILD A RECOMMENDATION ENGINE IN R”, where I show you the code.
In case you’re a total newbie to marketing data science, let’s get a little clearer on the concepts of recommendation engines and how they’re used.
Let’s take Amazon as an example. Every time you go buy something on Amazon, under the product you’ll see the heading ‘People Who Purchased This Item Also Purchased…’ (or something along those lines) with a selection of products underneath. Those recommendations are made automatically by a decision engine that sits on the backend of the platform. Today, you’re going to learn the exact steps to understand how to build an engine that functions the same way.
Before getting into the nitty-gritty about how recommendation engines work, let’s first take a step back and refresh our memory about what exactly a recommendation engine is.
In essence, a recommendation engine is an automated decision engine that evaluates similarities between people (i.e., “users”) and/or items in order to make recommendations about what items go well together.
The underlying methods behind recommendation engines can be used for a variety of applications, but the most common application is e-commerce. In this application, the recommendation engine identifies items that users have a high propensity to consume, and recommends those items to only the most appropriate users.
When it comes to marketing science, recommendation systems have been a breathtaking disruption to traditional cross-selling strategies.
They’ve allowed us to significantly drive conversion rates up by automating the identification and recommendation of related products. In ecommerce this represents a true win-win, where buyers are satisfied because they get an ideal combination of products, and sellers are happy because they enjoy more sales and a higher ROI. What’s not to love?! 😉
The go-to case study: Netflix movie recommendations
The go-to case study for recommendation engines is the Netflix recommender that I mentioned above. In fact, Netflix runs many layers of recommendations, each operating according to its own unique set of instructions. But it wasn’t until 2009 that Netflix really broke ground with its recommendations, when its open competition, the Netflix Prize, came to a close.
In the competition, participants were asked to predict user ratings for new films by using previous user rating data for films they’d already seen.
The grand prize would go to the first team whose predictions beat the accuracy of Netflix’s existing algorithm by at least 10%.
In the end, a team developed a recommendation algorithm that performed 10% better than Netflix’s existing algorithm, bagging them $1 million in cash (not bad, right?! 💥)… just to give you a general idea of how much these algorithms are worth to Netflix.
First things first: understanding collaborative filtering
Many recommendation engines use collaborative filtering. Like the name suggests, collaborative filtering uses data from other people (or “users” on the platform) to make its predictions. Collaborative filtering can work a few different ways.
One possible way to use a collaborative filtering algorithm could be to ‘filter’ similar purchases users made in the past to generate and then recommend a list of items that go well together in combination. In this example, items that are not frequently purchased together would be excluded from the list, and the engine would make recommendations from a final set of items that have a history of being purchased together.
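To make the “purchased together” idea concrete, here’s a minimal sketch in base R. The orders, item names, the `recommend_with()` helper and the `min_count` threshold are all made up for illustration; a production engine would work from far more data and more careful logic.

```r
# Toy purchase history: rows are orders, columns are items (1 = item was in the order).
# The item names and data are made up for illustration.
orders <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    1, 1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("tent", "sleeping_bag", "stove", "lantern"))
)

# crossprod(orders) is t(orders) %*% orders: entry [i, j] counts the orders
# that contain both item i and item j.
co_counts <- crossprod(orders)
diag(co_counts) <- 0  # an item is not a recommendation for itself

# Hypothetical helper: items bought together with `item` in at least `min_count` orders.
recommend_with <- function(item, min_count = 2) {
  counts <- co_counts[item, ]
  counts <- counts[counts >= min_count]
  names(sort(counts, decreasing = TRUE))
}

recommend_with("tent")
```

Items that rarely show up in the same order as the tent (the stove and the lantern, in this toy data) fall below the threshold and get filtered out of the list.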
2 Types of Collaborative Filtering Algorithms – User-based collaborative filtering and Item-Based Collaborative Filtering.
I’ll define these in terms of movie recommendation systems, using Netflix again as our trusty example.
A Screen Grab From My LinkedIn Learning Course:
Building a Recommendation System with Python
User-based collaborative filtering systems:
A user-based recommendation engine recommends movies based on what other users with similar profiles have watched and liked in the past. As an example of a user-based recommender, imagine there’s a big movie buff who loves watching movies regularly, usually every Friday evening. He’s an unmarried man and a working professional. A user-based recommender would go in and look up movie recommendations based on what other unmarried, professional men who watch movies regularly have liked. In practice, “similar” is usually measured from users’ viewing and rating history rather than from demographics alone.
Item-based collaborative filtering systems:
An item-based recommender would make recommendations based on similarities between movies; in other words, it would recommend movies that are similar to ones a user already likes. Say you watched the movie ‘Kung Fu Panda’ and you liked it so much you gave it five stars. An item-based collaborative filtering system would then look into similar movies from the same genre (perhaps animated, fighting, comedy or films with a similar storyline) and then recommend similar movies based on the preference you displayed when giving ‘Kung Fu Panda’ five stars.
In fact, item-based collaborative filtering systems can even make recommendations based on any variety of common elements, like movies about pandas, movies from the same producers, directors, etc… the possibilities are truly endless! In the case of Kung Fu Panda, it’s most likely that ‘Kung Fu Panda 2’ and ‘Kung Fu Panda 3’ will be suggested to the user first, followed by other similar titles.
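One common way to decide which movies are “similar” is to compare how the same users rated them. Here’s a small sketch of that idea in base R. The ratings are made up, and `item_cosine()` is a hypothetical helper written just for this example, not a function from any package.

```r
# Toy ratings: rows are users, columns are movies, NA = not rated. Data is made up.
ratings <- matrix(
  c(5, 5, 1,
    4, 5, 2,
    1, 2, 5,
    5, 4, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4),
                  c("Kung Fu Panda", "Kung Fu Panda 2", "Heat"))
)

# Cosine similarity between two movies, using only users who rated both.
item_cosine <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Compare the movie the user liked against every other movie.
liked <- "Kung Fu Panda"
others <- setdiff(colnames(ratings), liked)
sims <- sapply(others, function(m) item_cosine(ratings[, liked], ratings[, m]))

# Most similar movies first.
sort(sims, decreasing = TRUE)
```

In this toy data the sequel comes out on top because the people who loved the first film also loved the second, which is the intuition behind item-based recommendations.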
If only life were that simple
Now that you understand the basics about collaborative filtering algorithms, let’s go ahead and add a little complexity to the discussion. Don’t worry, I’m going to be showing you how to build a recommendation engine in R very soon!
If you really think about it, a user should have to do more than simply watch a movie in order for the film to qualify as being recommendable to other users. After all, the user may have seen the movie and absolutely hated it! If that were the case, recommending the movie to similar users could be a potentially terrible idea.
Instead of just looking at how many times a movie was viewed, we’ve actually got to take into account the rating each user gave the movie (aka “movie ratings data”).
By doing this, we can see what movies similar users have enjoyed and use that data to filter our movie recommendations accordingly. Now, the recommendation will only include the movies which are rated highly by other similar users.
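Here’s a rough sketch of that filtering step in base R. It assumes we’ve already worked out which users are similar to our target user; the ratings, the `already_seen` vector and the cut-off of 4 stars are all hypothetical choices made for this example.

```r
# Ratings from users already judged similar to our target user (made-up data).
similar_ratings <- matrix(
  c(5, 2, NA, 4,
    4, 1, 5, NA,
    5, NA, 4, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("neighbor", 1:3), paste0("movie", 1:4))
)
already_seen <- c("movie1")  # movies the target user has watched

# Average rating each movie received from the similar users, ignoring NAs.
avg_rating <- colMeans(similar_ratings, na.rm = TRUE)

# Keep unseen movies whose average rating clears a (hypothetical) threshold of 4.
candidates <- avg_rating[!(names(avg_rating) %in% already_seen)]
recommended <- names(candidates[candidates >= 4])
recommended
```

Notice that movie2 was watched by two of the three similar users, but they rated it poorly, so it never makes the list. View counts alone would have recommended it.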
Real-life recommenders that are in production on ecommerce platforms are usually quite complex. They almost always hybridize the two collaborative techniques we’ve discussed above. These recommendation engines may, for example, suggest a movie based on what other users with similar profiles have enjoyed, and then further order the recommendations based on how similar those movies are to the movie you last watched. My point here is that every type of recommendation engine has its own utility in different situations, so deciding on the best logic to use requires data scientists and machine learning engineers to apply solid reasoning and sound strategy when planning initiatives.
Where machine learning fits in
Both recommendation methods we discussed above (user-based and item-based collaborative filtering systems) can use clustering as the backbone, although there are other machine learning algorithms that may be better suited for the job depending on your project requirements.
Clustering algorithms allow you to group users and items based on similarity, so these are an easy fit when building a recommendation engine.
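As a quick illustration, here’s a sketch that uses base R’s `kmeans()` to group users with similar taste. The ratings are made up and have no missing values (real ratings data would need the gaps handled first), and the choice of two clusters is an assumption for this toy example.

```r
set.seed(42)  # k-means starts from random centers

# Made-up ratings: users 1-3 like action and dislike romance, users 4-6 the reverse.
ratings <- matrix(
  c(5, 4, 1, 1,
    4, 5, 2, 1,
    5, 5, 1, 2,
    1, 2, 5, 4,
    2, 1, 4, 5,
    1, 1, 5, 5),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("user", 1:6),
                  c("action1", "action2", "romance1", "romance2"))
)

# Group users into two clusters of similar taste.
fit <- kmeans(ratings, centers = 2, nstart = 10)
fit$cluster

# Look at what user1's cluster rates highest on average.
peers <- ratings[fit$cluster == fit$cluster["user1"], , drop = FALSE]
sort(colMeans(peers), decreasing = TRUE)
```

Once users are grouped, a simple recommender could suggest whatever a user’s own cluster rates highest and they haven’t seen yet.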
Another way to make recommendations might be to focus on what’s dissimilar between users and/or items. Needless to say, the machine learning algorithms you choose largely depend on the specifics of your unique project.
Don’t forget about the content-based recommenders
Wait! There’s one more type of recommendation system we haven’t gotten around to yet – content-based recommendation systems. Content-based recommenders are an alternative approach you can use when you don’t have a ton of data available. The speed of content-based recommenders, however, largely correlates with the dataset’s size, making them unfit for large datasets.
One advantage to content-based recommenders?
You can use them to start recommending newer items that still don’t have user ratings (fixing what’s known as the “cold start” problem). This is helpful for getting new products out in front of your user base, so they can quickly begin to gain traction.
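To show how a content-based approach can score an item nobody has rated yet, here’s a minimal sketch in base R. The catalogue, the genre flags and the user’s ratings are made up, and the rating-weighted profile is just one simple way to do it.

```r
# Genre flags for each movie (made-up catalogue). "new_release" has no ratings yet.
genres <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("movie_a", "movie_b", "movie_c", "new_release"),
                  c("animation", "comedy", "drama"))
)

# One user's ratings of the movies they have seen.
user_ratings <- c(movie_a = 5, movie_b = 4, movie_c = 1)

# User profile: rating-weighted average of the genre flags of the rated movies.
# Multiplying the matrix by the vector scales each row by that movie's rating.
rated <- genres[names(user_ratings), , drop = FALSE]
profile <- colSums(rated * user_ratings) / sum(user_ratings)

# Score the unrated movie by cosine similarity between its genres and the profile.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
score <- cosine(genres["new_release", ], profile)
score
```

The new release gets a high score purely from its genre tags matching what this user tends to rate well, so no ratings from other users are needed.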
That being said, collaborative filtering systems still have a lot of advantages over content-based recommenders. These advantages include:
- They can handle huge, high-dimensional datasets.
- They can suggest niche items (items popular among only a specific segment of users).
- They can suggest items which may be from a completely different product category altogether.
- Based on the type of data you have, a collaborative filtering system can suggest items purchased by similar users, solely depending upon their ratings for these items.
By now, you should have a good grasp on recommendation engine concepts.
It’s the time you’ve all been waiting for – I’m now going to show you how to build a recommendation engine in R!
HOW TO BUILD A RECOMMENDATION ENGINE IN R
Phew, that was a lot! But if you’ve made it this far then you should be ready to begin looking at how to build a recommendation engine in R.
The coding demonstration
In the following demo, we’ll use the famous MovieLens dataset that’s been made available by GroupLens Research. The dataset consists of 20,000,000 distinct user ratings on about 27,000 movies. These are rated by 138,000 users. The data can be downloaded from the website here: https://grouplens.org/datasets/movielens/.
This dataset is fairly large, about 190 MB. Luckily, the website also hosts smaller versions of the MovieLens data, with sizes of 100,000 ratings, 1 million ratings and 10 million ratings. Let’s keep it simple by using the 100,000 ratings data, which is only 1 MB. With the download, you get a zipped file containing a readme plus separate movies, links, tags and ratings files.
Here is the link to the dataset used in the demo: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
So, how to build a recommendation engine in R… starting with the reading step in R, let’s read-in all our datasets and build a ratings matrix:
##Demo: How to build a recommendation engine in R ##
setwd("C:/Users/User/Desktop/Data-Mania Blog Coding Demos/Recommendation Engine in R")

#Read all the datasets
movies=read.csv("movies.csv")
links=read.csv("links.csv")
ratings=read.csv("ratings.csv")
tags=read.csv("tags.csv")

#Import the reshape2 library. Run the install.packages() lines only if the packages are not already installed
install.packages("reshape2", dependencies=TRUE)
install.packages("stringi", dependencies=TRUE)
library(stringi)
library(reshape2)

#Create ratings matrix with rows as users and columns as movies. We don't need timestamp
ratingmat = dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)

#We can now remove user ids
ratingmat = as.matrix(ratingmat[,-1])
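If you’d like to see what that reshaping step produces before running it on the real files, here’s a tiny sketch with made-up rows in the same long format as ratings.csv. Note that it uses base R’s `tapply()` as a stand-in for `dcast()`, so it runs without any extra packages; it isn’t the call used in the demo.

```r
# A few made-up rows in the same long format as ratings.csv.
ratings_long <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Base-R stand-in for the dcast() call: one row per user, one column per movie.
# Each (user, movie) pair appears once, so the "sum" is just that single rating;
# pairs with no rating come back as NA.
toy_mat <- with(ratings_long, tapply(rating, list(userId, movieId), sum))
toy_mat
```

Every user/movie pair that has no rating shows up as NA, which is why the full ratings matrix ends up mostly empty.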
The recommendation package in R we’ll use is recommenderlab.
It provides us a User Based Collaborative Filtering (UBCF) model. For similarity among user ratings, we have a choice to calculate similarity according to the following methods:
- Jaccard similarity
- Cosine similarity
- Pearson similarity
In this example, we’ll use the cosine similarity metric.
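To get a feel for how these three measures differ, here’s a hand-rolled comparison in base R on two made-up users. This is only a sketch of the textbook definitions; recommenderlab’s own implementations may handle missing ratings and other details differently.

```r
# Two users' ratings of the same five movies (made-up data, NA = not rated).
u1 <- c(5, 3, NA, 4, 1)
u2 <- c(4, NA, 2, 5, 1)

# Jaccard: overlap in WHICH movies were rated, ignoring the rating values.
jaccard <- sum(!is.na(u1) & !is.na(u2)) / sum(!is.na(u1) | !is.na(u2))

# Cosine and Pearson are computed over the movies both users rated.
both <- !is.na(u1) & !is.na(u2)
a <- u1[both]
b <- u2[both]
cosine  <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
pearson <- cor(a, b)  # equivalent to cosine similarity of the mean-centered ratings

c(jaccard = jaccard, cosine = cosine, pearson = pearson)
```

Jaccard only cares about which movies were rated, cosine compares the raw rating values, and Pearson compares them after removing each user’s average.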
#Uncomment the following line if the package is not installed
#install.packages("recommenderlab", dependencies=TRUE)
library(recommenderlab)
First, we want to reduce the size of our ratings matrix to make computation faster. On my machine, ratingmat takes up about 46.9 MB. This size is due to the large number of empty (NA) cells in the matrix: most users have rated only a tiny fraction of the movies, so the data is sparse, yet an ordinary R matrix still stores every cell. Let’s convert it into recommenderlab’s realRatingMatrix class, which stores only the ratings that actually exist.
#Convert ratings matrix to a realRatingMatrix, which stores it in a compact sparse format
ratingmat = as(ratingmat, "realRatingMatrix")
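To see why storing only the observed ratings saves so much memory, here’s a small base R sketch on made-up data. The matrix dimensions and fill rate are arbitrary, and the (row, column, rating) data frame is only an illustration of the idea, not how recommenderlab stores things internally.

```r
set.seed(1)

# A made-up 200 x 500 ratings matrix in which only 2% of the cells are filled.
n_users <- 200
n_movies <- 500
n_filled <- 2000
dense <- matrix(NA_real_, nrow = n_users, ncol = n_movies)
filled <- sample(length(dense), size = n_filled)
dense[filled] <- sample(seq(0.5, 5, by = 0.5), n_filled, replace = TRUE)

# Share of cells that hold no rating.
sparsity <- mean(is.na(dense))

# Storing only the observed (row, column, rating) triplets needs far less memory
# than storing every cell, which is the idea behind a sparse representation.
idx <- which(!is.na(dense), arr.ind = TRUE)
triplets <- data.frame(row = idx[, "row"], col = idx[, "col"], rating = dense[idx])

c(sparsity = sparsity,
  dense_bytes = as.numeric(object.size(dense)),
  triplet_bytes = as.numeric(object.size(triplets)))
```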
This step immediately reduced the size of the matrix to 1.7 MB on my machine, which is much, much smaller. Now let’s normalize the matrix so that our recommendations aren’t biased by how generously or harshly each individual user tends to rate.
#Normalize the ratings matrix
ratingmat = normalize(ratingmat)
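Here’s a base R sketch of what this kind of normalization does, assuming the common “centering” approach where each user’s average rating is subtracted from their own ratings (as I understand it, that’s the default behavior of `normalize()`, but check the package help to be sure). The two users and their ratings are made up.

```r
# Made-up ratings: user1 is a generous rater, user2 a harsh one.
ratings <- matrix(
  c(5, 4, NA, 5,
    2, 1, 3, NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("user1", "user2"), paste0("movie", 1:4))
)

# Each user's average over the movies they actually rated.
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's own mean from their ratings; NAs stay NA.
centered <- sweep(ratings, 1, user_means, FUN = "-")
centered
```

After centering, a positive value means “better than this user’s usual rating”, so a 3 from the harsh rater counts for more than a 4 from the generous one.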
The Recommender() function in the recommenderlab package is the underlying recommendation model we’re using here.
You may want to use the help function for this recommender to learn more about it. To do this, just enter the ‘?Recommender’ command in R.
#Create Recommender Model. The parameters are UBCF and Cosine similarity. We take 10 nearest neighbours
rec_mod = Recommender(ratingmat, method = "UBCF", param=list(method="Cosine",nn=10))
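If you’re curious what a user-based model with nearest neighbours is doing conceptually, here’s a stripped-down sketch in base R. It’s a simplified illustration on made-up data (plain cosine similarity, an unweighted average, and k = 2 instead of 10), not a reproduction of what recommenderlab does under the hood.

```r
# Made-up ratings: rows are users, columns are movies, NA = not rated.
ratings <- matrix(
  c(5, 4, NA, 1,
    5, 5, 4, 1,
    4, 5, 5, 2,
    1, 2, 1, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("movie", 1:4))
)

# Cosine similarity over the movies both users rated.
cosine <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

target <- "user1"
k <- 2  # plays the role of nn in the Recommender() call

# 1. Similarity between the target and every other user.
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(u) cosine(ratings[target, ], ratings[u, ]))

# 2. Keep the k most similar users.
neighbors <- names(sort(sims, decreasing = TRUE))[1:k]

# 3. Predict ratings for unseen movies as the neighbors' average rating.
unseen <- colnames(ratings)[is.na(ratings[target, ])]
predicted <- colMeans(ratings[neighbors, unseen, drop = FALSE], na.rm = TRUE)
predicted
```

The user whose taste runs opposite to the target (user4 here) never makes it into the neighbourhood, so their low rating doesn’t drag the prediction down.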
Now that we’ve built our model, let’s make some predictions.
Starting with the first user:
#Obtain top 5 recommendations for 1st user entry in dataset
Top_5_pred = predict(rec_mod, ratingmat[1], n=5)
At this point, we’ve created recommendations for the first user, but we can’t see them. That’s annoying.
To see the predictions our model made, we’ll convert them to a list and print them out:
#Convert the recommendations to a list
Top_5_List = as(Top_5_pred, "list")
Top_5_List
"47" "893" "1769" "2567" "3423"
As you can see, we get movie recommendations… but alas, they’re in movieId number format.
Let’s take a look at the movie names that correspond to these numbers. We’ll do this by using the movies dataset, which maps movie IDs to movie titles.
#Uncomment the following line if the package is not installed
#install.packages("dplyr")
library(dplyr)

#We convert the list to a dataframe and change the column name to movieId
Top_5_df=data.frame(Top_5_List)
colnames(Top_5_df)="movieId"

#Since movieId is of type integer in Movies data, we typecast id in our recommendations as well.
#as.character() first makes this work whether the column is a factor or a character vector
Top_5_df$movieId=as.numeric(as.character(Top_5_df$movieId))

#Merge the movie ids with names to get titles and genres
names=left_join(Top_5_df, movies, by="movieId")

#Print the titles and genres (output from the original run; your row order may differ)
names
  movieId                           title                genres
1    1769 Replacement Killers, The (1998) Action|Crime|Thriller
2    2567                     EDtv (1999)                Comedy
3    3423              School Daze (1988)                 Drama
4      47     Seven (a.k.a. Se7en) (1995)      Mystery|Thriller
5     893             Mother Night (1996)                 Drama
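The ID-to-title lookup can also be sketched with nothing but base R, which makes the type conversion step easier to see. The `movies_toy` table and its placeholder titles are made up for this example, and `match()` is used here as a simple alternative to a join.

```r
# Recommended ids come back as character strings; the titles table uses numbers.
rec_ids <- c("47", "893", "1769")

# Made-up stand-in for movies.csv (titles are placeholders).
movies_toy <- data.frame(
  movieId = c(47, 100, 893, 1769),
  title   = c("Movie A", "Movie B", "Movie C", "Movie D"),
  stringsAsFactors = FALSE
)

# Convert the ids to numbers, then look each one up with match(),
# which keeps the recommendations in their original order.
ids <- as.numeric(rec_ids)
titles <- movies_toy$title[match(ids, movies_toy$movieId)]
data.frame(movieId = ids, title = titles)
```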
Based on similarity between users, for the first user, our model initially recommends the above movies.
In our results, you can see:
- The year that the movie was released.
- The movie genres.
With further data processing and filtering, we could probably improve the relevancy of the recommendations so that the years and genres are even more similar. Congratulations, though!! You now know the basics on how to build a recommendation engine in R.
This is just the tip of the proverbial iceberg…
Netflix uses more than 27,000 genres to classify its movies. It suggests movies based on user similarities and on movie classifications. Tons of other features (like year, age, and user demographic) are used when making recommendations. So, this tutorial really was just the tip of the iceberg when it comes to building functional recommendation systems.
Take your knowledge deeper with my LinkedIn Learning course on building recommendation systems.