Previously, we discussed what product recommendation systems are and why they matter for businesses. Specifically, we discussed why providing personalized recommendations to users based on their past behavior and preferences adds so much value and allows enterprises to better compete in the market.
For today’s blog, we will be diving into the actual methods for implementing recommendation systems. Once we have a good understanding of those techniques, we will begin building a demo system by loading our training data into the Snowflake Data Cloud.
What Are The Techniques Used In Recommendation Systems?
Product recommendation systems can be divided into two main categories: Collaborative Filtering and Content-Based Filtering.
Collaborative Filtering systems use the past behavior of a group of users to make recommendations for a specific user. They look for users similar to the target user and recommend items that those similar users have liked.
Content-Based Filtering systems use the characteristics or features of the items themselves to make recommendations. They recommend items that are similar to the ones the target user has liked based on features like genre, author, or description.
In practice, recommendation systems often use a combination of both Collaborative Filtering and Content-Based Filtering, known as Hybrid Recommendations.
Hybrid recommendation combines the strengths of both methods by taking into account both the behavior of similar users and the characteristics of the items.
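As a toy illustration of the hybrid idea, the sketch below blends a collaborative score with a content-based score using a fixed weight. The function name, the weight, and the example scores are all invented for illustration; real systems tune the blend (or learn it) rather than hard-coding it:

```python
def hybrid_score(collab_score, content_score, alpha=0.7):
    """Blend a collaborative-filtering score with a content-based score.

    alpha controls how much weight the collaborative signal gets;
    both inputs are assumed to be on the same scale (e.g. 0-5 stars).
    """
    return alpha * collab_score + (1 - alpha) * content_score

# A movie whose peers-based score is 4.5 and whose feature-based
# score (genre, cast, description) is 3.0:
print(hybrid_score(4.5, 3.0))  # -> approximately 4.05
```

The weight `alpha` lets the system lean on content features for new items with few ratings (the "cold start" case) and lean on collaborative signals once enough ratings accumulate.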
Practical Example of Recommendation Systems
To better understand these two techniques, let us return to the Netflix example we used in our prior blog. Netflix uses a hybrid system of Collaborative Filtering and Content-Based Filtering to make recommendations based on observations from user behavior.
Below is an architecture diagram representing a high-level overview of Netflix’s recommendation system.
You will notice that movies are assigned ratings by users. These ratings are key to determining how one user might value a movie relative to their peers’ preferences. For example, if my friends rate Lord of the Rings highly, I will probably like Lord of the Rings too. To process these ratings, we use Collaborative Filtering.
Under the hood, Collaborative Filtering is based on a technique called matrix factorization.
Matrix factorization is a mathematical technique used to break down a large matrix into smaller, simpler matrices. It works by finding a set of factors that, when multiplied together, approximately reproduce the original matrix. There are several factorization algorithms, but let us assume that we use the Singular Value Decomposition algorithm, or SVD.
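Before turning to SVD specifically, here is a minimal pure-Python sketch of the general factorization idea: fit user and item factor matrices so their product reproduces the observed ratings. The tiny ratings matrix, the number of factors, and the learning rate are all invented for illustration, and the sketch uses simple stochastic gradient descent rather than a true SVD solver, since it keeps the mechanics visible:

```python
import random

# Tiny user x item ratings matrix; 0 means "not yet rated".
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]

def factorize(R, k=2, steps=5000, lr=0.01, reg=0.02):
    """Approximate R as P @ Q^T via stochastic gradient descent.

    P holds one k-dimensional factor vector per user, Q one per item;
    only the observed (non-zero) ratings are fitted.
    """
    random.seed(0)
    n_users, n_items = len(R), len(R[0])
    P = [[random.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[random.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u in range(n_users):
            for i in range(n_items):
                if R[u][i] == 0:
                    continue  # skip unobserved entries
                pred = sum(P[u][f] * Q[i][f] for f in range(k))
                err = R[u][i] - pred
                for f in range(k):  # nudge both factor vectors toward the rating
                    P[u][f] += lr * (err * Q[i][f] - reg * P[u][f])
                    Q[i][f] += lr * (err * P[u][f] - reg * Q[i][f])
    return P, Q

P, Q = factorize(R)
# The learned factors can now predict the unobserved entries, e.g.
# user 0's rating for item 2, which was never rated:
predicted = sum(P[0][f] * Q[2][f] for f in range(2))
print(round(predicted, 2))
```

The payoff is that the blanks in `R` get filled in: the dot product of a user's factor vector with an item's factor vector is the predicted rating, even for pairs that were never observed.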
Singular Value Decomposition
Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix (which has an orthonormal eigenbasis) to any m × n matrix.
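To see SVD in action, the sketch below applies NumPy’s `np.linalg.svd` to a small made-up ratings matrix and keeps only the top two singular values. That truncation step is what produces the low-dimensional user and item representations used for recommendations:

```python
import numpy as np

# Toy user x movie ratings matrix (values invented for illustration):
# two "fantasy fans" on top, two "drama fans" below.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

# Full SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the top k=2 singular values: a low-rank approximation
# whose rows of U are user factors and columns of Vt are item factors.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_approx, 1))  # stays close to the original ratings
```

Because most of the signal lives in the largest singular values, the rank-2 reconstruction already tracks the original matrix closely, which is why the discarded dimensions can be treated as noise.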
Now that we have reviewed the underlying statistical knowledge, let’s return to our Collaborative Filtering component for movie recommendations. First, we consume user-item interaction data, such as movie ratings, to factorize the user-item matrix into two matrices: one representing the users and one representing the items. Then the resulting factor matrices are used to find nearest neighbors in the low-dimensional space, providing recommendations based on users’ past behavior and the movies and shows they liked.
The shorter the Euclidean distance (i.e., the closer one movie is to another), the more relevant the recommendation.
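Continuing the illustration, suppose the factorization has already placed each movie at a point in a 2-D latent space (the coordinates and titles below are invented); recommending then reduces to finding the nearest neighbors by Euclidean distance:

```python
import math

# Hypothetical 2-D latent vectors produced by a factorization step.
movie_vectors = {
    "The Fellowship of the Ring": (0.90, 0.80),
    "The Two Towers": (0.88, 0.82),
    "The Notebook": (-0.70, 0.30),
    "Toy Story": (0.20, -0.90),
}

def nearest_neighbors(target, vectors, n=2):
    """Return the n movies closest to `target` in latent space."""
    others = [(title, math.dist(vectors[target], vec))
              for title, vec in vectors.items() if title != target]
    others.sort(key=lambda pair: pair[1])  # smallest distance first
    return [title for title, _ in others[:n]]

print(nearest_neighbors("The Fellowship of the Ring", movie_vectors))
# -> ['The Two Towers', 'The Notebook']
```

Because the factorization placed the two Tolkien films almost on top of each other, the nearest-neighbor lookup surfaces the sequel first, exactly the "shorter distance, more relevant" behavior described above.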
The second component of Netflix’s recommendation system incorporates the Content-Based Filtering approach which utilizes the characteristics of the movies and TV shows themselves to make recommendations. This includes information such as genre, director, actors, and keywords from the show’s description.
The system also analyzes the video and audio content of the show to understand its characteristics. To measure similarity, we can use a technique called Cosine Similarity, which scores the similarity between two items on a scale from -1 to 1. A value of 1 means the two items point in exactly the same direction (maximally similar), 0 means they share no similarity (orthogonal), and -1 means they point in opposite directions.
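Here is a minimal pure-Python sketch of cosine similarity over made-up genre vectors; in a real content-based system the vectors would be built from genre flags, cast, director, and description embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between feature vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy feature vectors: [action, fantasy, romance]
fellowship = [0.9, 1.0, 0.1]
two_towers = [1.0, 0.9, 0.1]
notebook   = [0.0, 0.1, 1.0]

print(cosine_similarity(fellowship, two_towers))  # close to 1: very similar
print(cosine_similarity(fellowship, notebook))    # near 0: little overlap
```

Note that cosine similarity compares direction, not magnitude, so a short and a long feature vector pointing the same way still score 1; this makes it robust when items have different numbers of tags or keywords.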
Now that we have a good knowledge base, let’s begin building our system with the first and most important step: Data!
How Do We Source Data to Train Recommendation Systems?
Snowflake is a cloud-based data platform that provides developers and, most importantly, businesses with several benefits. For one, Snowflake is built to scale: it can handle massive amounts of data and scale up or down with demand. Additionally, Snowflake’s architecture is designed for performance, offering fast query processing that makes it an ideal solution for data warehousing and analytics.
Lastly and a favorite of mine as a developer, Snowflake’s user-friendly implementation makes sourcing data effortless.
For these reasons, we will utilize Snowflake as our data source. As for the actual data, we will be using the MovieLens dataset, a classic for experimenting with recommendation systems. So without further ado, let’s get started!
Data Loading Tutorial
Step 1: Sign in to Snowflake or create an account.
Step 2: Download the dataset
Step 3: Run the following block of code to:
- Set the context
- Create a compute warehouse
- Create the tables
use role sysadmin;

create database MOVIELENS;
use schema movielens.public;

CREATE OR REPLACE WAREHOUSE COMPUTE_WH WITH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

create or replace table movies (
  movieid int,
  title varchar,
  genres varchar
);

create or replace table ratings (
  userid int,
  movieid int,
  rating float,
  timestamp timestamp_ntz
);
Step 4: Sourcing data in Snowflake has three prerequisites: database, schema(s), and compute. Let’s review the above code to see how these three requirements are satisfied:
- First, using the sysadmin role, we create a database called “MOVIELENS” and switch to its public schema (“movielens.public”).
- Next, we create an extra-small warehouse called COMPUTE_WH that auto-suspends after 60 seconds of inactivity and auto-resumes on demand.
- Then, we create a table called “movies”. The table has three columns: “movieid”, “title”, and “genres”. The data from this table will be used for training our content-based filtering component.
- After that, we create a second table called “ratings”. This table has four columns: “userid”, “movieid”, “rating”, and “timestamp”. The data from this table will be used for training our collaborative filtering component.
Step 5: Using the Snowflake UI menu bar, switch over to the Databases tab and select the MOVIELENS database you’ve just created.
Step 6: Navigate into the MOVIES table within the database. Now that you’re inside a table, you can load data by clicking the Load Data button. You’ll want to select an XS-sized warehouse to minimize usage. Once selected, hit Next.
Step 7: Navigate to the files you downloaded earlier. You should see a file named movies.csv; that’s what we’ll be using. Select it, and hit Next.
Step 8: For this file, we’ll be creating a new file format. Click the + button, give it a name (such as “MOVIELENS”), change the value of Header Lines to Skip to 1, and set Field optionally enclosed by to Double Quote. Click Finish and hit Load to load the MOVIES table.
Step 9: Navigate into the empty RATINGS table, and click the Load Data button, just like before. This time, however, you’ll want to select the ratings.csv file. Luckily, you don’t need to create another file format; this file is formatted just like the previous one, so you can reuse your MOVIELENS File Format.
Step 10: Pat yourself on the back for successfully loading our training data into Snowflake!
Congratulations! You now know the core techniques that power recommendation systems AND you have begun building out a recommendation system of your own with Snowflake. Tune in for our final article in this series where we will implement these techniques as a fully functioning recommendation engine!
If you found this article helpful or just want to learn more about machine learning, reach out to phData today and see how we can accelerate your business growth with our machine learning and Snowflake consulting services.