dbt is an exceptional tool to use in combination with the Snowflake Data Cloud to process data. Once data is loaded into Snowflake, it needs to be transformed to make it usable for reporting and analytics purposes. dbt is widely used for performing this type of transformation.
As businesses seek to derive greater value from their data, many are incorporating machine learning into their transformation processes. Integrating machine learning models into these transformations can bring significant benefits to end users who are working with the final, transformed data.
This article will show how to integrate a pre-built sentiment analysis model into a dbt pipeline using Snowpark Python. The result is a table of sentiment scores that end users can query in the final layer of the data transformation process.
Why Does it Matter?
The increasing use of machine learning models in businesses presents the challenge of bringing those models into production and letting users interact with them. dbt has become a popular tool for building warehouses on Snowflake, and with the addition of Python models, machine learning models can now be integrated seamlessly into dbt pipelines.
What is Snowpark Python?
Snowpark for Python is a way to interact with Snowflake using a dataframe API. It allows a data engineer to build transformations in Snowflake using Python code instead of just SQL. With the recent addition of Snowpark-optimized warehouses, machine learning models can be trained and deployed natively in Snowflake.
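As a rough illustration of the dataframe API (the connection parameters and table names below are placeholders, not from this project), a Snowpark session lets you express transformations that are pushed down and executed inside Snowflake:

```python
def snowpark_example(connection_parameters):
    # Sketch only: requires the snowflake-snowpark-python package and real
    # credentials; the table and column names are illustrative placeholders.
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    session = Session.builder.configs(connection_parameters).create()

    # Dataframe operations are translated to SQL and run in Snowflake,
    # not on the client machine.
    orders = session.table("RAW_ORDERS")
    recent = (
        orders.filter(col("ORDER_DATE") >= "2023-01-01")
              .select("ORDER_ID", "STATUS")
    )
    recent.write.save_as_table("RECENT_ORDERS", mode="overwrite")
```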
If you’re interested in learning more about Snowpark for Python, check out our previous post where we discussed the new feature when it was rolled out.
What is Sentiment Analysis?
Sentiment Analysis is a natural language processing (NLP) technique that determines whether a piece of text is positive or negative. It is typically applied to reviews or customer feedback to categorize each entry as positive or negative.
This allows an analyst to easily spot trends in review data without classifying it manually. A data scientist can build a model to do that classification, saving the analyst time.
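To make the idea concrete, here is a deliberately naive keyword-count classifier. This is not the VADER model used later in this post, just an illustration of the positive/negative decision:

```python
def naive_sentiment(text):
    # Toy illustration only: real models like VADER weigh thousands of
    # lexicon entries and handle negation; this counts a few hard-coded words.
    positive = {"good", "great", "love", "excellent", "happy"}
    negative = {"bad", "terrible", "hate", "awful", "slow"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return score > 0  # True = classified as a positive review

print(naive_sentiment("great service and excellent food"))  # → True
print(naive_sentiment("terrible wait and awful experience"))  # → False
```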
Why use dbt?
dbt is a tool for transforming data once it is loaded. It allows a data engineer or data analyst to easily create reusable transformations and templates. Models are created via SQL or Python and can be materialized in various ways.
By controlling transformations via dbt, it is much easier to let a user build a transformation and manage changes to it through a CI/CD process.
Overall, this leads to a better understanding of which transformations exist and more control over them.
How to Set Up
To start, dbt needs to be installed on the machine this project is running on. dbt Core should work (instructions can be found here) but we recommend exploring dbt Cloud as a superior option. You can even start with a free trial to explore the advantages. dbt provides a project called jaffle shop to experiment with to learn dbt.
This post will take that project and modify it to add some review data on the orders (sourced from Kaggle). These reviews were added to the seeds, so a table with 99 reviews will be loaded into the database.
To retrieve the project, start by cloning the repo here.
Once that repo is cloned, the “profile” property in dbt_project.yml will need to be updated to reflect the profile created in ~/.dbt/profiles.yml. While in profiles.yml, note the schema used in the profile.
Next, update the schema in the setup script to the same value from the profiles.yml file, and update the [path_to_project] placeholder to the path of the cloned repo. The script can then be run via SnowSQL; it creates a new stage that holds the pre-trained NLP model from NLTK.
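The core of that setup script is along these lines (the stage and schema names here are assumptions for illustration; check the repo’s script for the exact values):

```sql
-- Hypothetical sketch of the setup script: creates a stage and uploads
-- the pre-trained VADER lexicon so dbt Python models can import it.
USE SCHEMA analytics.your_schema;  -- the schema noted in profiles.yml

CREATE STAGE IF NOT EXISTS dbt_models;

-- PUT is a client-side command, so this must run through SnowSQL.
PUT file://[path_to_project]/vader_lexicon.zip @dbt_models
    AUTO_COMPRESS = FALSE;
```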
Once the model is pushed to the stage, it can be referenced in later Python scripts.
What Was Changed in jaffle shop?
The default jaffle shop project has three tables that are loaded as raw tables and are joined together to get two final tables. This version of jaffle shop adds in another seed file called “reviews” that looks like this:
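The file follows this shape (the rows below are illustrative examples of the format, not actual rows from the Kaggle dataset):

```
order_id,review
1,"The food arrived quickly and was delicious"
2,"Order was late and the driver was rude"
3,"Great experience, will order again"
```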
This seed file is a CSV file that has an order id to tie back to the order seed and a review text that contains the review for that order.
The next step is the addition of the reviews table to the staging schema model and to the yml file. The SQL model simply renames the review field to review_text, while the schema.yml for the staging schema adds a test ensuring there is only one review per order_id.
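A minimal sketch of those two files (the model and file names here are assumptions based on the description, not copied from the repo):

```sql
-- models/staging/stg_reviews.sql
select
    order_id,
    review as review_text
from {{ ref('reviews') }}
```

```yaml
# models/staging/schema.yml (excerpt)
models:
  - name: stg_reviews
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```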
Lastly, the final output models were modified and a new table called “reviews” was added, based on the staged review table, with tests asserting that every value in its order_id column exists in the orders table and is not null and unique.
The final addition to the project is the reviews.py file which will take the stage review table, read in the model, and apply the sentiment analysis. The first part that happens is the configuration:
This tells dbt/Snowpark that whatever dataframe is returned from this method should be created as a table. The nltk and Pandas Python packages are required to run this script, and the vader_lexicon.zip file is imported so the Python script can access it.
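That configuration block looks roughly like this (the @dbt_models stage path is an assumption; use whatever stage the setup script created):

```python
def model(dbt, session):
    # dbt Python model entry point: dbt calls this function and materializes
    # the returned dataframe. Only the configuration step is sketched here.
    dbt.config(
        materialized="table",                       # the return value becomes a table
        packages=["nltk", "pandas"],                # resolved inside Snowflake
        imports=["@dbt_models/vader_lexicon.zip"],  # assumed stage path to the model file
    )
    ...
```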
The problem with nltk is that it wants to either download the prebuilt models or it needs them in a specific folder structure. In this case, it wants the sentiment analysis model to be in the folder structure of nltk_data/sentiment/vader_lexicon.zip.
So to deal with this, the move_files command will create a temporary folder with the correct structure and copy the model to that location:
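The move itself is plain Python file handling. A self-contained sketch (the function name and paths mirror the description; the exact source location of the imported file inside Snowflake is not shown here):

```python
import os
import shutil
import tempfile

def move_files(model_file):
    # Copy the imported lexicon into the <dir>/sentiment/ layout that nltk
    # expects, under a writable temporary directory, and return the directory
    # that should be added to nltk's search path.
    nltk_dir = os.path.join(tempfile.mkdtemp(), "nltk_data")
    target_dir = os.path.join(nltk_dir, "sentiment")
    os.makedirs(target_dir)
    shutil.copy(model_file, target_dir)
    return nltk_dir

# Example with a stand-in file:
src = os.path.join(tempfile.mkdtemp(), "vader_lexicon.zip")
open(src, "wb").close()
path = move_files(src)
print(os.path.exists(os.path.join(path, "sentiment", "vader_lexicon.zip")))  # → True
```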
With that done, the new temporary path can be added to the nltk data paths so nltk can find the model:
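Registering that path is a single call inside the model function (a sketch; nltk is imported lazily because it is only available through the packages configuration when running in Snowflake):

```python
def register_model_path(nltk_dir):
    # nltk searches the directories in nltk.data.path for resources such as
    # sentiment/vader_lexicon.zip, so the temporary folder is appended here.
    import nltk  # available in Snowflake via packages=["nltk"]
    nltk.data.path.append(nltk_dir)
```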
Next, the dataframe is converted to a Pandas dataframe, the review text is run through the scoring method, the score is converted to a boolean, and the result is converted back to a Snowpark dataframe:
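Put together, the scoring step looks roughly like this sketch (the column names and the 0 threshold follow the description above; df is the Snowpark dataframe for the staged reviews):

```python
def score_reviews(session, df):
    # Sketch of the scoring step; assumes the lexicon has already been copied
    # into place and registered on nltk's data path.
    from nltk.sentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    # Pull the Snowpark dataframe into memory as a Pandas dataframe.
    pandas_df = df.to_pandas()

    # VADER's compound score falls in [-1, 1]; > 0 is treated as positive.
    pandas_df["REVIEW_POSITIVE"] = pandas_df["REVIEW_TEXT"].apply(
        lambda text: analyzer.polarity_scores(text)["compound"] > 0
    )

    # Hand the scored data back to Snowpark for dbt to materialize.
    return session.create_dataframe(pandas_df)
```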
At this point, the scored data is returned to dbt, which materializes it as a table.
Running the New Model
To run the code, the seed data first needs to be loaded as tables. To do that, run the following command:
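From the project root:

```
dbt seed
```

dbt seed loads the CSV seed files, including the new reviews file, as tables. After seeding, dbt run builds the SQL and Python models, including the new reviews model.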
The output should look like this:
What Did We Do?
With the models completed, looking at the Snowflake database/schema, there should be new tables and views created.
Most of these tables and views are the default jaffle shop objects, but the table “REVIEWS” is newly created by the sentiment analysis model. When we open the new table, it should have the order_id, review_text, and review_positive columns. This is seen below:
The main issue with the existing process is that all of the data needs to be pulled into memory to use the Pandas dataframe. If large volumes of data are being pulled into the dataframe, one option is to use a Snowpark-optimized warehouse.
These warehouses come with a large amount of memory to support workloads like machine learning training. The other option is to use Pandas batches (to_pandas_batches) instead of to_pandas. If all of the data isn’t required at once, this runs the process on batches of data instead of the whole dataset at a time.
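The batch approach can be sketched as follows (assumes the same model structure as before; to_pandas_batches yields Pandas dataframes chunk by chunk instead of one large frame):

```python
def score_in_batches(session, df, analyzer):
    # Sketch: score the Snowpark dataframe in memory-sized chunks rather
    # than materializing every row at once with to_pandas().
    import pandas as pd

    scored_batches = []
    for batch in df.to_pandas_batches():  # each batch is a Pandas dataframe
        batch["REVIEW_POSITIVE"] = batch["REVIEW_TEXT"].apply(
            lambda text: analyzer.polarity_scores(text)["compound"] > 0
        )
        scored_batches.append(batch)

    # Recombine the scored chunks and hand them back to Snowpark.
    return session.create_dataframe(pd.concat(scored_batches))
```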
With this post, we were able to demonstrate a use case for dbt Python that can provide a lot of value for getting models into production. It can also let those models be used by more end users in more ways.
So when using dbt in the future, there are quite a few more possibilities with the addition of Python models and Snowpark.
Looking for more help with Snowflake and Snowpark? As the 2022 Snowflake Partner of the Year, phData excels at helping organizations succeed with Snowflake. Reach out today for help, advice, and actionable strategies.