October 7, 2022

How to Build a Python Model in dbt with Snowflake

By Dakota Kelley

With dbt Core v1.3, dbt has introduced the ability to build your models in Python for a select few data warehouses, including the Snowflake Data Cloud.

In this blog, we’ll cover the why, when, and how of building dbt Python models and our best practice recommendations for when to use them.

Why Did dbt Implement Python Models?

dbt was originally built to be SQL first and only, with the belief that SQL would provide the most effective and accessible way to build modular, well-tested, and well-documented data transformations with business logic. So why did they add Python support to dbt?

dbt is striving to be an active participant in the modern data ecosystem as it changes. While SQL is able to lift the load of the vast majority of use cases for data transformations, edge cases (such as very complex statistics) have begun to pop up that are either very difficult or unsolvable with just SQL. 

So dbt began looking at the various data languages and settled on Python as the next language to support, one that could handle these edge cases. It was probably an easy decision when you consider how accessible and popular Python is compared to the likes of R and Scala.

However, dbt couldn't simply bolt on Python support; they had to make sure that the DAG lineage would remain intact for any of these Python models. That meant building in support for tracking the lineage of data sources, processing the DataFrames, and then storing that data in a table, all while still allowing your standard tests, configurations, and documentation to exist for these new models.

Remember, these Python models exist only to provide a new means to write your transformation logic.

When Should I Use Python Models?

As dbt has recommended, it's best to use these dbt Python models only for something that can't be done efficiently or effectively in SQL. While the Python implementation will mature as time goes on, the primary use case is to address those edge cases that you can't easily solve in SQL while still allowing you to materialize the results from Python into a table in your data warehouse.

Another use case that might make sense for a Python model is data enrichment: applying existing Python libraries that already solve a problem you would otherwise have to solve in SQL. Think IP parsing, determining whether or not a date falls on a holiday, or leveraging NumPy to perform complex mathematical computations.
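
For instance, here is a minimal sketch of a dbt Python model that flags whether a date falls on a US holiday using the holidays package. The upstream model name (orders), the ORDER_DATE column, and the availability of the holidays package in your warehouse's package repository are assumptions made for the example:

    import holidays


    def model(dbt, session):
        # Ask dbt to make the required packages available on the warehouse
        # and materialize the enriched result as a table
        dbt.config(materialized="table", packages=["holidays", "pandas"])

        # Hypothetical upstream model with an ORDER_DATE column
        orders = dbt.ref("orders").to_pandas()

        # Flag each order date that falls on a US holiday
        us_holidays = holidays.US()
        orders["IS_HOLIDAY"] = orders["ORDER_DATE"].apply(lambda d: d in us_holidays)

        return orders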

How to Build a Python Model in dbt?

Before we begin building Python models, we need to verify that we are using dbt Core v1.3 or greater. If you're using dbt Cloud, you can go to your environment and configure it to use the pre-release of v1.3; if you're running dbt Core locally, you can update your installation to v1.3 to get access to Python models. Once we have the correct version of dbt, we can begin experimenting with our Python models.

Before we build our Python model, let's get a data set to experiment with. For this how-to, I will generate some fake transaction data. The data will contain a single customer with nine unique transactions and various amounts tied to those transactions. The data looks something like this:

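The original post shows this data only as a screenshot; as a stand-in, here is a sketch of a simple Python model (say, models/values.py) that could generate similar fake data. The model name, column names, and amounts are illustrative, not the post's actual values:

    import pandas as pd


    def model(dbt, session):
        # Materialize the fake transaction data as a table
        dbt.config(materialized="table", packages=["pandas"])

        # One customer, nine transactions, illustrative amounts
        return pd.DataFrame({
            "CUSTOMER_ID": [1] * 9,
            "TRANSACTION_ID": list(range(1, 10)),
            "AMOUNT": [35, 42, 10, 25, 60, 15, 80, 5, 50],
        })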

Then let's say we are tasked with generating a power set of these transactions. A power set is the set of all possible combinations of these nine transactions (it will result in a dataset that is 2^n in size, where n is the total number of records, so nine transactions yield 2^9 = 512 subsets).

We want to know the sum of the amount within each set for some analysis we need to perform later. As an example, the set containing just transactions 1 and 2 would have a sum of 77. So there are a few things we need to do.

We first need to establish what all of the power sets are, and then get the sum of the amount field for each one. Here is what a Python model that does that looks like:
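
The post shows the full model as a screenshot, so what follows is a minimal reconstruction of what such a model could look like. It assumes an upstream model named values with TRANSACTION_ID and AMOUNT columns (those names are assumptions), and the exact details may differ from the original:

    from itertools import chain, combinations

    import pandas as pd


    def power_set(iterable):
        # Convert the input to a list, then chain together the combinations
        # of every possible size, from the empty set up to the full set
        values = list(iterable)
        return chain.from_iterable(
            combinations(values, size) for size in range(len(values) + 1)
        )


    def model(dbt, session):
        # Make pandas available on the warehouse and materialize the result as a table
        dbt.config(materialized="table", packages=["pandas"])

        # ref() keeps this model in the DAG; to_pandas() converts the
        # Snowpark DataFrame into a pandas DataFrame
        transactions = dbt.ref("values").to_pandas()

        # Start with an empty DataFrame and append one row per subset:
        # the subset itself and the sum of its amounts
        results = pd.DataFrame(columns=["POWERSET", "AMOUNT"])
        for subset in power_set(transactions["TRANSACTION_ID"]):
            ids = list(subset)
            amount = transactions.loc[
                transactions["TRANSACTION_ID"].isin(ids), "AMOUNT"
            ].sum()
            results.loc[len(results)] = [str(ids), amount]

        return results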

We start by creating a file within our models folder called powersets_py.py; once our model is declared as a .py file, we can begin building our Python model. Let's step through what this model is doing.

To begin, we will look at our import statements. Just like in any Python script, we can import the packages we need, and if we need to make sure a package exists for our model, there is a packages config that allows us to specify the packages that should be available to a specific model. Next, let's take a look at our user-defined function, power_set:

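Here is that helper again from the sketch above (the post shows it as a screenshot, so this is a reconstruction):

    from itertools import chain, combinations


    def power_set(iterable):
        # Convert the input to a list, then chain together the combinations
        # of every possible size, from the empty set up to the full set
        values = list(iterable)
        return chain.from_iterable(
            combinations(values, size) for size in range(len(values) + 1)
        )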

This function doesn't do too much: it takes an iterable data set, converts it to a list, and uses the chain and combinations functions we imported from itertools to develop a power set for us. If, for example, we had supplied the values [1, 2, 3], we would have received: { [], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3] }.

This function allows us to generate the different power sets, which we can then use to build a DataFrame of our power sets. Now, let's take a look at our model function:

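And here is the model function from the sketch above (again a reconstruction, with the values model name and column names assumed):

    import pandas as pd


    def model(dbt, session):
        # Make pandas available on the warehouse and materialize the result as a table
        dbt.config(materialized="table", packages=["pandas"])

        # ref() keeps this model in the DAG; to_pandas() converts the
        # Snowpark DataFrame into a pandas DataFrame
        transactions = dbt.ref("values").to_pandas()

        # Start with an empty DataFrame and append one row per subset:
        # the subset itself and the sum of its amounts
        results = pd.DataFrame(columns=["POWERSET", "AMOUNT"])
        for subset in power_set(transactions["TRANSACTION_ID"]):
            ids = list(subset)
            amount = transactions.loc[
                transactions["TRANSACTION_ID"].isin(ids), "AMOUNT"
            ].sum()
            results.loc[len(results)] = [str(ids), amount]

        return results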

The syntax for building a Python model is to start by defining the model function, which takes two parameters: dbt and session. dbt is a class compiled by dbt Core and will be unique for each model. Meanwhile, session is a class that represents the connection to the Python backend on your data platform; it is used to read tables as DataFrames and to write DataFrames back to tables.

This model function is required to return a single DataFrame. In the case of Snowflake, this DataFrame must be either a Snowpark or a pandas DataFrame.

At the start of this function, I use the dbt class to set some config for my model, including making sure the pandas package is available and setting the materialization as a table. Currently, Python models only support the table and incremental materializations. After setting the config, I use dbt.ref() to get a DataFrame of our values model and to_pandas() to convert it to a pandas DataFrame.

Once I have the pandas DataFrame, I can create an empty DataFrame to populate and get the power set of the Transaction IDs. With the power set in hand, I can work on creating the summations for each of the power sets and appending them to my final DataFrame that I return to dbt. 

Since we made use of the ref function, our DAG will show the Python model. As an example, here is the DAG showing both a SQL and a Python model attached to our data set:

A screenshot of the DAG: the "value" model feeding into both the "powersetpy" and "powersetsql" models.

With this model built, I can perform a dbt run. At this point, dbt will connect to your cloud data warehouse, run the Python script, and insert the data into a table for us. That table is queryable using the ref function we normally use to reference our models, like so:

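For example, a downstream dbt model or analysis could select from it with a couple of lines of SQL (a sketch, using the model name from the file we created earlier):

    select *
    from {{ ref('powersets_py') }}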

This of course makes it very easy to query the results of our Python model and verify that we did indeed generate a power set with the summations of the amounts for all transactions in each power set. The results look like so:

A table of results with two columns: "Amount" and "Powerset."

As you can see, our Python model is running as expected and generating the result set correctly. Now you know how to build a Python model in dbt.

Closing

Building a Python model in dbt is relatively simple. All you have to do is define the model function and make sure it returns a DataFrame that dbt can build into a table.

Need help getting started with Snowflake or dbt? Our team of data experts is happy to assist. Reach out today!

Pro Tips:

  1. Make sure the Python model you’re building actually needs to be a Python model.
  2. Make sure your Python model has the model function defined within it and that it returns a DataFrame that dbt can materialize.
  3. If you have any package dependencies for your Python model, make sure to configure them with the package configuration.
