October 7, 2022

Automating Your Transformation Pipeline with dbt

By Dakota Kelley

So you’ve built your first set of transformations in dbt, but now you need to figure out how to automate your deployment and code changes to your various environments. 

However, you’re not sure where to even start planning, let alone making sure that you’re sticking to best practices (whether that’s running your code on a schedule or having it run based on certain actions within your git repository).

Thankfully, you’ve found this blog!

In this article, I’ll show you how to set up a run in both dbt Core and dbt Cloud, while also discussing best practices. 

I’ll start by covering some basics about overloading the built-in functions that handle schema and database naming. Please note that you might need to change these macros, especially if you have a multi-schema or multi-database setup with a defined naming structure (such as dev_*, test_*, prod_*, etc). 

After that, I’ll cover how to set up a job in dbt Cloud, how to schedule the job, and even how to trigger the job from the API.

Finally, I’ll show you how to configure a run for dbt Core.

Specifying dbt Run Location

One of the easiest and cleanest ways to make sure that your dbt pipeline runs into the correct environment is to overload the macro built into dbt that sets the schema and database name. The “how” will depend on the way that you decide to architect your data warehouse.

For this example, let’s say that we are using a single Snowflake Data Cloud account. Within that Snowflake account, we will have various databases to house our transformations, and each database name will start with the environment. 

Let’s say that we have a data product called “sales” within Snowflake with a naming convention like dev_sales, stg_sales, prod_sales, etc. We want our deployments to write to the corresponding database: development work goes to dev_sales, and our goal is to have automated deployments that run our transformations into the stg_sales and prod_sales databases based on certain activity within our git repository.

So what’s the best way to address this? 

There are several different ways to do this. You can use Jinja within your dbt_project.yml to change the database based on either an environment variable or the target name for the profile or job. Another way is to overload the generate database macro and have it prepend that environment variable or target name to your database name.
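As a quick illustration of the first approach, you could set the database for your models directly in dbt_project.yml. The project name and the DBT_ENVIRONMENT variable below are placeholders for this sketch:

    # dbt_project.yml (sketch) -- pick the environment prefix from an environment variable
    models:
      my_dbt_project:
        +database: "{{ env_var('DBT_ENVIRONMENT', 'dev') }}_sales"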

To take the macro approach, you just need to create a macro in your project called generate_database_name. This will cause dbt to use your own custom database name generation instead of the built-in generation. For this example, let’s say I’m going to use the target name. Here’s what that would look like:

[Screenshot: the custom generate_database_name macro]
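Since the original screenshot isn’t reproduced here, below is a minimal sketch of what such a macro might look like. It assumes target names of stg and prod and that the database configured in your profile or dbt_project.yml is the bare product name (such as sales); adjust the names to match your own environments.

    {% macro generate_database_name(custom_database_name=none, node=none) -%}
        {#- Use the custom database from dbt_project.yml if one is set, otherwise the profile default -#}
        {%- set base_database = custom_database_name if custom_database_name is not none else target.database -%}
        {#- Automated stg/prod runs get the target name as a prefix; everything else lands in a dev database -#}
        {%- if target.name in ['stg', 'prod'] -%}
            {{ target.name }}_{{ base_database | trim }}
        {%- else -%}
            dev_{{ base_database | trim }}
        {%- endif -%}
    {%- endmacro %}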

As you can see, we’ve overloaded the generate database name function to prepend our target name to the default or custom database set in our profile or dbt_project.yml, so long as we are running against the stage or prod target; otherwise, everything goes to a dev database.

You can do something similar to the schema name if you prefer to separate environments at the schema level. With that said, I highly recommend you use different databases for your various environments instead of using schemas to separate your environments. 

What I recommend is configuring schema names for your development work so that your developers aren’t stepping on each other’s toes as they make changes. This allows you to have schemas that might have someone’s name in them for their personal development. That would look something like this:

[Screenshot: the custom generate_schema_name macro]
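Again, the original screenshot isn’t reproduced here, but a minimal sketch of the idea, closely following the generate_schema_name_for_env pattern from the dbt documentation and assuming the same stg and prod target names, might look like:

    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- set default_schema = target.schema -%}
        {%- if target.name in ['stg', 'prod'] and custom_schema_name is not none -%}
            {#- Automated stg/prod runs use just the custom schema name (e.g. model) -#}
            {{ custom_schema_name | trim }}
        {%- elif custom_schema_name is not none -%}
            {#- Personal development runs prefix the developer's default schema (e.g. dbt_dkelley_model) -#}
            {{ default_schema }}_{{ custom_schema_name | trim }}
        {%- else -%}
            {{ default_schema }}
        {%- endif -%}
    {%- endmacro %}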

For non-automated deployments, we prefix any custom schema name with the developer’s default schema. What I usually do is configure this so that a schema looks something like dbt_dkelley_model in dev when I’m doing personal runs, while any automated runs into STG or PRD go into a schema just called model.

This is great because, say we have two users, user1 and user2, who both perform a dbt run into the same schema at the same time: whoever ran last is whose code ends up in that schema. With this separation, you would instead see dbt_user1_model and dbt_user2_model with their respective changes, which helps reduce confusion.

You could also configure the generate_database_name macro to write to a personal workspace instead of the defined databases when someone is working on their own code.

dbt Cloud Deployment

Now that we have our environment configured and have confirmed that our code will run into the correct databases, we can begin building a deployment job in dbt Cloud. To do this, we’ll start by going to the Deploy tab and creating a new environment.

For this environment, I would make sure that you have a service account created with a key pair assigned to it for the deployment credentials. Think of the environment as the profile that you’re creating: you’ll give it a name and a dbt version, optionally provide a custom branch (this defaults to main), and then provide your deployment credentials and the deployment connection.

Once your environment exists, we can create a job.

To create a job, go to Deploy -> Jobs and on this page, you can click Create Job within the environment you want to deploy. For this example, I’m using an environment called TEST. See below:

[Screenshot: the Jobs page in dbt Cloud with the Create Job button]

When creating the job, you can give the job a name, specify the environment, the dbt version, the target name (if you’re using that to specify the database or schema, make sure this is set correctly), the thread count, and any environment variables (again, if you’re using these to specify your database or schema, make sure you create them here).

Once you’ve configured these values, scroll down to the execution settings. Here you can specify the commands, and the order you want them to run in, to perform your runs, tests, and freshness scans; a typical set is shown below. Once this is populated with your instructions, it is time to specify a trigger.
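A fairly common set of execution commands, which you should adjust to your own project, looks like:

    dbt source freshness
    dbt run
    dbt test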

By default, dbt Cloud has three trigger types: schedule, webhooks, and API. The schedule allows you to configure a cron schedule for how often and when you want the job to run. Webhooks are great for setting up CI tests that run based on pull requests into your git repository, while the API allows you to kick off jobs from an API call.

For this example, we’ll turn off the schedule and use the API to cause our code to be deployed into an environment based on a merge request.

Before we go down this path, you will want to visit the dbt Cloud API documentation and get familiar with the different actions you can take on your job. For this example, we’ll use GitHub Actions since this repo is set up within GitHub.

You should be able to perform similar tasks within other repository deployment tools. We will start by creating a .github/workflows/ folder structure within the root of our dbt project; within this workflows directory we can set up the YAML for our deployment. As an example, here’s what the continuous_deployment.yml file I created within my workflows directory looks like:

[Screenshot: the continuous_deployment.yml GitHub Actions workflow]
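The original screenshot isn’t reproduced here, but a minimal sketch of such a workflow might look like the following. The release branch pattern, the secret names (DBT_API_TOKEN, DBT_ACCOUNT_ID, DBT_JOB_ID), and the step names are assumptions for illustration; the curl call targets the dbt Cloud API v2 trigger job run endpoint.

    # .github/workflows/continuous_deployment.yml (sketch)
    name: dbt Cloud Continuous Deployment

    on:
      push:
        branches:
          - 'release/**'   # assumed release-branch naming convention

    jobs:
      trigger-dbt-cloud-job:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout repo
            uses: actions/checkout@v3

          - name: Extract branch name
            id: branch
            run: echo "name=${GITHUB_REF#refs/heads/}" >> "$GITHUB_OUTPUT"

          - name: Trigger dbt Cloud job on this branch
            run: |
              curl --request POST \
                --url "https://cloud.getdbt.com/api/v2/accounts/${{ secrets.DBT_ACCOUNT_ID }}/jobs/${{ secrets.DBT_JOB_ID }}/run/" \
                --header "Authorization: Token ${{ secrets.DBT_API_TOKEN }}" \
                --header "Content-Type: application/json" \
                --data '{"cause": "Triggered by GitHub Actions", "git_branch": "${{ steps.branch.outputs.name }}"}'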

I know there is a lot going on, so let’s start at the beginning. I name the workflow, then specify that it should run on any push into any of my release branches. Next, we begin creating the deployment job. We give the job a name, the OS for the VM to run on, and the steps to perform.

Our steps are to check out the repo, extract the branch name, and then use curl to hit the API and call our job to run on a particular branch. You’ll notice that within the curl call, we make sure to take advantage of our secrets so we don’t accidentally store an API token, account number, or job number within our git repo. This allows us to protect important information while calling the API and overloading the job.

There are many attributes you can provide within the data option that allow you to change how a job operates, but for this example, we just use it to specify the branch we want to run.
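For reference, a request body that layers in a couple of the optional overrides might look like this; the values are illustrative, and the full list of supported attributes is in the dbt Cloud API documentation:

    {
      "cause": "Triggered by GitHub Actions",
      "git_branch": "release/1.2.0",
      "steps_override": ["dbt build"]
    }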

dbt Core Deployment

The nice thing about dbt Core is that it can be similarly easy to create a workflow to execute our code. There are just two things to keep in mind: we will need a profiles.yml to exist for dbt Core to connect with, and we can use a similar workflow setup to handle executing our dbt models.

Let’s start similarly to the dbt Cloud deployment and create a .github/workflows/ folder within the root of our project so we can create our workflow. In the case of dbt Core, that workflow will look something like this:

[Screenshot: the GitHub Actions workflow for dbt Core]
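As before, the original screenshot isn’t reproduced here; a minimal sketch, assuming a Snowflake connection and illustrative secret names, might look like:

    # .github/workflows/continuous_deployment.yml (sketch, dbt Core)
    name: dbt Core Continuous Deployment

    on:
      push:
        branches:
          - 'release/**'

    env:
      # Populated from GitHub secrets so no credentials live in the repo (names are illustrative)
      SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
      SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
      SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

    jobs:
      dbt-build:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout repo
            uses: actions/checkout@v3

          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.9'

          - name: Install dbt
            run: pip install dbt-snowflake

          - name: Install package dependencies
            run: dbt deps

          - name: Build models
            # --profiles-dir . assumes the profiles.yml shown below is committed at the project root
            run: dbt build --target prod --profiles-dir .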

This workflow looks very similar to the one created for dbt Cloud with some minor changes.

First, we create some environment variables that are populated with our GitHub secrets so that we aren’t storing credentials within our repo. Next, we set up our VM: have it install Python, use pip to install dbt, run dbt deps to make sure we have our dependencies, and then run dbt build to execute our code.

The only thing left to do is to create a profiles.yml file that is populated with our environment variables, like so:
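The original post showed this file as a screenshot; a minimal Snowflake-flavored sketch might look like the following, where the profile name must match the profile in your dbt_project.yml and the role, warehouse, database, and schema values are illustrative:

    # profiles.yml (sketch)
    sales:                      # must match the profile name in dbt_project.yml
      target: prod
      outputs:
        prod:
          type: snowflake
          account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
          user: "{{ env_var('SNOWFLAKE_USER') }}"
          password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
          role: TRANSFORMER
          warehouse: TRANSFORMING
          database: sales
          schema: model
          threads: 8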

As you can see, we are just referencing the environment variables that we create in our environment so we can avoid populating our repository with any credentials.

Closing

Put all of this together and you will have an automated deployment pipeline that executes your transformations on a schedule or in response to actions within your git repository.

Need help getting started with Snowflake or dbt? Our team of data experts is happy to assist. Reach out today!

Pro Tips

  1. Make use of the ability to overload the name generation macros to help force your dbt models to run into the correct place.
  2. Make sure to protect your sensitive information using the provided secrets so you don’t accidentally commit a username, password, or API token into your repository.
  3. If your project has any package dependencies, make sure your workflow runs the dbt deps step.
