February 6, 2024

What Are dbt Execution Best Practices?

By Kenton Steiner

One of the more common practices when developing a data pipeline is rebuilding your data to test changes. As one of the industry leaders, dbt provides several options for executing your pipelines so you can work efficiently and run exactly what you need.

In this blog, we’ll discuss the basics of dbt build, run, and test, when you should use each of the commands, and finish with some tips on how to optimize your pipelines leveraging dbt.

What is dbt Run + dbt Test?

The first method of execution with dbt is using a combination of dbt run and dbt test. The dbt run command executes compiled SQL model files against the current target database, creating or replacing the tables and views defined by your models in your data warehouse.

You can also pick and choose which models to run or exclude using the same selection syntax as dbt build, which is very useful when developing and testing new models. No checks are performed before or after execution, so you must handle that separately with dbt test.

The dbt test command runs the tests you have defined on models, sources, snapshots, and seeds that have already been created. It provides the same selection syntax as the build and run commands, allowing you to run only specific types of tests or tests on specific models.
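
For example, while iterating on a single model, a typical loop looks something like this (my_model is a placeholder name):

dbt run --select my_model
dbt test --select my_model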

What is dbt Build?

The second method of executing a data transformation pipeline with dbt is using dbt build, which is a combination of running dbt run, dbt snapshot, dbt seed, and dbt test. The dbt build command will first run and then test your resources immediately, including models, seeds, snapshots, and tests, in a directed acyclic graph (DAG) order. 

The build creates a single manifest with representations of all of your resources and outputs a single run results artifact at the end, detailing the outcome of each resource during the build, such as run time and test failures. If any upstream resource fails, the execution of its downstream resources will be skipped.

So if model_3 depends on model_1 and a test fails for model_1, the execution of model_3 will be skipped. The build command also allows you to select specific resources with the standard selection syntax (--select, --exclude, --selector, and --resource-type) or run against the entire project.
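
As a rough sketch of that selection syntax (my_model and the deprecated tag are placeholders), you could run:

dbt build --select my_model+
dbt build --exclude tag:deprecated
dbt build --resource-type seed

The first builds a model and everything downstream of it, the second skips anything tagged deprecated, and the third builds only seeds.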

When to use dbt Build vs dbt Run + Test

Each execution method has its own strengths that can be leveraged in your projects. dbt build is most commonly used as an all-in-one command for more complex projects that require compiling, testing, and snapshots. If you need to ensure a certain level of data quality, dbt build is best, as it includes testing as part of the run.

In contrast, dbt run works better for more targeted executions where you need more control over individual tasks. One such use case is actively developing a new model: with dbt run, the selectors let you run only that model until you are satisfied with it, leaving out downstream tests until you are ready for them. Testing is a separate step from dbt run, so you must execute dbt test separately after the models run.

When testing and quality are an important part of your development workflow, dbt build can simplify things greatly and give you peace of mind, as it takes care of testing by default.

To summarize, for pipelines that run on automation, such as CI/CD or production refreshes, you will likely want to use dbt build. While there are cases where dbt run + test can work in those scenarios, they are infrequent; that flow should mainly be used during active development or in non-production environments.

Tips for Optimizing Your dbt Pipelines

Aside from choosing which commands to use for running your pipelines, dbt also offers a number of other features to optimize these pipelines. The two we will call out here are selector methods and data manipulation.

Selector Methods

Selector methods are one of the most powerful ways to add filters to your project so you only run the necessary resources. There are currently 19 different selector methods, with some of the most powerful being tag, config, group, state, result, and source_status. 

Tag

The tag method selects all resources that match a specified tag. This lets you group resources that should run together and run each group only when needed, rather than running everything and incurring unnecessary cost.

dbt test --select "tag:cicd"

In the above example, dbt will run all the tests that are tagged “cicd”. This can be especially helpful when you have regression tests and only want them to run during CI/CD runs.
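
For that selector to pick anything up, the tag has to be applied somewhere. A minimal sketch of tagging a generic test in a properties file (my_model and its id column are placeholders):

version: 2

models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              tags: ["cicd"]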

Config

The config method will select all resources that match a specified configuration parameter.

Given a configuration block of:

{{ config(
    materialized = "incremental",
    unique_key = ['id', 'column_x'],
    transient = true
) }}

You can run a variety of select statements to select this and other similar resources:

dbt run --select "config.materialized:incremental"
dbt run --select "config.unique_key:id"
dbt run --select "config.transient:true"

The first statement will select any model materialized incrementally, the second will select models with id in their unique key, and the third will select any transient models. This selector can match booleans, dictionary keys, and values in lists, as well as strings.

Group

The group method is similar to the config method in that you assign a group to your resources in the config. Grouping models lets you restrict access to them, so that models marked private can only be referenced from within their own group.

{{ config(
    materialized = "incremental",
    unique_key = ['id', 'column_x'],
    transient = true,
    group = "Group_1"
) }}

With the above config, you can run:

dbt run --select "group:Group_1"

This command will run all the models in Group_1.
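
Groups also pair with model access. A minimal sketch of a properties file (my_private_model and the owner name are placeholders) that keeps a model referenceable only within its group:

version: 2

groups:
  - name: Group_1
    owner:
      name: Data Team

models:
  - name: my_private_model
    group: Group_1
    access: private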

State

The state method is one of the most powerful that dbt offers. It works by comparing nodes in the current project against a previous version of the project. It leverages the project manifest to determine what is new, modified, unmodified, and old. 

Some examples of the method being used are:

dbt test --select "state:modified" --state path/to/artifacts
dbt run --select "state:new" --state path/to/artifacts

In the first statement, dbt will execute new or modified tests and tests on new or modified resources. The second statement will execute only new models, meaning models with no matching unique ID in the manifest from the previous version. In both cases, the --state argument points dbt at the artifacts from that previous version to compare against.

The state:modified selector also supports sub-selectors to drill further into what changed, allowing you to select by body (the SQL of models, seed values), configs, relation (what the resource looks like in the database), persisted descriptions (relation or column-level descriptions), macros, and contract (the names and data types of columns).
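
A quick sketch of those sub-selectors in use, again comparing against the artifacts from a previous run:

dbt run --select "state:modified.body" --state path/to/artifacts
dbt test --select "state:modified.contract" --state path/to/artifacts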

On the flip side of the modified and new states are the unmodified and old states, which represent resources with no changes and a matching unique ID in the manifest. Together, these four selectors are central to Slim CI, as they let you run and test only the resources affected by a change, saving a lot of time and compute on your job executions.

Result

The result method selects resources based on their result status from a previous run of a dbt command, either run, test, build, or seed. The selector will only work correctly if the command has been run prior to the current run. 

Two of the most useful statuses to select are error and fail. A common use of this selector is to re-run a dbt build, selecting all the models that failed or errored out on the prior run.

dbt build --select result:error+ result:fail+ --state path/to/artifacts

The example command will rerun all models and tests with errors or failures and anything downstream from those models based on the state of the last run provided in the --state argument. 

With these selectors, your pipeline can perform smart reruns of just the resources that were not successful, rather than troubleshooting and having to execute all the resources again, saving time and compute.

If you are a more advanced dbt user, dbt Cloud also provides an API endpoint for smart reruns of a job, executing from the point of failure rather than running the full job again.

# [POST]
https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/rerun/

By substituting your account ID and the job ID of the failed job into this URL, you will kick off another run from the point where the original failed, or a full new run if the prior run was successful. The response from the API contains a boolean indicating whether the run was queued on the account. If you are unfamiliar with the dbt Cloud API and want to learn more, check out the dbt Cloud API documentation.
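
A minimal sketch of calling the endpoint with curl, assuming a dbt Cloud service token stored in DBT_CLOUD_API_TOKEN and your own account and job IDs substituted into the URL:

curl -X POST \
  -H "Authorization: Token $DBT_CLOUD_API_TOKEN" \
  "https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/rerun/"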

Source Status

The source_status selector relies on dbt source freshness. dbt allows you to add a freshness block to your source’s config which will specify the amount of time that can pass before the source should be refreshed.

version: 2

sources:
  - name: database_1
    database: source
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    loaded_at_field: _etl_loaded_at

    tables:
      - name: table_1
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}

      - name: table_2

      - name: table_3
        freshness: null

In the above example, a freshness block is added to the source database_1, which will warn after 24 hours and error after 48 hours based on the field _etl_loaded_at. The warn_after and error_after values are required, or dbt will not calculate the source’s freshness; by default, these source-level values apply to every table in the source unless overridden.

In our example, the source-level freshness values apply to table_2, since it defines no freshness of its own. table_1 overrides the defaults with its own thresholds, and table_3 sets freshness to null, which disables the check entirely. As of dbt 1.7, you no longer have to provide a loaded_at_field to calculate freshness; if the field is not provided, freshness is calculated from warehouse metadata tables.

Calculating freshness for a source is done with the command:

dbt source freshness

Running this command produces a sources.json artifact containing the latest load timestamp for each source, which dbt compares against the artifact from the previous run. When dealing with large tables, you can also add a filter to the freshness configuration to prevent a full table scan.
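
As a sketch, that filter lives inside the freshness block; the SQL expression below is illustrative and will vary by warehouse:

      - name: table_1
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          filter: _etl_loaded_at >= dateadd(day, -1, current_timestamp())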

With these freshness results, you can leverage the source_status method.

dbt build --select source_status:fresher+ --state path/to/artifacts

This command takes the results of the freshness command and reruns and tests any models whose sources have been updated since the last run, skipping anything whose underlying data has not changed.

These commands are especially powerful when used in combination with each other, such as:

dbt source freshness
dbt build --select state:modified result:fail source_status:fresher+ --state path/to/artifacts

These commands calculate the current freshness of the sources, then run any models that have refreshed source data, have been modified, or failed on the prior execution. By executing only the models that need attention, you avoid wasting time and compute on those that don’t.

Data Manipulation

One of the other powerful ways to optimize your pipelines with dbt is to work with appropriately sized samples of your data. You can create a macro to grab a sample of the dataset, as shown in the example below.

{% macro sample(n, sample_type='ROW') %}
    {% if target.name != 'PROD' %}
        SAMPLE {{ sample_type }} ({{ n }})
    {% endif %}
{% endmacro %}

This macro checks whether the target is the production environment and, if not, appends a SAMPLE clause so the model executes against only a subset of the data, minimizing the compute and storage used in lower environments.
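
A sketch of how the macro might be dropped into a model (stg_orders is a hypothetical staging model; 10 is passed straight through as the sample size):

select *
from {{ ref('stg_orders') }} {{ sample(10) }}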

You can also optimize testing on large volumes of data by scanning only a sliding window of recent records. The example below demonstrates how you can use a macro to substitute dynamic values into the test config so it only covers data from the last 7 days, eliminating the need to hardcode and update values.

version: 2

models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              config:
                where: "date_column > __window__"

{% macro get_where_subquery(relation) -%}
    {% set where = config.get('where') %}
    {% if where %}
        {% if "__window__" in where %}
            {# replace placeholder string with result of custom macro #}
            {% set window = dbt.dateadd('day', -7, current_timestamp()) %}
            {% set where = where | replace("__window__", window) %}
        {% endif %}
        {%- set filtered -%}
            (select * from {{ relation }} where {{ where }}) dbt_subquery
        {%- endset -%}
        {% do return(filtered) %}
    {%- else -%}
        {% do return(relation) %}
    {%- endif -%}
{%- endmacro %}

Closing

With much concern around the cost efficiency of data pipelines in a continually growing data space, dbt has offered various solutions to put your mind at ease. Whether it be different ways to execute tasks, methods of filtering resources, or creating subsets of your data, there is no shortage of ways to ensure you are running what you need and using only the data you need across your entire project.

If your organization is looking to succeed with dbt, phData would love to help!

As dbt’s 2023 Partner of the Year, our experts will ensure your dbt instance becomes a powerful transformation tool for your organization.
