July 12, 2022

Is dbt a Good Tool for Implementing Data Models?

By Nick Goble

Within data engineering, one of the most frequent tasks is modeling data into data marts and data products. Traditionally, if a company wanted to model some data, there were two types of resources available to accomplish this: data engineers and data analysts.

However, the approach a data engineer takes to model data looks very different from the approach a data analyst takes.

In this blog post, we’ll take a look at how data has historically been modeled by both of these types of resources, introduce the concept of analytics engineering, and explain how dbt bridges the gap between traditional IT resources and business-oriented resources.

We’ll also go through the benefits of utilizing dbt for data modeling and ultimately why it’s a great tool for data modeling.

What Is Data Modeling?

First, it’s important to understand what data modeling is and the different types of data modeling.  Data modeling is part of an overall information architecture and focuses on how we define and analyze data to support business functions.  Traditionally, there are three components to data modeling: conceptual, logical, and physical data modeling.

Conceptual modeling focuses on capturing the data requirements and is used to discuss requirements with the business.  This area of modeling focuses on using terms that are relevant to the business functions and areas rather than things like database names or table names.

A conceptual diagram of data modeling.

Logical modeling drills down into each of these and starts to define the structure of the data elements along with the relationships between them. This is independent of the physical database and serves as a blueprint for the data.  

In the following example, you can see that in the logical data model we’ve begun to represent tables by name, define the columns within those tables, and break payment type out into its own table.

An example of a logical data model

Physical data modeling continues one step further and adds database-specific implementation details such as data partitioning, primary/foreign key constraints, indexes, triggers, or other RDBMS features where applicable.

Once the physical data model has been defined, it is up to data analysts, data engineers, and database administrators to implement it.  This requires an understanding of data definition language (DDL) to define tables, constraints, indexes, etc., as well as data manipulation language (DML) to transform data into data products.
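
For illustration, here is a minimal sketch of the difference, using hypothetical table and column names: the DDL defines the physical table (including constraints from the physical model), and the DML transforms and loads data into it.

    -- DDL: define the physical table, including primary/foreign key constraints
    create table payment (
        payment_id   integer primary key,
        order_id     integer not null references orders (order_id),
        payment_type varchar(20) not null,
        amount       numeric(12, 2) not null
    );

    -- DML: transform data from a staging table and load it into the model
    insert into payment (payment_id, order_id, payment_type, amount)
    select payment_id, order_id, upper(payment_type), amount
    from staging_payments;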

Let’s take a look at how data engineers and data analysts implement data models and build data products.

How Do Data Engineers Implement Data Models?

Depending on the maturity of the data engineer and the organization, how a data engineer models data is likely to vary.  These types of resources are frequently skilled in engines, tools, languages, and platforms such as:

  • Apache Spark
  • Python
  • Scala
  • Git
  • Hadoop
  • Databricks
  • Snowflake
  • Apache Airflow

These all require an in-depth understanding of how to utilize the individual engines, tools, languages, and platforms, with the benefit of allowing a high degree of customization based on business needs.  They can also be leveraged for tasks beyond data modeling, such as orchestration, site reliability engineering (SRE) practices, collection of metrics, version control, and other distributed compute use cases.

Let’s take a closer look at data modeling within the Snowflake Data Cloud and how data engineers would approach this problem.

How Do Data Engineers Implement Data Models in Snowflake?

There are a number of different approaches a data engineer can take when modeling data in Snowflake. Data engineers who are more comfortable and familiar with Apache Spark may favor using Snowpark.

If a data engineer needs to have more control over the orchestration of their data pipelines and models, they may leverage running queries from Apache Airflow.  If the data engineer wants to entirely leverage native Snowflake functionality, they may leverage streams and tasks.
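
As a rough sketch of the native approach, a stream can capture changes on a raw table and a task can periodically apply them to a modeled table. The object, warehouse, and column names below are hypothetical, and the exact options will depend on your environment.

    -- capture changes on a raw table with a stream (hypothetical names)
    create or replace stream orders_stream on table raw.orders;

    -- a task that periodically applies new rows to the modeled table
    -- note: the task must be resumed (alter task ... resume) before it runs
    create or replace task refresh_dim_orders
        warehouse = transform_wh
        schedule = '15 MINUTE'
        when system$stream_has_data('ORDERS_STREAM')
    as
        insert into analytics.dim_orders
        select order_id, customer_id, order_date, status
        from orders_stream
        where metadata$action = 'INSERT';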

Each of these approaches has its pros and cons, which will have to be weighed against the overall goals and standards of your engineering practice.

Some of the larger data modeling challenges with each of these approaches are:

  • Release strategies
  • Version and change tracking
  • Data lineage
  • Data profiling
  • Documentation
  • Testing capabilities

Later in this post, we’ll see how dbt allows you to solve some of these challenges.

How Do Data Analysts Implement a Data Model?

Data analysts are people who gather and interpret data to solve specific problems.  These resources are frequently skilled in SQL, sit closer to the business, and often have finance or business backgrounds.

There are a few ways these resources traditionally interact with data. You may have heard the term “no-code” or “low-code” before. These are types of tools that are built to lower the programming complexity necessary to perform engineering-esque tasks. Within the data modeling space, these types of solutions are frequently used by data analysts. A few popular tools include:

  • Alteryx
  • Dataiku
  • Knime
  • Matillion
  • Power Automate
  • Tableau Prep

Each of these tools gives your organization a lot more than just data modeling and solves some of the challenges that data engineers face (data lineage, data profiling, etc.), but it may require you to adopt the tool across your entire data practice to fully solve those challenges.

Organizations will frequently start their data practice with these types of low/no-code solutions. It’s easier for analysts, who are very close to the business needs and data usage, to model data products themselves than it is for the business to translate its needs to an engineering team. However, utilizing some low-code tools may result in large trade-offs, including:

  • Expensive licenses
  • Vendor lock-in
  • Limited version control and change tracking capabilities
  • Lack of CI/CD process
  • Limited customization
  • Difficult to follow proven software development best practices

These tools aim to enable users with business domain knowledge to build data products in a self-service fashion. As a data product becomes more critical to the organization, it can be productionized in dbt so that it benefits from the software development practices and maturity model that production code requires.

As your organization scales, it’s common to need more customization of data models and pipelines, more SRE capabilities, and more testing capabilities than these tools typically support.

How Do You Implement a Data Model with dbt?

Up to this point, we’ve covered two approaches to data modeling: one typically employed by data engineers, and the other by data analysts. However, your organization or team’s skill set may fall somewhere between those two.

This has led to the coining of a new term: analytics engineering. An analytics engineer’s role is to transform the organization’s data to be positioned for analysis, bringing clean and reliable data to the business in a format they can access and understand.

dbt is built for this persona (and dbt Labs initially coined the term). This persona will typically have traits of both the data analyst and the data engineer. The analytics engineer often combines the business acumen and strong data skills of a data analyst with a data/software engineer’s best practices, such as version control and execution against central data warehouses.

To implement dbt, data analysts may need to skill-up in git and partner with IT for initial setup, but they will find it gives a familiar SQL interface for building out data models. Data engineers can also leverage dbt for managing and executing data model transformations in their existing pipelines with few trade-offs and some skill-up in the business context of the data.

Let’s take a look at an example data model being built with dbt. Next, we’ll explore how each of these personas would test and promote their changes to production.

Implementing Data Models in dbt

One of the first things you’ll need to determine is whether you’re using dbt Cloud (managed service) or dbt Core (open source software). The main functionality of dbt is available in both solutions, with dbt Cloud additionally giving you a managed environment, an integrated development environment (IDE), and scheduling capabilities.

As a general recommendation, if you’re a data analyst, use dbt Cloud. If you’re a data engineer, you can likely leverage dbt Core paired with your existing stack.

To start data modeling with either version of dbt, you’ll need to know a few things:

  • Where your data resides
  • If you want to create views or tables
  • If your data needs to be sorted or distributed by certain columns
  • If your data model is incremental or full load (see the sketch after this list)
  • What your testing requirements are
  • What type of data modeling you need in your data product (tabular, dimensional, data vault, etc.)
  • Which git repository will host your project
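
For example, the incremental-versus-full-load decision is expressed directly in a model’s config block. Below is a minimal sketch, assuming a hypothetical raw.events source (declared in the project’s sources YAML) with an updated_at column; on a full-refresh run dbt rebuilds the table, while on incremental runs it only processes new rows.

    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        user_id,
        event_type,
        updated_at
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
    -- on incremental runs, only pull rows newer than what is already loaded
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}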

Within a dbt project, we have two different types of files: config files and SQL files.  

Config files are used to do things such as define sources, configure paths, set versions, and tell dbt how we want our models to be built in the target environment.  These configs mainly live in your dbt_project.yml file.  You can find an example of this configuration in the dbt documentation.

SQL files are where you define your data models. These generally take the form of a configuration block (using Jinja), common table expressions or other intermediate transformations, and then a final select statement that gives us our transformed result set.
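
As a simplified sketch, a model file might look like the following, assuming a staging model named stg_customers already exists in the project (the file path and column names are hypothetical):

    -- models/marts/customers.sql (hypothetical file name)
    {{ config(materialized='view') }}

    with customers as (
        -- pull from another model in the project
        select * from {{ ref('stg_customers') }}
    ),

    final as (
        select
            customer_id,
            first_name,
            last_name
        from customers
    )

    select * from final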

In the above SQL example, if you’re unfamiliar with dbt or Jinja, you may have noticed this odd-looking syntax:

    select * from {{ ref('stg_customers') }}

This is where the major value of dbt starts to come into play. So, what is going on here? The double curly braces are Jinja syntax, and the ref() function tells dbt that this model depends on another model in the project, in this case stg_customers.  This tells dbt how different pieces of data are related and gives us the ability to determine things like data lineage and the order in which to execute our SQL files.

For example, if we need table A to be defined before we execute our model, we can simply include table A as a ref in our model, and dbt will know to build that table before executing our model. dbt builds this into what is known as a directed acyclic graph or DAG.
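
To make that concrete, a downstream model that selects from the customers model sketched earlier might look like this (dim_customers is a hypothetical name):

    -- models/marts/dim_customers.sql (hypothetical file name)
    select
        customer_id,
        first_name,
        last_name
    from {{ ref('customers') }}

Because of the ref() call, dbt knows to build stg_customers, then customers, and only then dim_customers when the project runs.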

Example of a dbt DAG

Deploying Changes to a Model with dbt

Once you’ve made changes to existing data models or added new ones, you’ll need to deploy those changes to your target environment.  How you do this will depend on your specific dbt setup. It could look like:

  • Executing dbt run on your local machine that’s connected to your source/target system
  • Executing dbt run as part of your CI/CD pipeline (GitHub Actions, Jenkins, etc.)
  • Leveraging dbt Cloud to run your changes against a development connection via the cloud IDE
  • Leveraging dbt Cloud to run your changes against a development connection using git hooks and a git pull request

You’ll also want to consider testing your models when you deploy your changes. dbt gives you the ability to define expectations that your data should follow and a command to invoke your test suite. Tests come in two categories: singular and generic.

Singular Tests

A singular test allows you to write a query that you expect to return zero records. For example, if you expect all your invoice amounts to be greater than or equal to zero, you could write a query that returns invoices with a value less than zero; if any results are returned, the test fails.
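
A minimal sketch of that singular test, assuming a hypothetical invoices model with an amount column (the file would live in your project’s tests directory):

    -- tests/assert_no_negative_invoices.sql (hypothetical file name)
    -- the test fails if this query returns any rows
    select
        invoice_id,
        amount
    from {{ ref('invoices') }}
    where amount < 0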

Generic Tests

Generic tests allow you to write parameterized SQL statements that can be reused across multiple models or multiple times for the same model. dbt also provides you with some generic tests out-of-the-box for common things like checking for accepted values, whether values contain null, and whether values are unique.  
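
As a sketch, a custom generic test is defined once with parameters and can then be attached to any model or column; the test name below is hypothetical.

    -- tests/generic/not_negative.sql (hypothetical file name)
    {% test not_negative(model, column_name) %}

    select *
    from {{ model }}
    where {{ column_name }} < 0

    {% endtest %}

Once defined, it can be applied to columns in your YAML properties files in the same way as the built-in unique and not_null tests.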

If you’re leveraging dbt Cloud, you’ll configure your dbt job to run dbt test as part of its execution.  Otherwise, you’ll need to invoke dbt test manually with dbt Core.

Generating Documentation with dbt

Documentation is an incredibly important and valuable piece of data modeling that organizations frequently overlook. Documenting the data points in your model (what they’re used for, where they came from, and even their data types) enables your organization’s data governance strategy and information architecture.

Many low/no-code tools build this for you, but data engineers frequently need to leverage additional tooling to capture this information. dbt gives you a one-stop shop for building data models, testing them, and documenting them.

If you’re leveraging dbt Cloud, this documentation can be generated automatically based on your job configuration. If you’re using dbt Core, you’ll need to generate the documentation yourself with dbt docs generate and serve it locally with dbt docs serve to view it in your browser.

Example of served documentation

dbt is also working to add support for entity relationship diagram (ERD) documentation to show how tables relate to each other.

So Is dbt a Good Tool For Implementing Data Models?

Yes! dbt is a fantastic tool to add to your organization for implementing data models. It also provides functionality that supplements data governance and other aspects of an information architecture.  

dbt empowers both traditional data engineers and data analysts while allowing your organization to follow best practices such as version control, testing, and documenting your data models and transformations.  

Organizations can leverage the open-source variant of dbt to keep license costs low, or they can leverage dbt Cloud for simplicity and ease of use. While there is some vendor lock-in in configuring your data models in dbt, the majority of your code is standard SQL, which can be ported to another platform relatively easily if you decide to switch later down the road.

Have dbt questions? We have answers! Reach out to the dbt experts at phData today for unbiased advice, helpful insights, and answers to your toughest questions.
