dbt Labs’ Coalesce 2023 Recap

Welcome to the lowdown on the Coalesce Conference, where the data community’s pulse thumps at its strongest, hosted by dbt Labs. Coalesce is where the brightest minds in data — from data practitioners to the largest companies — converge to share, inspire, and reshape the landscape with dbt, the gold standard tool for data transformation.

The conference was filled with sessions, each offering a deep dive into dbt’s role in data strategy and innovation. Attendees didn’t just watch; they immersed themselves in hackathons, tackled certification exams, and enjoyed happy hours. There was a palpable sense of anticipation for the unveiling of dbt’s new features and a celebratory cheer for the victors at the awards ceremonies.

In this blog, we’ll spotlight the transformative announcements that emerged from the Coalesce Conference. Join us as we navigate the key takeaways defining the future of data transformation.

dbt Mesh

Enterprises today face the challenge of managing massive, intricate data projects that can slow down innovation. In mid-2023, many companies were wrangling with more than 5,000 dbt models.

Figure 1: Number of companies handling more than 5,000 models. Source: The next big step forward for analytics engineering

Picture the hustle it takes to keep that many models in line—ensuring they’re reliable, the dependencies make sense, and the data is solid. It’s a jigsaw puzzle on a massive scale, and every piece has to be just right. That’s no small feat for any data team!

dbt Mesh arrives to solve these problems, providing a structure that splits data projects into smaller, clearly bounded portions, where each team takes charge of its own data domain while sharing it across the company. But you don’t have to be a large company to use dbt Mesh. No matter how big you are, you can leverage dbt Mesh to get:

  • Empowerment through Decentralization: Teams get domain-specific autonomy, allowing them to build and own their business logic without cross-team dependencies that often lead to bottlenecks.

  • Streamlined Processes: Domain-level teams operate independently, governed by standards that ensure both quality and compliance while enabling faster development and delivery of data products.

  • Enhanced Collaboration: dbt Mesh fosters a collaborative environment by using cross-project references, making it easy for teams to share, reference, and build upon each other’s work, eliminating the risk of data silos.

  • Trust and Quality at the Forefront: Domain experts are empowered to create business logic that they understand best, ensuring the delivery of quality data products that stakeholders can trust.

But don’t think dbt Mesh is a standalone product; actually, it’s a pattern powered by a suite of interconnected dbt features:

Cross-project references: This is what enables multi-project deployments. The {{ ref() }} macro can now be used across dbt Cloud projects on Enterprise plans. You just need to pass two arguments instead of one, with the first being the project’s name.

				
with my_cte as (
    select * from {{ ref('another_project', 'downstream_model') }}
),
...
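
One detail the snippet above leaves out: as far as we understand the setup, the project doing the referencing also has to declare the upstream project as a dependency. Here is a minimal sketch, assuming the upstream project is named another_project (as in the ref() call above) and that the file sits next to dbt_project.yml in the project that consumes the model:

```yaml
# dependencies.yml in the project that calls ref('another_project', ...)
projects:
  - name: another_project  # must match the upstream project's name
```

Only upstream models marked with access: public can be referenced this way, which is where the governance settings described next come in.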
				
			

Governance: With dbt’s new governance capabilities, you can precisely manage who gets to access and interact with your dbt models, both within a single project and across multiple ones. To do that, in your models’ .yml configuration file, use the group key (which assigns models to a named group within a project) and the access key (which determines who can reference a model).

				
# First, define the group and owner
groups:
  - name: my_group
    owner:
      name: My Group's Name
      email: my_group_email@jaffle.shop

# Then, add 'group' + 'access' modifier to specific models
models:
  - name: my_dimension_model
    group: my_group
    access: public # Public models can be referenced by any group/team/project, they should be stable & mature

  - name: my_intermediate_model
    group: my_group
    access: private # Private models can only be referenced by models in the same group

  - name: my_staging_model
    group: my_group
    access: protected # Protected models can only be referenced by models in the same project
				
			

Model Versions: You can now treat data models like stable APIs when coordinating across projects and teams, and model versioning is the structured way to manage the lifecycle of models as they grow and change.

				
version: 2

models:
  - name: model_name
    versions:
      - v: <version_identifier> # required
        defined_in: <file_name> # optional -- default is <model_name>_v<v>
        columns:
          # specify all columns, or include/exclude columns from the top-level model YAML definition
          - include: <include_value>
            exclude: <exclude_list>
          # specify additional columns
          - name: <column_name> # required
      - v: ...

    # optional
    latest_version: <version_identifier>
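
To make the template above more concrete, here is a small sketch of what a versioned model could look like. The model name dim_customers and its columns are hypothetical and only there for illustration: version 2 drops the country_name column, while version 1 stays the latest version, so version 2 is treated as a prerelease.

```yaml
models:
  - name: dim_customers        # hypothetical model name
    latest_version: 1          # unpinned ref() calls resolve to v1
    columns:
      - name: customer_id
        data_type: int
      - name: country_name
        data_type: varchar
    versions:
      - v: 1                   # inherits all top-level columns
      - v: 2                   # breaking change: country_name is removed
        columns:
          - include: all
            exclude: [country_name]
```

Downstream models that need a specific version can pin it in the reference, for example {{ ref('dim_customers', v=1) }}, which gives consumers time to migrate before the latest version moves on.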
				
			

Model Contracts: Data contracts are the reliability piece of the mesh. They establish clear expectations for a model’s structure, so that upstream changes in dbt or within a project’s logic cannot silently break downstream processes. To use them, enforce the contract in the model’s .yml configuration file and define each column’s data_type and, where needed, constraints (such as not_null and unique).
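
As an illustration, a contract could be sketched roughly like this. The model and column names are hypothetical, and the data types should be adapted to your warehouse:

```yaml
models:
  - name: dim_customers          # hypothetical model name
    config:
      contract:
        enforced: true           # build fails if the model's columns or types differ
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
          - type: unique
      - name: customer_name
        data_type: string
```

Keep in mind that how strictly constraints are enforced depends on the data platform; some warehouses accept a constraint definition without actually enforcing it, so it is worth checking the behavior for yours.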

dbt Explorer

Managing thousands of models is about more than just keeping them reliable; it’s also a real headache when you’re trying to figure out how they all connect. dbt Mesh comes to the rescue by slimming down those complex connections, making each project’s lineage graph much more manageable, but at the cost of seeing less of the overall picture.

Figure 2: Lineage graph of a project with thousands of models. Source: Coalesce 2023 Product Spotlight and Keynotes

With dbt Explorer, you get rid of these problems. You now have the flexibility to view your lineage in full detail or drill down to specific sections, all through a straightforward and intuitive interface. It’s about more than just looking at one project; dbt Explorer lets you see the lineage across different projects, ensuring you can track your data’s journey end-to-end without losing track of the details.

Figure 3: Multi-project lineage graph with dbt Explorer. Source: Dave Connor's Loom.

dbt Explorer is a feature-rich tool for users with multi-tenant or AWS single-tenant dbt Cloud accounts on the Team or Enterprise plan, providing comprehensive lineage and metadata analysis capabilities. Here’s a summary of its core features:

  • Prerequisites: Requires a dbt Cloud account and successful job runs in a production environment for exploration.

  • Metadata Generation: Utilizes metadata from the Discovery API to display the current project state, updating automatically after each production job run.

  • Lineage Graph Visualization: Offers interactive lineage graphs of project DAGs, with color-coded nodes and iconography for different resource types, which can be zoomed, panned, and explored in detail.

  • Resource Interaction: Allows for detailed inspection of resources, including their metadata, by hovering, selecting, and interacting directly with the lineage graph.

  • Search Functionality: Supports keyword and advanced selector method searches to navigate resources efficiently, including graph and set operators for refined queries.

  • Sidebar Navigation: Provides a catalog sidebar for browsing resources by type, package, file tree, or database schema, reflecting the structure of both dbt projects and the data platform.

  • Version Tracking: Displays version information for models, indicating whether they are prerelease, latest, or outdated.

  • Detail Inspection: Enables access to the definition and latest run results of any resource, with a detailed view that includes a status bar, various informational tabs, and a model’s lineage graph.

  • Cross-Project Lineage: Shows project-level lineage, detailing cross-project resource usage and dependencies, with different iconography for upstream and downstream project relationships.

  • Project-Level Exploration: Allows for the examination of all projects in an account, providing search capabilities and the ability to view the details of public models and project dependencies.

Figure 4: Multi-project lineage graph with dbt Explorer. Source: Dave Connor's Loom.

dbt Explorer stands out as a versatile platform for users to manage and understand complex dbt Cloud projects, providing insights into the lineage and metadata that drive data transformation projects.

Read more about dbt Explorer: Explore your dbt projects

dbt Semantic Layer: Relaunch

The dbt Semantic Layer is an innovative approach to solving common data consistency and trust challenges. It leverages dbt, a popular transformation solution utilized by thousands of organizations, to create the well-defined, clean data foundation that an effective semantic layer needs.

The dbt Semantic Layer facilitates collaboration, version control, and smooth migration between different tools and platforms by integrating with various data platforms and supporting metric definitions in code.

This layer is powered by MetricFlow, which brings a more sophisticated framework for defining metrics. The dbt Semantic Layer allows for defining semantic models and metrics on top of dbt models in a code-based environment. At query time, it dynamically generates a semantic graph that connects all the defined metrics and dimensions, making them readily available through robust APIs for downstream tools.

This ensures that every member of the organization has access to a consistent, transparent, and context-rich understanding of key metrics, empowering them to make informed decisions and independently explore data with confidence. 

Here’s a quick summary of the dbt Semantic Layer:

  • Centralized Metric Definitions: The Semantic Layer becomes the single source of truth for metrics across your organization, reducing redundancy and ensuring uniformity in reporting.

  • Streamlined Metric Creation and Management: With MetricFlow, you can easily establish and oversee company metrics through flexible abstractions and SQL query generation.

  • Efficient Data Retrieval: Quick access to metric datasets from your data platform is made possible by MetricFlow’s optimized processes.

  • Seamless Integration with Downstream Tools: The setup process is tailored to enable consistent metric access across a variety of analytics and business intelligence tools.

    • Tableau (beta)

    • Google Sheets (beta)

    • Hex

    • Klipfolio PowerMetrics

    • Lightdash

    • Mode

    • Push.ai

    • Delphi

  • Prerequisites and Compatibility: It requires a dbt Cloud Team or Enterprise account and supports popular data warehouses like Snowflake, BigQuery, Databricks, and Redshift.

  • Development and Deployment: You’re guided through creating semantic models, defining dimensions and measures, and configuring your deployment environment to harness the full potential of the Semantic Layer.

  • Customized Querying Capabilities: Metrics defined can be queried directly in dbt Cloud or via API in downstream tools, with the necessary configurations outlined for both dbt Cloud and Core users.

  • Service Token Authentication: To connect and query metrics in downstream tools, a service token with specific permissions is required, ensuring secure access.

  • Typical Scenarios:

    • Business intelligence (BI), reporting, and analytics

    • Data quality and monitoring

    • Governance and privacy

    • Data discovery and cataloging

    • Machine learning and data science

Below is a complete semantic model for the new dbt Semantic Layer, taken from the dbt docs. Semantic models are defined in a .yml configuration file. The definition begins with the model defaults. Entities are specified next; these are typically ID columns.

Finally, dimensions and measures are defined to fully encapsulate the data’s characteristics. Dimensions serve as the non-aggregatable properties of your records, offering categorical or temporal context that enhances the interpretability and richness of your metrics. 

Measures, on the other hand, are the building blocks upon which metrics are constructed. Through the aggregation capabilities of MetricFlow, these measures are transformed into insightful and actionable metrics.

				
semantic_models:
  #The name of the semantic model.
  - name: orders
    defaults:
      agg_time_dimension: ordered_at
    description: |
      Order fact table. This table is at the order grain with one row per order.
    #The name of the dbt model and schema
    model: ref('orders')
    #Entities. These usually correspond to keys in the table.
    entities:
      - name: order_id
        type: primary
      - name: location
        type: foreign
        expr: location_id
      - name: customer
        type: foreign
        expr: customer_id
    #Measures. These are the aggregations on the columns in the table.
    measures:
      - name: order_total
        description: The total revenue for each order.
        agg: sum
      - name: order_count
        expr: 1
        agg: sum
      - name: tax_paid
        description: The total tax paid on each order.
        agg: sum
      - name: customers_with_orders
        description: Distinct count of customers placing orders
        agg: count_distinct
        expr: customer_id
      - name: locations_with_orders
        description: Distinct count of locations with order
        expr: location_id
        agg: count_distinct
      - name: order_cost
        description: The cost for each order item. Cost is calculated as a sum of the supply cost for each order item.
        agg: sum
    #Dimensions. Either categorical or time. These add additional context to metrics. The typical querying pattern is Metric by Dimension.
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
      - name: order_total_dim
        type: categorical
        expr: order_total
      - name: is_food_order
        type: categorical
      - name: is_drink_order
        type: categorical
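
The semantic model above only declares measures; as far as we understand MetricFlow, a measure becomes queryable once a metric is declared on top of it. Here is a minimal sketch of two simple metrics built on the order_total and order_count measures defined above. The metric names, labels, and descriptions are illustrative assumptions:

```yaml
metrics:
  - name: order_total
    label: Order Total
    description: Sum of the total amount across orders.
    type: simple
    type_params:
      measure: order_total     # measure from the orders semantic model
  - name: orders
    label: Orders
    description: Count of orders.
    type: simple
    type_params:
      measure: order_count     # measure from the orders semantic model
```

With these in place, the typical query pattern of "metric by dimension" would be, for example, order_total by ordered_at at daily granularity.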


				
			

Once your semantic models are set, launch a production job in dbt Cloud to materialize your metrics. Then, configure the dbt Semantic Layer on both environment and project levels. This setup enables querying of your metrics through JDBC tools or direct integration with the dbt Semantic Layer.

The promise of dbt’s Semantic Layer lies in its ability to democratize data understanding and access, thereby reducing data inequality and fostering a unified approach to data-driven decision-making across an organization.

Read more about the dbt Semantic Layer at Explore the dbt Semantic Layer.

dbt Cloud CLI

For those who miss their code editor when working with dbt Cloud, the platform’s CLI feature bridges the gap between the comfort of your local development environment and the robust capabilities of dbt Cloud. 

It empowers teams to run dbt commands directly from the local command line, enabling a mix of familiarity and advanced cloud functionality.

  • Seamless Local-Cloud Integration: Work with dbt Cloud straight from your local setup, executing commands without leaving your beloved code editor.

  • Secure and Convenient: Credentials are safely managed in dbt Cloud, simplifying your workflow while keeping security tight.

  • Collaborative Development: With support for dbt Mesh, the CLI promotes effective teamwork through cross-project references.

  • Optimized Builds: Experience the advantage of quicker and more cost-efficient builds courtesy of dbt Cloud’s infrastructure.
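
To give a feel for the setup, our understanding is that the Cloud CLI identifies which dbt Cloud project to run against through a small block in dbt_project.yml, while the credentials themselves live in a dbt_cloud.yml file downloaded from dbt Cloud. A minimal sketch, where the project name and the project ID are placeholders:

```yaml
# dbt_project.yml (excerpt)
name: my_project            # hypothetical project name
dbt-cloud:
  project-id: 123456        # placeholder; use your dbt Cloud project's ID
```

Treat this as a starting point and check the dbt Cloud CLI documentation for the exact steps that apply to your account.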

Figure 5: dbt Cloud CLI. Source: Coalesce 2023 Product Spotlight and Keynotes

Read more about the dbt Cloud CLI:

Continuous Integration & Deployment Dedicated Jobs

dbt Cloud has refined its job orchestration by introducing two specific job types: deploy jobs for building production data assets and continuous integration (CI) jobs for validating code changes. 

This enhancement streamlines the user experience with tailored defaults and a guided setup, embedding best practices into each job type’s configuration.

Figure 6: dbt Cloud new job types. Source: Update: Improvements to dbt Cloud continuous integration

The improvement is most significant for CI jobs, as dbt Cloud is now smarter about when to run builds and tests. State comparisons between pull request (PR) code and production code are more efficient, minimizing unnecessary builds, and dbt Cloud can now skip tests on unchanged code by deferring to an environment rather than an individual job.

Deploy jobs are designed to build into production databases, running sequentially to prevent conflicts. CI jobs, in contrast, are built to test new code, running in parallel to speed up team workflows. They also feature smart cancellation policies to avoid redundant work, increasing overall efficiency.

To maximize the efficacy of CI jobs, dbt Labs recommends setting them up in a dedicated environment linked to a staging database, ensuring isolation from production data builds. This setup allows for PR-based triggers, and with dbt Cloud’s native Git integration, it’s now simpler to manage these CI processes across GitHub, GitLab, or Azure DevOps.

To implement jobs, users must have a dbt Cloud account and the appropriate permissions. These jobs can be triggered via schedule or events, ensuring your data assets are always up-to-date.

Both CI and deploy jobs benefit from a detailed overview within dbt Cloud, providing insights into job triggers, commits, run timing, and step-by-step logs. This transparency aids in troubleshooting and optimizing the data build process.

Figure 7: dbt Cloud new job types comparison. Source: Update: Improvements to dbt Cloud continuous integration

In summary, dbt Cloud’s new CI and deploy job types elevate the efficiency and precision of building and testing data assets, offering organizations a robust and intelligent approach to data transformation workflows.

Fivetran dbt Cloud Integration

We are excited to share that Fivetran has announced a new integration with dbt Cloud, currently scheduled for release in early 2024, that allows you to seamlessly trigger, monitor, and observe your dbt transformations from Fivetran. With this, you’ll be able to use best-in-class tooling to move all your data and then trigger your transformations directly after each sync for end-to-end pipeline automation.

To learn more, please visit Fivetran’s announcement page.

Conclusion

Coalesce 2023 introduced a suite of impressive tools and features that have generated considerable excitement for their potential impact on data transformation. We at phData are eager to put these innovations into action.

For organizations looking to make the most of dbt, phData is ready to assist. As dbt’s 2023 Partner of the Year, we have the expertise to ensure your dbt setup is optimized and powerful, driving your organization forward.

Reach out to us to enhance your data capabilities with dbt!
