At phData, we’re starting to see a sharp increase in clients who are looking to migrate to dbt Cloud from dbt Core. This is due to a number of reasons, which we covered in a previous blog post, but in short, it’s primarily due to a need to accelerate platform adoption outside of central IT teams.
dbt Cloud is a SaaS product that offers central IT teams a more efficient method to speed up user and project onboarding, without requiring them to expand their infrastructure or teach end-users complex concepts such as source code management and git workflows.
When used with the Snowflake Data Cloud, dbt Cloud is the perfect match for our customers looking to scale and grow platform adoption. With Snowflake, customers are no longer required to plan and provide additional infrastructure for each workload to accommodate extra capacity.
This blog will walk you through the steps to successfully execute a dbt Core to dbt Cloud migration. Before we jump in, it’s important to note that the overall migration process can be broken down into the following functional areas:
- Team and project alignment
- User authentication
- Group-based access
- Source code configuration
- Data platform setup
- Job execution
A dbt project is where all collaboration in relation to transforming and materializing data into business value happens. Migrating to dbt Cloud from Core is a prime opportunity to rethink strategies around project setup and alignment.
Generally, the best practice is “less is more.”
Our recommendation is to minimize the number of dbt projects to create better collaboration. The benefit of having a smaller number of larger projects is you’ll unlock a complete view of model lineage and have richer documentation across functional areas. These projects should include all functional areas within the data platform including analytics engineering, machine learning, and data science.
Data governance and data classification are potential reasons to separate projects in dbt Cloud. For strictly confidential datasets or ones that contain PII or PHI, it is recommended to create separate projects for the source and staging layers of model creation. Security policies can then be applied to the published models. The result is then included as private packages in the larger dbt projects.
Finally, the size of the team and the number of models maintained in a single project may necessitate the need to create separate projects. When managing a large number of models, the best practices for pull requests and code reviews may involve too many people to be efficient and effective.
The first step to configuring enterprise security in dbt Cloud is to configure SSO using SAML 2.0 (ex: OKTA, Azure AD, Google, etc). Once configured within dbt, the setup will enable just-in-time provisioning of users, and access is provided via group memberships.
After SSO is enabled and project alignment has been determined, it is now time to secure who has access to them and with what role. Permissions within dbt Cloud can be thought of in two groups, Account privileges, and Project privileges.
Account Privileges Include:
- Account Admin
- Project Creator
- Account Viewer
Project Privileges Include:
- Git Admin
- Database Admin
- Team Admin
- Job Admin
- Job Viewer
It is recommended to map these permission sets to Active Directory groups. Doing so will enable self-service user onboarding that follows standard ITSM processes. Depending on the individual’s role in your organization, they will be granted both Account and Project privileges, however, Account privileges should be limited with the principle of least privilege implemented.
Source Code Configuration
Now that the project is created and user access has been configured, it is time to connect the existing Core source code repository to dbt Cloud. Depending on the source code management solution, this often requires setting up OAuth connections.
phData recommends that the same AD groups used for dbt project level privileges are used when configuring who has access to the code repository. Finally, the main branch should be protected and no commits should be allowed without a fully approved pull request.
When a developer is tasked with creating a feature or enhancement to existing dbt models, they must have a workspace to iterate and build these changes. This work needs to be isolated from work already published to the main branch and is often done in a developer-specific schema.
phData recommends using OAuth over usernames and passwords when integrating with the Snowflake. This allows the user to use their SSO credentials and pass them through when operating on Snowflake. A Snowflake ACCOUNTADMIN must first create a security integration in Snowflake before developers can use OAuth.
CREATE OR REPLACE SECURITY INTEGRATION DBT_CLOUD TYPE = OAUTH ENABLED = TRUE OAUTH_CLIENT = CUSTOM OAUTH_CLIENT_TYPE = 'CONFIDENTIAL' OAUTH_REDIRECT_URI = 'https://cloud.getdbt.com/complete/snowflake' OAUTH_ISSUE_REFRESH_TOKENS = TRUE OAUTH_REFRESH_TOKEN_VALIDITY = 7776000;
with integration_secrets as ( select parse_json(system$show_oauth_client_secrets('DBT_CLOUD')) as secrets ) select secrets:"OAUTH_CLIENT_ID"::string as client_id, secrets:"OAUTH_CLIENT_SECRET"::string as client_secret from integration_secrets;
Finally, these values are added to the dbt Cloud account and developers are then able to use OAuth between dbt Cloud and Snowflake.
Next, a service account must be created in Snowflake to execute scheduled jobs that have been merged into the main branch. Often the Service Account is created directly in Snowflake as a managed user. phData recommends that service accounts are authenticated using private key pairs (not usernames and passwords).
Often this user would have been configured in a Core environment already and can be reused in the Cloud configurations. The Service Account should be granted a functional role that has access to a Snowflake Virtual Warehouse and a data access role to operate on a database or a collection of databases.
Before we can begin working through scheduling, we need to take the already existing or recently created Service Account and configure deployment environments that match the environments that exist within a profile used for dbt Core.
A review can be done of the naming and structure of these profiles to make sure the new dbt environments align with your deployment and testing strategy. This is also an opportunity to look at what impact you might have on your dbt Core by renaming or reworking your environment configuration. This is especially relevant if you are overloading any of the existing dbt functionality to customize the database name, schema name, or even table names.
dbt Cloud has built-in customizable Job scheduling for transforming and materializing dbt models. Refactoring dbt Core job schedules to dbt Cloud is often the most complex part of the migration process. Core customers may be using a wide variety of orchestration tools to execute cloud models such as Apache Airflow, Control M, Azure Data Factory, or Perfect (just to name a few).
While these tools can be used with dbt Cloud and orchestrated with calls into the Rest API, customers may take this time to simplify how data gets materialized with dbt Cloud. When no additional external dependencies are needed as part of the materialization process, dbt job scheduling offers an easy-to-use interface for scheduling and monitoring job execution.
Often jobs would be scheduled as part of a data ingestion activity. Due to this, phData recommends configuring Fivetran and dbt together to run as data is loaded into Snowflake. Additionally, monitoring is simplified by utilizing notifications in dbt Cloud, which allows the right users to be alerted when a job is completed, canceled, or failed.
By utilizing webhooks to trigger CI jobs and the API to initiate jobs, dbt Cloud has the potential to significantly reduce operating expenses.
Need Help Moving to dbt Cloud?
Looking to add dbt Cloud to your modern data stack? phData can help! With a prescriptive approach to migrating Core projects, our team can get you migrated with minimal downtime and help educate developers about dbt Cloud to keep them efficient and effective.