DevOps, as part of an overall DataOps strategy, is crucial to improving consistency, scaling your organization, and ensuring quality for the consumers of your systems. Organizations frequently dedicate entire teams to building a DevOps practice, and for good reason. Companies that implemented a DevOps practice reported:
- 63% experienced improved quality in their software deployments
- 63% released new software more frequently
- 55% noticed improved cooperation and collaboration
- 38% reported higher-quality code production
The concept of DevOps has been around for quite a while in software engineering but has only recently started to find adoption in data engineering, and for good reason.
Companies rely on data to make important decisions across all areas of their business. This means that teams building data products, data pipelines, and analytics must ensure that a high bar is set for quality and consistency.
As part of a data governance strategy, data should be usable, have integrity, be available, and be secure.
DevOps helps ensure that all of these qualities are met.
In this post, we’ll take a look at DevOps with the Snowflake Data Cloud, how you would implement a DevOps practice with Snowflake, and what phData recommends along the way.
Building a DevOps Practice on Snowflake
DevOps as a whole covers many different aspects of the software development life cycle (SDLC). When building a DevOps practice, you’re implementing things like:
- Deployment automation
- Infrastructure as Code
- Quality gates
Let’s drill down into each of these and look at how we can implement them with Snowflake.
Deployment Automation
One of the most common tasks in data engineering is building or adjusting data models, data pipelines, and data products. Historically, many of these changes would be made by a database administrator (DBA) as part of a change management system.
Engineers would have to request that their changes be promoted from lower environments at a particular date and time. Then, once the changes were made, teams would perform a suite of manual tests and report their findings.
If changes need to be reverted, the DBA would have to get involved again to make the appropriate changes. This is a very manual, time-consuming process that is prone to human error.
This is what DevOps aims to fix.
Automating Deployments with dbt
One of the ways we can start to address this is by introducing tools such as dbt. dbt gives us the ability to:
- Define tables, views, and other objects
- Define how to load data into those objects (incremental vs full loads)
- Create reusable components via macros
- Track changes over time via git
- Deploy and review changes via a pull request workflow
- Define pre/post hooks for things like session variables and cleanup tasks
- Define a suite of tests against your data
- Leverage third-party plugins like Great Expectations for data profiling
- Generate documentation and data lineage
By leveraging dbt for deployment automation, we’re now able to audit changes more accurately and quickly in lower environments. We’re also able to programmatically deploy these changes to each environment, test and profile the data to ensure quality, and generate documentation with data lineage for consumers to understand how data is being generated and used.
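dbt models are typically written in SQL templated with Jinja; on Snowflake, dbt (1.3 and later) can also materialize Python models via Snowpark. As a minimal, hypothetical sketch (the model and column names are made up), a model that defines a table from an upstream staging model might look like this:

```python
# models/fct_completed_orders.py  (hypothetical model and column names)
import snowflake.snowpark.functions as F


def model(dbt, session):
    # Materialize the result as a table in Snowflake
    dbt.config(materialized="table")

    # dbt.ref() returns a Snowpark DataFrame for the upstream model
    orders = dbt.ref("stg_orders")

    # Keep only completed orders; dbt writes the result back for us
    return orders.filter(F.col("STATUS") == "COMPLETED")
```

Because models like this live in the dbt project alongside their tests and documentation, every change is tracked in git and reviewed in a pull request before it ever reaches Snowflake.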
Automating dbt Deployments
To further improve our deployment automation, we need to introduce additional tooling to automate the execution of dbt. Teams need the ability to run different steps depending on whether code is introduced in a pull request or merged into a particular branch/environment.
Depending on your organization’s needs and architecture, you’ll generally fall into two camps: executing dbt via a continuous integration service or an orchestration tool.
If you go the continuous integration route, there are a number of services to choose from. Here are some examples:
- GitHub Actions
- Bitbucket Pipelines
- AWS CodeCommit/CodeBuild/CodeDeploy
- Azure DevOps
These tools allow you to define a series of steps that are executed every time a particular condition is met. This could be a manual trigger, changes to git branches, or external triggers.
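Whichever service you choose, the step it executes is often just a thin wrapper around the dbt CLI. As a minimal sketch — assuming a `ci` target is defined in your `profiles.yml` and dbt-core 1.5+ (which exposes a programmatic entry point) — a pull request check might do nothing more than:

```python
# ci_check.py -- hypothetical pull request check
from dbt.cli.main import dbtRunner

# Build and test the project against a dedicated "ci" target so that
# pull request validation never touches production schemas.
result = dbtRunner().invoke(["build", "--target", "ci"])

if not result.success:
    raise SystemExit("dbt build failed; blocking this pull request.")
```

The same script could be run by GitHub Actions, Bitbucket Pipelines, or any of the other services listed above.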
Note that these continuous integration services will only run dbt once per commit to your branch. This will work for views built with dbt, but you will still need to leverage a scheduler to run dbt on a cadence to rebuild tables.
If you decide to purchase dbt Cloud, you can put these steps in the individual dbt job definitions and leverage dbt Cloud as the scheduler to run your dbt workflow on a particular cadence. Tools like Fivetran also have the ability to kick off dbt jobs after an ingestion has finished.
If your organization has more complex dependencies, or you need to ensure that a specific order of events takes place to build your data pipelines, products, and models, you can use an orchestration tool such as:
- Airflow
- Prefect
- Dagster
These tools give you the ability to define a wide variety of tasks with explicit dependencies and ordering.
For example, let’s say you need to ensure that all data is loaded from your data sources into Snowflake before executing your data pipeline. You could define a directed acyclic graph (DAG) with a step for loading each source dataset into Snowflake; once all of those steps complete, execute dbt to update your data models and products; and then perform any additional steps, such as notifying stakeholders that new data is available.
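As a rough sketch of that pattern — assuming Airflow 2.x, with placeholder loader commands and illustrative task and DAG names — the DAG might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: load each source into Snowflake, run dbt once every
# load has succeeded, then notify stakeholders that new data is available.
with DAG(
    dag_id="snowflake_daily_build",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = BashOperator(
        task_id="load_orders",
        bash_command="python load_source.py --source orders",  # placeholder loader
    )
    load_customers = BashOperator(
        task_id="load_customers",
        bash_command="python load_source.py --source customers",  # placeholder loader
    )
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt build --target prod",
    )
    notify_stakeholders = BashOperator(
        task_id="notify_stakeholders",
        bash_command="echo 'New data is available'",  # replace with a real notification
    )

    [load_orders, load_customers] >> run_dbt >> notify_stakeholders
```

The key property is that `run_dbt` cannot start until every load task has succeeded, which is exactly the ordering guarantee a continuous integration service alone does not give you.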
Infrastructure as Code
Thus far, we’ve talked about how to make changes to data products, pipelines, and models. That covers only a small subset of the Snowflake resources needed for a successful implementation. We also need the ability to create and update things like:
- Warehouses
- Users
- Roles
- Grants
- Databases
- Schemas
- Resource Monitors
- Tags
Organizations may have hundreds to thousands of these resources that traditionally would be manually managed by a DBA. This becomes nearly impossible to manage and audit at scale.
Ideally, we’d like to solve this in a similar way to how we solved deployments: changes need to be tracked, tested, and applied programmatically. We’ll need to leverage our continuous integration service, and we’ll need the ability to represent our infrastructure as code that is tracked via git.
We will also need a way to audit our environment to make sure data governance strategies are followed.
phData has built a tool specifically for infrastructure as code on Snowflake called Project Administration, part of the phData Toolkit. This free tool leverages YAML files that define groups and the members of those groups. It also allows you to integrate with LDAP in order to programmatically add members to a group when users are updated in your existing security implementation.
These groups and members are translated into SQL statements that are then incrementally applied to your Snowflake environment. You can also perform a dry run to validate what changes will be made to your environment, and these can be included in pull requests generated in your git repository.
Our Project Administration tool can also integrate with ITSM tools like Jira or ServiceNow to service users requesting access to a particular Snowflake object or creating a new project.
phData has also built a tool, Privilege Audit, to audit your existing Snowflake environment and verify that your governance strategies are being followed.
Another infrastructure as code option is to leverage a database migration tool like Flyway, which applies versioned SQL scripts (tracked in git) to your environment in order.
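To make the pattern concrete, here is a deliberately simplified, hypothetical sketch of that idea using the Snowflake Python connector: versioned SQL files applied in order. Real migration tools also record which scripts have already run, which this sketch omits.

```python
import os
from pathlib import Path

import snowflake.connector  # pip install snowflake-connector-python

# Toy illustration of the versioned-migration pattern: apply git-tracked
# SQL files in order, e.g. V001__create_warehouses.sql, V002__create_roles.sql.
def apply_migrations(migrations_dir: str = "migrations") -> None:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="SYSADMIN",  # illustrative; grant the least privilege you can
    )
    try:
        cursor = conn.cursor()
        for script in sorted(Path(migrations_dir).glob("V*.sql")):
            print(f"Applying {script.name}")
            for statement in script.read_text().split(";"):
                if statement.strip():
                    cursor.execute(statement)  # naive statement split; fine for a sketch
    finally:
        conn.close()


if __name__ == "__main__":
    apply_migrations()
```

Running a script like this from your continuous integration service, against SQL files reviewed in pull requests, gives you the same tracked, tested, programmatic workflow for account-level objects that dbt gives you for data models.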
Quality Gates
When engineers want to introduce new functionality or changes to existing data products, pipelines, and models, it’s important to ensure that a certain level of quality is met. Your business and customers constantly make decisions based on the data they have available to them.
dbt allows us to define a suite of tests against our data and can also be paired with additional third-party libraries to extend its functionality. These libraries add capabilities like data profiling, data privacy checks, and constraint enforcement.
To implement this, we need to define steps in our continuous integration service as part of our deployment strategy. Your pipeline could do something like the following:
- Execute dbt against a test schema in your target Snowflake instance
- Execute your dbt tests, data profiling, and any other checks against the test schema
- Once all checks pass, execute dbt against your target schema
- Re-execute your dbt tests, data profiling, and any other checks against the target schema
This ensures that changes are only applied to the deployment’s target schema once the changes have been validated in a schema that doesn’t impact your business and customers.
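As a minimal sketch of that gate — assuming `ci` and `prod` targets in `profiles.yml` that point at the test and target schemas respectively — the pipeline’s script could be as simple as:

```python
import subprocess
import sys

# Hypothetical quality gate: validate changes in an isolated test schema
# before deploying to the schema your consumers rely on.
STEPS = [
    ["dbt", "run", "--target", "ci"],     # build models in the test schema
    ["dbt", "test", "--target", "ci"],    # tests, profiling hooks, etc.
    ["dbt", "run", "--target", "prod"],   # only reached if the gate passes
    ["dbt", "test", "--target", "prod"],  # re-validate the deployed schema
]


def main() -> int:
    for step in STEPS:
        print("Running:", " ".join(step))
        returncode = subprocess.run(step).returncode
        if returncode != 0:
            print("Check failed; stopping the deployment here.")
            return returncode
    return 0


if __name__ == "__main__":
    sys.exit(main())
```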
What Recommendations Does phData Make?
We’ve covered several different areas of a DevOps practice within Snowflake. In summary, our list of recommendations includes the following:
- Choose a continuous integration service for programmatically applying changes to your Snowflake instance
- Leverage dbt and git to track, test, and apply changes to your Snowflake data models, pipelines, and products
- Utilize phData’s Project Administration tool to provide infrastructure as code, integrate with LDAP, and service ITSM requests
- Leverage phData’s Privilege Audit tool to regularly audit your Snowflake instance
- Use SCIM with your identity platform for provisioning and deprovisioning users in your Snowflake instance
Our sincere hope is that this blog has helped you implement a DevOps practice with your Snowflake account. If your organization needs additional guidance, our Snowflake experts are more than happy to help.
To learn more about how phData can help drive the value of your organization’s data forward in Snowflake, don’t hesitate to reach out!