When it comes to releasing new code, one of the fundamental principles of software engineering is continuous integration testing. There are many frameworks for testing software, but the right way to test the data and SQL scripts that change data are less obvious. This is because databases and the data therein are constantly changing.
Consider the scenario where you create a view in the database using your Development (DEV) environment. You want to use a column in a table that someone else created in DEV and has not moved to Production (PRD) yet.
When you try to create your view in DEV, it will be created and run perfectly, but if you were trying to deploy to PRD, it would fail!
This is a simple example of how SQL that compiles and runs perfectly might fail when trying to migrate it to a higher environment like production.
To truly test the effects of a deployment, you need to have an environment with the exact data that is in Production. That is where Snowflake’s Zero Copy Cloning comes in.
In this blog, we’ll explore how to properly leverage the Zero Copy Clone feature from the Snowflake Data Cloud.
CI/CD Pipeline Patterns
While Continuous Integration/Continuous Deployment (CI/CD) pipelines are extremely common in the world of software engineering, they are often overlooked in data teams. When a team does implement a pipeline for their data stack, it usually includes using multiple environments, pull requests, and a schema management tool.
Successful teams learn quickly that it is important to have different environments for testing new code so production code is not poorly affected by new code. At a minimum, a team should use a Development and a Production environment, and more sophisticated teams can have up to five or six environments depending on their needs.
Below is a list of common environments to consider.
Development Environment (DEV): This is where software development and testing take place. It is a sandbox environment where developers can experiment and test their code without affecting the production environment.
Testing Environment (TST): This is a replica of the production environment, where developers can test their code in a realistic environment before deploying it to production.
Staging Environment (STG): This is a pre-production environment that mimics the production environment as closely as possible. It is used to test the software in a realistic environment with a small group of users before releasing it to the general public.
Production Environment (PRD): This is the live environment where the software is deployed and used by the end users. It is a critical environment that must be stable and secure, as any issues can affect the end users and the business.
A pull request (PR) is a feature in version control systems that allows developers to propose changes to a codebase and get feedback from other team members before merging them into the main branch.
During a PR, additional testing can occur by using a workflow or pipeline that is triggered when a PR is created. The goal is to catch any potential issues with the new code before it is merged into the main/master branch and infects data downstream. Catching issues with SQL specifically can be difficult without deploying it- but more on that later.
More sophisticated teams will conduct security scans on their code or catch formatting errors with a linter (like SQLFluff) during their PR-triggered workflows/pipelines.
What is a Schema Management Tool?
A schema management tool allows teams to house their code in a Git repository and automate the deployment of that code to multiple environments. Many open-source and free tools exist, such as Flyway, Liquibase, schemachange, or DataOps.
These tools allow teams to deploy code in a repeated fashion without having to rely on human intervention. This saves hundreds of hours of time for admins trying to deploy code, not to mention avoiding the countless mistakes that humans make when deploying manually.
While the tools listed above help deploy specific database objects like schemas, tables, views, etc., there are other tools that allow for automating accounts, roles, and permissions.
Two of these tools are Terraform and phData’s own Provision tool.
What is Snowflake’s Zero Copy Cloning?
Snowflake has so many features that make it the leader in the Cloud Data Warehouse market. One of which is the concept of Zero Copy Cloning. Cloning in Snowflake simply means that the data in the clone is not a copy of the original data but simply points back to the original data.
This is extremely helpful due to the fact that you can clone an entire database with terabytes of data in seconds. Changes can then be made to the clone without affecting the original database.
This feature can be used in many ways. Everything from creating your environments (DEV, TST, STG, etc.) to doing specific isolated testing on a single branch to using clones in your CI/CD pipeline (hold tight, we are almost there…).
Putting it All Together
Now that we have seen the power of cloning databases in Snowflake, we can use them in our CI/CD pipelines –specifically, in the Pull Request workflow/pipeline described earlier.
Imagine this: you are able to deploy your code to “production” just to see if everything would succeed. If it fails, you don’t have to roll back! This is what we can accomplish using Snowflake database cloning.
Consider a hypothetical situation where a team has three environments: DEV, TST, and PRD. DEV is a sandbox environment where developers have full access to run any code they want. TST is an environment where code from the main branch is deployed and tested by a QA team before getting deployed to PRD.
Here is a diagram showing our imaginary development cycle:
Let’s dig a little deeper to understand the magic of the PR-triggered workflow/pipeline.
Below is a list of the various items to consider when creating a PR-triggered pipeline for Snowflake.
How to Create a New Pipeline
One assumption is that the CI/CD pipeline is used to deploy to TST and PRD after the PR has already been created. This could be done in a number of tools like Github Actions, Azure DevOps Pipelines, or Jenkins.
A new pipeline or workflow (Github) must be created and enabled to run every time a PR has been created. The pipeline needs two main steps:
1. Run the create clone SQL statement.
2. Run the deployment steps in your schema change management tool.
Create the Database Clone
As mentioned above, cloning a database can take seconds in Snowflake. It is as simple as a one-liner:
CREATE OR REPLACE DATABASE PR_DB CLONE PRD_DB;
This could be run every time the pipeline runs. But there are some considerations/gotchas.
- While data takes almost no time to copy, it does take time to copy the metadata objects. This means that the amount of objects (tables, views, procedures, etc.) will directly impact how long a clone will take. It should also be noted that the clone happens in the Cloud Services warehouse, not in any standard compute warehouses.
- This also assumes that one pipeline will run at a time. If you have a bigger team, you may want to make the name of the database clone more dynamic so it doesn’t overwrite an ongoing test.
Another strategy that offsets some of the gotchas above is to simply clone once a day. This means that changes are not completely isolated, which does add some risk, but might be worth it depending on your situation.
In your pipeline, you may want to run this using SnowSQL or the Python connector.
Run Deploy Step
Use your schema change management tool to deploy to the new clone. This can be done similarly to how you currently deploy to TST and PRD, only pointing to the new DB clone instead.
You did it! Now you are testing your new changes in a production environment without wrecking your real production environment.
To take it one step further, run an automated test!
Run Automated Tests in the New Clone
Testing that your code will run successfully is critical to a smooth deployment to TST/PRD. However, it is only step one. To progress in your journey to awesomeness, you can now begin adding in automated tests. This can be done in many ways, such as using a data quality testing framework like Soda.io or writing your own scripts.
Here are some ideas of simple tests you can run:
- Duplicate Key Tests. One fact you find out quickly after using Snowflake is that Primary Keys are not enforced in Snowflake. So it is important to ensure your new code is not introducing duplicate records. You can even do this with views as well as tables. This can be done by using a configuration file with a table/view name and their PKs and looping through using dynamic SQL.
- Valid View Test. If you alter an underlying object that a view is using, it could break the view. This can be tested by simply running each view and ensuring they run. Tip: use a filter of 0=1, and the query will not use as much computing trying to bring back results. It will simply validate the metadata.
With this PR-triggered pipeline in place, you can automatically test changes BEFORE a teammate even reviews the code in a PR. This reduces the risk of introducing bugs into your codebase. By eliminating bad code, your branch stays clean and stable, making it easier for you and your team to collaborate and maintain the code.
Overall, having a PR-triggered pipeline is a crucial step in modern software development that leads to higher-quality code, faster delivery times, and increased productivity.
At phData, we believe that CI/CD pipelines should be commonplace in every data team. If this is something you’re interested in, contact phData today!