August 29, 2023

How To Set Up Schemachange for Snowflake CI/CD

By Drew Hansen

At phData, we believe that software engineering best practices should be incorporated throughout all parts of the modern data stack. One of these principles is automated schema migration and deployment. Many tools can help teams migrate SQL code more efficiently, such as Liquibase, Flyway, and schemachange. 

In this blog, we will focus on schemachange, an open-source Python library that is modeled on Flyway but was built specifically for the Snowflake Data Cloud.

Why do we Need a Schema Migration Tool?

How do you deploy your SQL code into Production? Chances are, if you have been around for a minute, you have seen some… interesting… practices when it comes to deploying code. Do you send your scripts in an email to the admin to run in production? Maybe you use a version control system like git but still have an admin manually run code. 

Now think of how many times something broke or failed. It is never the admin's intention, but humans make mistakes. Here are some instances we have seen:

  • The admin used the wrong role to create the objects, and downstream access broke. 

  • The admin only ran part of the script. 

  • The admin ran part of the script twice. 

  • The admin forgot a manual step that needed to happen before/after running a script. 

  • Many others!

When deployments are not automated, the potential for adverse outcomes increases significantly.

What is a Schema Migration Tool?

To use a schema migration tool effectively, it is essential to store all SQL code in a version control system such as git. This enables the creation of automated pipelines that execute those scripts in higher environments. 

A crucial component in this process is the schema migration tool, which automates the execution of code on a target database. Moreover, it often incorporates idempotency features, ensuring that changes can be executed repeatedly while consistently producing the desired outcome in the target environment.
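For instance, idempotency in SQL usually comes from statements that are safe to re-run; the object names below are placeholders for illustration:

-- Creates the table only if it does not already exist; re-running is a no-op.
CREATE TABLE IF NOT EXISTS RAW_DB.SALES.ORDERS
(
  ORDER_ID NUMBER,
  ORDER_DATE DATE
);

-- Rebuilds the view definition each time; re-running is harmless.
CREATE OR REPLACE VIEW MART_DB.MART.VW_ORDERS AS
SELECT ORDER_ID, ORDER_DATE
FROM RAW_DB.SALES.ORDERS;

Running either statement twice leaves the database in the same state as running it once, which is exactly the property a migration tool relies on.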

What is schemachange?

schemachange is an open-source Python library that was written by engineers working for Snowflake. While it is not Snowflake-supported, it does have good community support. Like other similar tools (Liquibase, Flyway), schemachange is an imperative schema migration tool. One differentiator between schemachange and its contemporaries is that it is simpler.

Declarative / State Based

CREATE TABLE COOL_TABLE
(
  COOL_COLUMN_1 STRING,
  COOL_COLUMN_2 NUMBER
);

  • Simpler to understand and follow.
  • Harder to implement.
    • Needs a process to compare the desired state with the current state and determine the changes.

Imperative / Migrations

CREATE TABLE COOL_TABLE
(
  COOL_COLUMN_1 STRING
);
ALTER TABLE COOL_TABLE ADD COOL_COLUMN_2 NUMBER;

  • Harder to understand: all scripts need to be run in order to get back to the correct state.
  • Easier to implement.
    • The process only needs to know whether a script was run or not.

While you can use schemachange to manage your roles and access in Snowflake, we would recommend using our Provision tool, as it provides a way to automate users, roles, and other account objects. Schemachange and other change management tools are great for automating database objects.

Change History Table

As you will see below, schemachange is capable of tracking the state of deployed objects in Snowflake. To do this, schemachange maintains a table in Snowflake that records what has been deployed. To see an example of the table structure, look here.
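As a rough sketch of that structure (column names here follow the schemachange project's documentation, so verify against the version you install; the table name matches the change-history-table setting used later in this post):

CREATE TABLE IF NOT EXISTS UTILITY.SCHEMACHANGE.CHANGE_HISTORY
(
  VERSION        VARCHAR,       -- version number for Versioned files (null for Repeatable/Always)
  DESCRIPTION    VARCHAR,       -- description parsed from the file name
  SCRIPT         VARCHAR,       -- name of the file that was executed
  SCRIPT_TYPE    VARCHAR,       -- V, R, or A
  CHECKSUM       VARCHAR,       -- hash of the file contents, used to detect Repeatable changes
  EXECUTION_TIME NUMBER,        -- how long the script took to run
  STATUS         VARCHAR,       -- e.g. Success
  INSTALLED_BY   VARCHAR,       -- Snowflake user that executed the script
  INSTALLED_ON   TIMESTAMP_LTZ  -- when the script was executed
);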

Key File Types in schemachange

In order to understand how to structure your repository, you will first need to understand the different file types in schemachange. There are three types of change files in schemachange: Repeatable, Always, and Versioned. Each file name is prefixed with the file type's letter (plus a version number for Versioned files), followed by two underscores.

Repeatable R__[description].sql
Always A__[description].sql
Versioned V0.0.1__[description].sql

These files are always run in the same order. 

  1. Versioned files are first.

  2. Repeatable files are second.

  3. Always files are last.

One configuration value in schemachange is the root-folder, the directory in which to search for files. Inside that directory, schemachange will find EVERY file, no matter how it is organized into subfolders.

Repeatable Files

Aptly named, Repeatable files are change files that can be run over and over with no consequence. That does NOT mean they WILL be run on every release (see Always files for that behavior).

One of the core features of schemachange is the ability to compare repeatable files with those objects that are already in the database and only run those that have changed.

This is done by using the change history table maintained by schemachange. Each time a repeatable file is run in the target database, schemachange takes the entire contents of the file and generates a checksum. Each subsequent time, a new checksum is generated and compared to the existing checksum for the given file.

If there is a difference, it will run the entire file again. If the checksum is identical, the file will be skipped. This is key to keeping deployment run times down and allows for a cleaner repo.

The key way to identify if an object/file should be repeatable is if it can be run over and over. For example, if you ran CREATE OR REPLACE VIEW … repeatedly, it would be fine. If you ran CREATE OR REPLACE TABLE … when it had data in it, it would remove all the data in the table – not good!

*Note: schemachange always runs repeatable files that have differences in alphabetical order. This can cause issues if repeatable scripts have dependencies on other repeatable scripts. See the Views that are dependent on each other section below for more info.

Below is a list of objects that could be created as repeatable files in Snowflake:

  • Views

  • Procedures

  • Functions

  • Warehouses
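As a sketch, a repeatable file for one of these objects might look like the following (the file and object names are illustrative, borrowing the MART_DB layout from the repository structure shown later):

-- R__MART_DB.REPORTING.VW_SALES_SUMMARY.sql
CREATE OR REPLACE VIEW MART_DB.REPORTING.VW_SALES_SUMMARY AS
SELECT ORDER_DATE,
       SUM(AMOUNT) AS TOTAL_AMOUNT
FROM MART_DB.MART.ORDERS
GROUP BY ORDER_DATE;

Because the file always produces the same view definition, it can be run any number of times, which is exactly what makes it repeatable.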

Always Files

Always files are applied on every schemachange run, whether or not anything has changed. This is not particularly helpful when it comes to actual database objects in Snowflake, but there are some use cases where Always files shine. 

Below is a list of possibilities:

  • Access/security scripts that need to run after objects are created. For example, a procedure could be created that checks if there are any missing/broken permissions. After a table is created, it can run and give all the proper permissions (although, if you want to do this, make sure you can’t use future grants). 

  • After every release, the prod environment needs to be cloned down to the test environment.

  • There is a logging script that inserts a row into a table each time a deployment happens (different from the schemachange change history table). A sketch of this idea follows this list.
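A minimal sketch of that logging script, with a hypothetical audit table:

-- A__log_deployment.sql
INSERT INTO UTILITY.AUDIT.DEPLOYMENT_LOG (DEPLOYED_AT, DEPLOYED_BY)
SELECT CURRENT_TIMESTAMP(), CURRENT_USER();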

Versioned

Versioned files are change files that allow for incremental changes to a database. Schemachange keeps track of the most recent version in the change history table and will run any files that are newer than that version in the target database.

Each version can exist only once in the repo. There is no limit to how many statements can be in a single file, as long as they are separated by semicolons.

One common misconception is that each object has its own version. For example, let’s say you have two tables in two different files in your repository – TABLE_A and TABLE_B. These will need different versions, even though they may be created at the same time. This is because versions are shared across the entire project.

Incorrect:
V0.0.1__TABLE_A.sql
V0.0.1__TABLE_B.sql

Correct:
V0.0.1__TABLE_A.sql
V0.0.2__TABLE_B.sql
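For illustration, a single versioned file can hold several semicolon-separated statements; the hypothetical V0.0.1__TABLE_A.sql above might contain:

CREATE TABLE STAGING_DB.DATASOURCE_1.TABLE_A
(
  ID NUMBER,
  LOADED_AT TIMESTAMP_LTZ
);

-- A second statement in the same versioned file.
INSERT INTO STAGING_DB.DATASOURCE_1.TABLE_A (ID, LOADED_AT)
SELECT 1, CURRENT_TIMESTAMP();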

schemachange is flexible and offers many different version options such as:

  • 1.1
  • 1_1
  • 1.2.3
  • 1_2_3

Structuring Your Repository

Because schemachange will find all files within the root-folder directory, much of the structuring of files is up to you. This is both good and bad, as you have the flexibility to fit your needs, but you can also set yourself up for some strife down the road. Below is a list of best practices that we have seen work well.

Separate by File Type

At the root directory, we would recommend creating three folders, one for each type of file. From there, each file type has different uses, so different structures should be used. 

  • Always – Because Always files generally do not define database objects, this folder can be organized however you need. If you only have a few scripts here, there may be no need for further organization. 

  • Repeatable – These should be organized by database > schema > object type.

  • Versioned – These could be organized into folders, but based on experience, this creates a mess and makes it hard to find what version you are on for the project. The recommendation is to leave this flat and archive it into a separate folder when it gets too large.

				
schemachange
└── scripts
    ├── always
    ├── repeatable
    │   ├── STAGING_DB
    │   │   ├── DATASOURCE_1
    │   │   │   ├── procedures
    │   │   │   │   └── R__PROC_DO_A_THING.sql
    │   │   │   ├── tasks
    │   │   │   └── views
    │   │   │       └── R__VW_DEDUPLICATE_JSON.sql
    │   │   └── DATASOURCE_2
    │   │       ├── procedures
    │   │       ├── tasks
    │   │       └── views
    │   └── MART_DB
    │       ├── MART
    │       │   ├── procedures
    │       │   ├── tasks
    │       │   └── views
    │       └── REPORTING
    │           ├── procedures
    │           ├── tasks
    │           └── views
    └── versioned
        ├── V0.1__TABLE_A.sql
        └── V0.2__insert_records_into_A.sql

Schemachange Configurations

In case it isn’t obvious yet, schemachange was built for flexibility. To accommodate many different teams, schemachange offers many different configurations that can be set/adjusted based on your needs. For a complete list, look here. 

Another benefit of automating into a pipeline is that we can set different configuration values based on the environment we are targeting. This allows us to use different users/roles for different environments. 

Config values can be set either in a config file or directly in the CLI command. In the below demo, we will do both.
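As a minimal sketch of both options (the CLI flag names mirror the config keys shown later in this post, but treat the exact spellings as assumptions and confirm them with schemachange --help):

# Option 1: set values in a schemachange-config.yml file
snowflake-role: 'TST_SCHEMACHANGE'
verbose: true

# Option 2: pass the same values directly on the CLI
schemachange --snowflake-role TST_SCHEMACHANGE --verbose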

Integrating With your CI/CD Pipelines

Once you understand how schemachange works with your SQL scripts, you can then move on to automating it. This is where the magic happens. 

For this example, we will use Azure Pipelines and Azure Repos, but the approach should be similar for GitHub Actions and other tools.

High-Level Architecture

Below is a list of assumptions for our schemachange demo. This is a great starting point for small teams not as familiar with automated deployments.
  • 3 environments

    • DEV, TST, PRD

      • DEV will be maintained by the developers and will have full access.

      • TST and PRD will be maintained only by schemachange, and developers will NOT have any access other than read-only.

  • The developer will make their changes to DEV manually and commit their changes to a branch in their Snowflake repo in Azure Repos. A Pull Request (PR) will be created and approved by the team.

  • Once the PR has been approved and completed, a CI/CD pipeline will be triggered, and schemachange will run against TST. If that goes well, the same changes will be deployed to PRD.

  • A dry run will be used to test the deployment (more detail below).

[Diagram: high-level architecture of the schemachange CI/CD pipeline, including the dry run steps]

Dry Runs

One great feature for previewing what your deployment will do is the dry run. A dry run in schemachange allows you to see which files will be run during the deployment. Do note that NOTHING is actually executed in Snowflake; it only shows what would be run during an actual run.

We will implement this feature to give teams an opportunity to review which files will be run before running them. This is a nice guardrail that keeps a bad commit from turning into a bad deployment.

For example, it is very possible to add an incorrect or old version to a versioned file, or perhaps you had a typo in the prefix of a repeatable file. Dry runs will help you catch that.

Dry runs are enabled either by setting dry-run: true in the config file or by passing the --dry-run flag in the CLI command.
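For example, a developer with credentials for a non-production environment could preview a deployment locally before the pipeline ever runs; a sketch, assuming the repository layout used later in this post and that the password is supplied via the SNOWFLAKE_PASSWORD environment variable:

pip install schemachange
export SNOWFLAKE_PASSWORD='********'
schemachange --config-folder cicd/schemachange_config/tst --dry-run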

Look above at the architecture diagram and see how we plan to integrate the dry run into the process.

Setup

Assumptions for Running Schemachange

As you can see in the list below, not much is needed to set up schemachange. It is assumed that your team knows how to use git and pull requests, as they are a key part of the CI/CD pipeline described later.

  1. A git provider (Github, Azure Repos, Bitbucket, etc.)

  2. A pipeline tool (Github Actions, Azure Pipelines, etc.)

  3. A Snowflake database/schema where the schemachange change history table will live.

  4. The schemachange change history table.

    1. This can be created manually, or schemachange can be configured to create it automatically on its first run (via the create-change-history-table setting).

  5. A Snowflake service account that has a role with all of the permissions needed to create the objects you want to deploy and has access to read/write in the schemachange change history table.

    1. It is recommended that you create a user per environment to keep clear boundaries between environments (a sketch of the Snowflake setup follows this list).
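A minimal sketch of that setup for the PRD environment, using the user/role names that appear in the config file below; the exact grants your role needs will vary by project, so treat this purely as an illustration:

-- Run as a sufficiently privileged admin role.
CREATE ROLE IF NOT EXISTS PRD_SCHEMACHANGE;

CREATE USER IF NOT EXISTS SVC_PRD_SCHEMACHANGE
  DEFAULT_ROLE = PRD_SCHEMACHANGE
  DEFAULT_WAREHOUSE = COMPUTE_WH;

GRANT ROLE PRD_SCHEMACHANGE TO USER SVC_PRD_SCHEMACHANGE;

-- Access to run the warehouse and to read/write the change history table.
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE PRD_SCHEMACHANGE;
GRANT USAGE ON DATABASE UTILITY TO ROLE PRD_SCHEMACHANGE;
GRANT USAGE ON SCHEMA UTILITY.SCHEMACHANGE TO ROLE PRD_SCHEMACHANGE;
GRANT SELECT, INSERT ON TABLE UTILITY.SCHEMACHANGE.CHANGE_HISTORY TO ROLE PRD_SCHEMACHANGE;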

In our repository, we can create a new directory called cicd, which will host all the scripts needed for Azure Pipelines as well as our config files.

				
schemachange
├── cicd
└── scripts
    ├── always
    ├── repeatable
    └── versioned

Schemachange Configuration

Inside the cicd directory, we are going to create one config folder for each target environment that we will be hitting with schemachange, each containing a schemachange-config.yml file. Remember that this will only be TST and PRD because DEV will be maintained by the developers.

				
schemachange
├── cicd
│   └── schemachange_config
│       ├── tst
│       │   └── schemachange-config.yml
│       └── prd
│           └── schemachange-config.yml
└── scripts
    ├── always
    ├── repeatable
    └── versioned

The contents of each file will be the following. The TST and PRD schemachange-config.yml files will be very similar, except that they should reflect the proper users/roles/databases for their environment.


#same for each env
config-version: 1
root-folder: 'scripts'
modules-folder: null
snowflake-account: 'XXXXX' # for example: XXXXX.west-us.azure
snowflake-warehouse: 'COMPUTE_WH'
create-change-history-table: false
autocommit: false
verbose: true


#changed within each env
snowflake-user: 'SVC_PRD_SCHEMACHANGE'
snowflake-role: 'PRD_SCHEMACHANGE'
change-history-table: 'UTILITY.SCHEMACHANGE.CHANGE_HISTORY'

Azure Pipelines

Since this is a tutorial for schemachange and not Azure Pipelines, we will not go into the details of the Azure Pipelines setup but will give some sample code and call out some specifics.

At a high level, the pipeline is triggered by a commit on the main branch (when the PR is merged in). The pipeline consists of 2 stages, with 3 jobs in each stage (see the architecture diagram above).

Here is an example of the azure-pipelines.yaml file that can be used to run schemachange.

azure-pipelines.yaml

trigger:
  # runs anytime there is a commit on main
  - main

stages:
   #### tst ####
   - stage: deploy_tst
     variables:
       - group: schemachange_variables
       - name: env
         value: "tst"
     jobs:
       # run dry run
       - job: dry_run_tst
         pool:
           vmImage: 'ubuntu-latest'
         steps:
           - template: templates/deploy-template.yaml
             parameters:
               snowflake_password: $(snowflake_pass)
               dry_run: '--dry-run' #the dry run parameter is passed to the cli command in the deploy-template.
       # wait for manual validation of dry run in previous step.
       - job: dry_run_validation_tst
         dependsOn: dry_run_tst
         pool: server
         steps:
           - template: templates/wait-template.yaml
       # run deployment
       - deployment: deploy_tst
         dependsOn: dry_run_validation_tst
         pool:
           vmImage: 'ubuntu-latest'
         environment: "TST"
         strategy:
           runOnce:
             deploy:
               steps:
                 - template: templates/deploy-template.yaml
                   parameters:
                     snowflake_password: $(snowflake_pass)
                     dry_run: '' #the dry run parameter which is blank is passed to the cli command in the deploy-template.
    #### prd ####
   - stage: deploy_prd
     variables:
       - group: schemachange_variables
       - name: env
         value: "prd"
     jobs:
       # run dry run
       - job: dry_run_prd
         pool:
           vmImage: 'ubuntu-latest'
         steps:
           - template: templates/deploy-template.yaml
             parameters:
               snowflake_password: $(snowflake_pass)
               dry_run: '--dry-run'
       # wait for manual validation of dry run in previous step.
       - job: dry_run_validation_prd
         dependsOn: dry_run_prd
         pool: server
         steps:
           - template: templates/wait-template.yaml
       # run deployment
       - deployment: deploy_prd
         dependsOn: dry_run_validation_prd
         pool:
           vmImage: 'ubuntu-latest'
         environment: "PRD"
         strategy:
           runOnce:
             deploy:
               steps:
                 - template: templates/deploy-template.yaml
                   parameters:
                     snowflake_password: $(snowflake_pass)
                     dry_run: ''

As you can see, I am making use of template files to abstract some logic (keeping things DRY).

  • The deploy template runs both the dry run and the actual deployment for both TST and PRD.

  • The wait template contains the step that waits for a user to approve the run. The intention is that the user reviews the dry-run output and, if all looks well, resumes the deployment. This then moves on to the final deployment in the stage (without the dry-run param set).

Variable groups are used to hold variables such as user/pass and other secrets that are referenced in the pipeline.

Here are the template files:

deploy-template.yaml

parameters:
- name: 'snowflake_password'
  default: ''
  type: string
- name: 'dry_run'
  default: ''
  type: string


steps:
   # choose your python version here.
 - task: UsePythonVersion@0
   displayName: 'Use Python 3.x'
   inputs:
     versionSpec: 3.x


 - script: pip install schemachange
   displayName: install schemachange
  
 - script: schemachange --config-folder $(System.DefaultWorkingDirectory)/cicd/schemachange_config/$(env) ${{ parameters.dry_run }}
   env:
     SNOWFLAKE_PASSWORD: ${{ parameters.snowflake_password }}
   displayName: Run Deployment ${{ parameters.dry_run }}
   workingDirectory: $(Build.Repository.LocalPath)

wait-template.yaml

steps:
 - task: ManualValidation@0
   timeoutInMinutes: 60 # This is a limit of the serverless task
   inputs:
     instructions: 'Please validate the output from the "Run Deployment --dry-run" step and resume'
     onTimeout: 'reject'

CLI Command

Notice in the deploy-template that there is one schemachange CLI command that is driving the entire process.

Let’s break apart this statement:

				
schemachange --config-folder $(System.DefaultWorkingDirectory)/cicd/schemachange_config/$(env) ${{ parameters.dry_run }}

To call the CLI from a script, you simply call schemachange and pass any parameters along with it. In our case, we are passing two parameters: config-folder and dry-run.

The config-folder dynamically references either the TST or the PRD config folder based on the $(env) variable, which is passed down from the azure-pipelines.yaml parent file. The env variable is defined as a stage-level variable in that file, alongside the schemachange_variables variable group that holds secrets such as the Snowflake password.

The dry_run parameter will either be blank or append the text '--dry-run' to the command. This is important because it is not a True/False parameter; the flag should only appear when a dry run is desired. This is why the parent file (azure-pipelines.yaml) passes either '--dry-run' or '' to the deploy-template file.

Jinja and Variables

One feature of schemachange that we have not yet discussed is the use of Jinja templating in our SQL files. This is fully supported and allows us to do a lot of things that would otherwise be impossible with SQL alone.

Another feature is the use of variables in our SQL code. This is as easy as adding variables to the config file or passing them directly in the CLI command.

In our case, let's assume that we have two types of databases in each environment: STAGING and MART.

cicd/schemachange_config/tst/schemachange-config.yml

vars:
 stg_db: "TST_STAGING"
 mart_db: "TST_MART"

cicd/schemachange_config/prd/schemachange-config.yml

vars:
 stg_db: "PRD_STAGING"
 mart_db: "PRD_MART"

This allows us to use one SQL file but reference the proper environment’s database.
For example:

				
CREATE OR REPLACE VIEW {{ mart_db }}.MART.VW_SALES AS
SELECT *
FROM {{ stg_db }}.SOURCE.SALES;

Notice how both the mart_db and stg_db are references to variables. This is critical to maintaining one file that can be deployed into multiple environments.

When the file runs in TST, it will look like the following, with the values from the config file substituted into the script that is run in Snowflake.

				
CREATE OR REPLACE VIEW TST_MART.MART.VW_SALES AS
SELECT *
FROM TST_STAGING.SOURCE.SALES;

Pitfalls to Avoid

Historical Loads

One situation to consider is if you already have objects created in Snowflake and you want to backfill your repository and schemachange with the existing objects. 

This presents a problem because schemachange will see all of the new files and try to run them in Snowflake even though they already exist. 

To resolve this issue, you will need to do the following:

  • Versioned files

    • Insert a record manually into the change history table with a version equal to or higher than the highest version currently in the repo. This will cause schemachange to skip all of the existing versioned files, because schemachange simply looks for the highest version as the current watermark (a sketch follows this list). 

  • Repeatable files

    • Similar to versioned files, you will need to insert records into the change history table. The difficult part is that you also need to put the correct checksum value in the table so that schemachange doesn't see a difference the first time it compares. This can be done by digging into the schemachange source code and running your files through its checksum logic, or you could run each repeatable file against an empty database and copy the resulting change history records into your target change history table.
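As a sketch of the versioned-file watermark insert (the column names are assumed to match the change history table structure sketched earlier; adjust them to your actual table, and note the file name here is hypothetical):

-- Assume the newest versioned file already applied by hand is V1.0.0__baseline.sql.
INSERT INTO UTILITY.SCHEMACHANGE.CHANGE_HISTORY
  (VERSION, DESCRIPTION, SCRIPT, SCRIPT_TYPE, CHECKSUM, EXECUTION_TIME, STATUS, INSTALLED_BY, INSTALLED_ON)
SELECT '1.0.0', 'baseline', 'V1.0.0__baseline.sql', 'V', NULL, 0, 'Success', CURRENT_USER(), CURRENT_TIMESTAMP();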

Views That Are Dependent on Each Other

Another issue might arise when a view depends on another view: schemachange is not able to work out dependencies and run views in the order they are needed. This means it is on the development team to split such views into multiple commits/deployments.

For example, consider the following 2 views:

				
					--vw_bananas.sql
CREATE OR REPLACE VIEW VW_BANANAS AS
SELECT *
FROM TABLE_A;

--vw_apples.sql
CREATE OR REPLACE VIEW VW_APPLES AS
SELECT *
FROM VW_BANANAS;
				
			

Notice that the second view, VW_APPLES, references the first view, VW_BANANAS. If these views were added in the same commit, schemachange would run VW_APPLES first (schemachange runs repeatable files in alphabetical order) and then VW_BANANAS. VW_APPLES would fail because VW_BANANAS has not been created yet.
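One possible mitigation, beyond splitting the views into separate commits, is to name repeatable files so that alphabetical order matches dependency order. This is purely a naming convention, not a schemachange feature; for example:

R__010_vw_bananas.sql  (sorts first, so VW_BANANAS is created first)
R__020_vw_apples.sql   (sorts second, so it can safely reference VW_BANANAS)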

Conclusion

In conclusion, incorporating automated schema migration and deployment practices into your data stack is crucial for ensuring efficient and error-free deployments. With tools like schemachange, you can streamline the execution of SQL code in your target database.

By leveraging schemachange, along with version control systems and CI/CD pipelines, you can automate the deployment process, reduce the potential for errors, and enhance the overall reliability of your data infrastructure. Embracing these software engineering best practices will undoubtedly contribute to a smoother and more efficient data deployment experience for your team.

At phData, we believe that CI/CD pipelines should be commonplace in every data team. If this is something you’re interested in, contact phData today!
