May 31, 2023

How to Set Up a CI/CD Pipeline for Snowflake Glue Projects

By Deepa Ganiger

A wide variety of tools is available to implement CI/CD (Continuous Integration/Delivery/Deployment) processes in an organization. Some of the most widely used include GitHub Actions, Jenkins, and Azure DevOps.

This blog post will explore the use of GitHub Actions (GHA) for CI/CD pipeline setup. Among its many available integrations, GHA includes Python packaging and testing, AWS CloudFormation actions, and Sonar. Of these, the AWS CloudFormation actions are particularly powerful, as they allow for the deployment of multiple AWS services using familiar CloudFormation templates.

These templates enable your organization not only to build, test, and deploy application code but also to deploy all necessary AWS services using “Infrastructure as Code.”

If needed, revisit our earlier blog post on Glue and the Snowflake Data Cloud integration before we dive into how to deploy the code, including AWS services, and configure GitHub Actions to perform CI/CD.

Lastly, before we discuss the actual process, let’s take a quick look at what CI/CD is and why it matters.

What Is CI/CD?

Continuous Integration and Continuous Delivery or Deployment (CI/CD) are software development practices in which CI automatically builds and tests code as changes are integrated, and CD automates the delivery or deployment of that code. The CI/CD process ensures that the code is always validated, tested, and ready for deployment.

What Are The Benefits of CI/CD?

  • Easier bug fixes
  • Reduced project risk
  • Improved development quality
  • Higher productivity
  • Reduced lag between coding and customer value

Things To Consider Before Getting Started

At a high level, the following items should be completed before getting started:

  • Configure AWS account(s) to integrate with GitHub for automated deployment.
    This can be done in one of the following ways:
    • IAM user access key ID/secret access key
    • OpenID Connect (OIDC)
  • Continuous Integration – Use GitHub Actions to build the code and run tests.
  • Continuous Deployment – Use GitHub Actions to deploy code on a specific trigger, such as a pull request or a push into a specific branch, using a YAML workflow file. For multiple environments, different YAML files can be created.

We will be using OIDC-based deployment in this blog. Additionally, we’ll reuse the CloudFormation file created in the previous blog to deploy Glue services.

A diagram that highlights the OIDC-based deployment.

Configuring An AWS Account To Connect To GitHub

One of the most common ways to connect GitHub to an AWS account is to use an IAM user. This requires creating a dedicated IAM user with an access key and then storing that key in GitHub secrets. The drawbacks of this approach are that long-lived credentials must be stored in GitHub and must be rotated manually whenever the key changes.

Instead of this approach, we will use OIDC, which lets the GHA workflow request short-lived ID tokens to authenticate to AWS and deploy resources, with no long-lived credentials stored in GitHub.

What follows are the detailed steps to set up OIDC in the AWS account. This is a one-time setup for each AWS account in the organization. It can also be automated using an AWS CloudFormation template for deployment into different AWS accounts.

Step 1: Create OIDC Identity Provider
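As a rough illustration of what this step creates, the sketch below builds a CloudFormation template for the identity provider as a Python dictionary and prints it as JSON. The issuer URL and the sts.amazonaws.com audience are the values GitHub and AWS document for this integration; the logical ID GitHubOidcProvider and the function name are hypothetical. Depending on your account setup, a ThumbprintList property may also be needed; it is omitted here.

```python
import json

# GitHub's OIDC issuer URL and the audience used when federating with AWS STS.
GITHUB_OIDC_URL = "https://token.actions.githubusercontent.com"
STS_AUDIENCE = "sts.amazonaws.com"


def build_oidc_provider_template() -> dict:
    """Return a CloudFormation template (as a dict) declaring the GitHub OIDC provider."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "GitHub Actions OIDC identity provider (illustrative sketch)",
        "Resources": {
            # "GitHubOidcProvider" is a hypothetical logical ID chosen for this example.
            "GitHubOidcProvider": {
                "Type": "AWS::IAM::OIDCProvider",
                "Properties": {
                    "Url": GITHUB_OIDC_URL,
                    "ClientIdList": [STS_AUDIENCE],
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the template so it can be saved to a file and deployed per account.
    print(json.dumps(build_oidc_provider_template(), indent=2))
```

Because CloudFormation accepts JSON templates, the printed output could be saved and deployed once per AWS account, which matches the one-time setup described above.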

Step 2: Create IAM Role

A screenshot that says, "select trusted entity"

Step 3: Attach Policies to the Role

The policies will depend on the type of services deployed using GHA. In this case, we will create four policies, as shown below.

Step 4: Configure Git Repo and Federated Principal within Trust Relationship

A menu that has "Trust relationship" highlighted with several lines of code underneath
A popup menu with "Github Actions" selected.
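The trust relationship is what ties the IAM role to a specific GitHub repository. A minimal sketch of such a trust policy is shown below, generated in Python so the account and repository can be passed as parameters. The account ID, organization, and repository names are placeholders, and the function name is hypothetical; the exact policy in the screenshot above may differ.

```python
import json

# Host name of GitHub's OIDC issuer, used in the provider ARN and condition keys.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, org: str, repo: str) -> dict:
    """Return an IAM trust policy letting workflows in org/repo assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Federated principal: the OIDC provider created in Step 1.
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The token must be issued for AWS STS.
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # Only workflows from this repository (any branch) may assume the role.
                    "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:*"},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own account ID and repository.
    print(json.dumps(build_trust_policy("123456789012", "example-org", "example-repo"), indent=2))
```

The wildcard at the end of the sub condition allows any branch or event in the repository. Narrowing it, for example to repo:example-org/example-repo:ref:refs/heads/master, would restrict the role to workflows running on a single branch.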

Continuous Integration

Continuous Integration is a key component in DevOps that streamlines the overall process of building and merging code developed by multiple developers in a team. GHA can be configured to automatically build and test the code every time a change is made for a specific scenario, such as a pull request or a push into a specific branch.

Continuous Integration has several significant advantages, including the ability for multiple developers to work on various project features and easily merge them into a single branch (develop).

Additionally, it allows for the implementation of approval processes (Project/Product) for pushing changes into test and production (master) branches. A pipeline in GHA is called a workflow, and it can also be configured to run automated tests (e.g., pytest or coverage run) and even check code quality using tools like Sonar.

Several lines of code
Several lines of code
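To make the testing step concrete, here is a minimal sketch of the kind of unit test a CI workflow could run with pytest. The helper normalize_column_names is a hypothetical example of logic a Glue job script might contain; it is not taken from the project in this post.

```python
def normalize_column_names(columns):
    """Lower-case column names, trim whitespace, and replace spaces with underscores."""
    return [name.strip().lower().replace(" ", "_") for name in columns]


def test_normalize_column_names():
    # pytest collects functions whose names start with "test_".
    columns = [" Order ID", "Customer Name "]
    assert normalize_column_names(columns) == ["order_id", "customer_name"]
```

Saved in a file such as test_transform.py (a hypothetical name), this test would be picked up by running pytest in the workflow, and a failing assertion would fail the build before any merge or deployment happens.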

Continuous Deployment

Continuous Deployment is an extension of the Continuous Integration process where code changes are deployed automatically in Test/Production environments upon approval. GHA pipelines allow the deployment of both code and AWS services automatically. 

The workflow file can be configured separately for code and AWS services deployment. The biggest benefit is that the overall automation of this process eliminates the lag between coding and customer value – reducing the manual effort and time by weeks or even months.

In bigger organizations where hundreds or even thousands of developers work in parallel, there should be a streamlined process for testing, code quality, and deployment. A DevOps automation process allows organizations to put such processes in place to handle large production changes in an automated fashion, avoiding manual errors and risks. It also allows multiple checks and balances to be built into pipelines, which further reduces risk.

The code snippet below is the continuation of the YAML file for the deployment process. This section does the following:

  • Check out the code from GitHub using “actions/checkout@v3.”
  • Configure AWS credentials using OIDC.
  • Copy the checked-out code into the S3 bucket. Glue jobs refer to S3 buckets for Python code and libraries.
  • Finally, deploy the Glue CloudFormation template along with other AWS services.

Several lines of code
Several lines of code
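The last two steps, copying code to S3 and deploying the CloudFormation template, can also be expressed as plain AWS CLI commands. The sketch below builds those commands in Python rather than using the GHA CloudFormation actions from the workflow, so it is an alternative illustration and not the workflow itself. The bucket, prefix, stack, template, and parameter names are all placeholders, and whether CAPABILITY_NAMED_IAM is required depends on what the template creates.

```python
import shlex
from typing import Dict, List


def s3_sync_command(source_dir: str, bucket: str, prefix: str) -> List[str]:
    """Build the AWS CLI command that copies job scripts into the S3 bucket."""
    return ["aws", "s3", "sync", source_dir, f"s3://{bucket}/{prefix}"]


def cloudformation_deploy_command(
    template_file: str, stack_name: str, parameters: Dict[str, str]
) -> List[str]:
    """Build the AWS CLI command that creates or updates the CloudFormation stack."""
    command = [
        "aws", "cloudformation", "deploy",
        "--template-file", template_file,
        "--stack-name", stack_name,
        # Assumption: the template creates named IAM resources; drop this if it does not.
        "--capabilities", "CAPABILITY_NAMED_IAM",
    ]
    if parameters:
        command.append("--parameter-overrides")
        command.extend(f"{key}={value}" for key, value in parameters.items())
    return command


if __name__ == "__main__":
    # Placeholder names; print the commands instead of running them.
    print(shlex.join(s3_sync_command("src/glue", "example-glue-bucket", "scripts")))
    print(shlex.join(cloudformation_deploy_command(
        "cloudformation/glue.yml", "example-glue-stack", {"Environment": "dev"}
    )))
```

Each returned list could be passed to subprocess.run(command, check=True) in an environment where AWS credentials are already configured, for example after the OIDC credential step in the workflow. Copying the code first matters because the Glue jobs in the template point at script locations in S3.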

Validation Of Deployment Within AWS

The deployment can be validated by logging into the AWS account and checking the AWS CloudFormation stack. Once the deployment completes, the CloudFormation console shows the stack with all the services deployed in it.

A screenshot titled, "snowflake gluestack"
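The same check can be scripted instead of done in the console. As a minimal sketch, the function below inspects the JSON printed by aws cloudformation describe-stacks and reports whether a stack finished successfully. The stack name example-glue-stack and the function name are placeholders.

```python
import json

# Stack statuses that indicate a successful create or update.
SUCCESS_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def stack_deployed(describe_stacks_output: str, stack_name: str) -> bool:
    """Return True if the named stack appears in the output with a success status."""
    stacks = json.loads(describe_stacks_output).get("Stacks", [])
    return any(
        stack.get("StackName") == stack_name
        and stack.get("StackStatus") in SUCCESS_STATUSES
        for stack in stacks
    )


if __name__ == "__main__":
    # Trimmed sample of `aws cloudformation describe-stacks --output json` output.
    sample = json.dumps(
        {"Stacks": [{"StackName": "example-glue-stack", "StackStatus": "CREATE_COMPLETE"}]}
    )
    print(stack_deployed(sample, "example-glue-stack"))
```

A check like this could run as a final workflow step so that a rolled-back or failed stack marks the pipeline run as failed.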

Best Practices For CI/CD Deployment

  • Organizations with a single AWS account for Dev/Test/Prod should name resources with prefixes such as DEV_ or TEST_ to avoid contention. It is recommended to establish separate roles for each environment and segregate access accordingly. Creating a single role to deploy dev, test, and prod resources poses a huge security risk.
  • Implement a review and approval process for Test/Prod before merging.
  • Implement continuous integration on pull requests/merge requests rather than on each push into feature branches.
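The first practice above can be sketched as a small naming helper that derives an environment, a resource prefix, and a deployment role from the branch being deployed. The branch-to-environment mapping and the role naming scheme are assumptions made for illustration; only the DEV_ and TEST_ style prefixes come from the recommendation itself.

```python
# Hypothetical mapping from Git branch to environment prefix.
ENVIRONMENTS = {"develop": "DEV", "test": "TEST", "master": "PROD"}


def environment_for_branch(branch: str) -> str:
    """Return the environment prefix for a branch, rejecting unknown branches."""
    try:
        return ENVIRONMENTS[branch]
    except KeyError:
        raise ValueError(f"No deployment environment is defined for branch {branch!r}")


def resource_name(environment: str, base_name: str) -> str:
    """Prefix a resource name with its environment, e.g. DEV_orders_job."""
    return f"{environment}_{base_name}"


def deploy_role_name(environment: str) -> str:
    """Return a separate (hypothetical) deployment role name for each environment."""
    return f"github-actions-deploy-{environment.lower()}"


if __name__ == "__main__":
    env = environment_for_branch("develop")
    print(resource_name(env, "orders_job"), deploy_role_name(env))
```

Keeping one role per environment means a workflow running on a feature or develop branch never holds permissions that could touch production resources.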

Closing

CI/CD is a critical process that requires configuration and setup much earlier in the overall project lifecycle. Defining the overall process with proper access, configuration, and automated regression testing is key to a successful setup of the DevOps process. 

While this blog covered a simple use case of deploying code in S3 and AWS services like Glue, the overall configuration and setup can become much more complicated with services like Kinesis, EMR, Lambda, etc. Different services might require different types of policies, and some services may need to be pipelined in a specific order for a successful deployment.

At phData, we offer accelerators and tools that can help streamline your entire DevOps process.
