This article was co-authored by Arnab Mondal and Samuel Hall.
In a previous post, we talked about setting up all the components necessary to create a pipeline for ingesting data from a custom source into the Snowflake Data Cloud using Fivetran. This involved setting up an AWS Lambda connector in Fivetran, which would query data from the Lambda function and pass it back to Fivetran.
Then in a subsequent post, we discussed the code necessary to include in that Lambda function so that it would be able to pull data from Slack.
In this post, we will talk about how to automate the setup and maintenance of the entire pipeline, from Slack to Snowflake, using Terraform, an infrastructure as code tool.
What is Infrastructure As Code?
Infrastructure as code (IaC) is the practice of defining infrastructure, such as data pipelines, in source code so that its creation and maintenance can be automated. In basic terms, we create configuration files that declaratively define what our pipeline is and what it will do, and then let a dedicated tool read that configuration and apply the changes to the pipeline itself.
Advantages of Infrastructure as Code
- Changes can be made faster and more reliably using a consistent development and deployment process.
- Declarative code management eliminates the need for the user to know whether a component needs to be created/updated/removed. What is declared in the configuration is what is applied. That’s it.
- Configuration can embed and enforce best practices and standards as new infrastructure is created.
- The configuration can be checked into a source control management system like GitHub. This allows for tracking and controlling changes.
Businesses that adopt IaC can streamline their infrastructure deployment processes and take advantage of the benefits listed above, while freeing their teams to focus on new features instead of maintenance and technical debt.
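To make the declarative idea concrete, here is a minimal Python sketch of how a tool could compare the declared configuration with what currently exists and work out what to create, update, or delete. It is only an illustration of the concept, not how Terraform is implemented, and the resource names and attributes are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical resource model: resource name -> attribute dictionary.
Resources = Dict[str, Dict[str, str]]


def plan(desired: Resources, current: Resources) -> List[Tuple[str, str]]:
    """Compare the declared configuration with what exists and list the actions needed."""
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
        # Resources that already match the declaration need no action.
    return actions


# What the configuration declares (illustrative values only).
desired = {
    "lambda_function": {"runtime": "python3.9"},
    "s3_bucket": {"name": "slack-data"},
}
# What currently exists in the target environment (illustrative values only).
current = {
    "lambda_function": {"runtime": "python3.8"},
    "old_bucket": {"name": "legacy"},
}

print(plan(desired, current))
```

The user only ever edits `desired`; working out whether each component needs to be created, updated, or removed is left to the tool.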
Slack to Snowflake Pipeline Overview
In case you have not read the previous blog posts mentioned in the introduction, here is an overview of what our Slack to Snowflake pipeline architecture looks like:
However, instead of manually setting all of this infrastructure up, we will code it and allow Terraform, our pipeline as code solution, to set it up for us!
What is Terraform?
Terraform is an infrastructure as code tool that lets developers write configuration defining the infrastructure that needs to exist (think of the components in our Slack to Snowflake pipeline). Terraform then manages the creation and updating of that infrastructure.
A working code example for this architecture has been created as a starting point and is stored here. Below, we will walk through the significant portions of that code and discuss its use.
Please carefully read the required environment prerequisites that are listed in the README. For the most part, they contain one-time account setup and access actions for the various products in use here. They must be in place for the code to execute without any issues.
In the architectural diagram above, there are four major groups of components:
- Slack
- AWS
- Fivetran
- Snowflake
The bottom three groups all require components that need to be created and maintained with various user-defined attributes. Thus, each of them has been organized as separate Terraform modules within the repository.
This way, they can be maintained separately while still preserving their dependencies on each other at the module level. This has advantages from both a code readability and reusability standpoint.
- Readability: Components for each group (AWS, Fivetran, Snowflake) are in separate folders/files and it is clear and intuitive to find them.
- Reusability: Modules can be instantiated multiple times with different values. For instance, if you wanted to load data from your single Slack workspace into two Snowflake accounts, you could define two Fivetran and two Snowflake modules in the main.tf file, each with its own account values. This would create two Fivetran destinations, one pointing at each Snowflake account, while both would query the single AWS module (which queries the single Slack workspace).
Let’s look at each of the groups one by one for a more detailed understanding of how each module functions.
Slack is where we are pulling data from. The only setup needed there is a one-time manual action to create an application and retrieve a bot token, as mentioned in our previous article. After that, we use the bot token as an input to the other modules for authentication back to Slack. This helps us build a secure connection to Slack and ensure data privacy is maintained.
The AWS module uses the Terraform AWS provider, which is responsible for creating the Lambda function, S3 buckets, and IAM policy required to set up the whole AWS environment. These components are split into separate files, each of which supplies the required parameters. Artifacts that previously required manual interaction with the AWS console, such as the JSON policy document that had to be copied and pasted in our previous article, are now maintained in the Terraform code.
The Fivetran module uses the Terraform provider for Fivetran, which is responsible for creating the required connectors, destinations, and groups. The provider sets up these components exactly as described in the configuration, so they are set up the same way every time the code is executed on the target environment.
The Snowflake module uses the Terraform provider for Snowflake to set up all the users, roles, grants, and databases necessary for ingestion. Automating the setup of these components with Terraform ensures that the Snowflake infrastructure is set up exactly as described in the configuration every time.
The code executed by the Lambda function is located in the aws/lambda directory. This code depends on the requests PyPI package, which can be installed into that same directory before it is zipped up for deployment to the AWS Lambda service by Terraform.
The Terraform Lambda function resource points to the path of the .zip file that contains the packaged Lambda code and related dependencies (such as the requests package described above), and then deploys it to the AWS Lambda service.
The fact that the Lambda function code is nested within the aws/ directory alongside the associated Terraform code is purely our own design decision. Since that code is used only by the Lambda Terraform resource, and the two are mutually dependent, we thought it made sense to place it alongside the Terraform code instead of isolating it elsewhere (though this is not necessarily the right or only way to do it).
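As a rough illustration of the packaging step, the sketch below zips a source directory so that the handler and its installed dependencies sit at the root of the archive. The function name `package_lambda` and the file names in the demo are hypothetical, and the example repository may well build its archive differently. It assumes dependencies were already installed into the source directory, for example with `pip install requests --target aws/lambda`.

```python
import tempfile
import zipfile
from pathlib import Path


def package_lambda(source_dir: str, zip_path: str) -> list:
    """Zip every file under source_dir, storing paths relative to source_dir."""
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    written = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != target:
                arcname = path.relative_to(source).as_posix()
                archive.write(path, arcname)
                written.append(arcname)
    return written


# Demo against a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "lambda"
    (src / "requests").mkdir(parents=True)
    (src / "lambda_function.py").write_text("def handler(event, context): ...\n")
    (src / "requests" / "__init__.py").write_text("")
    print(package_lambda(str(src), str(Path(tmp) / "lambda.zip")))
```

Keeping paths relative to the source directory matters because the Lambda runtime looks for the handler module at the root of the archive.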
Local Usage and Testing
This code has been set up to be executed exclusively locally for demonstration purposes. The run.local.sh script has been created to make it easy to manage the local execution for the user but is obviously not the only way to run this code. It does, however, provide a pattern for how this code can be used.
If using the run.local.sh script, a .env file should be created for conveniently setting up the various environment variables containing necessary user-provided values and secrets for the accounts used in this architecture. A template for this .env is provided in the README file of the repository.
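For readers who want to see the idea behind the .env file outside of a shell script, here is a small Python sketch that parses simple KEY=VALUE lines into a dictionary that could be merged into the process environment before invoking Terraform. The variable names are hypothetical (the real template is in the repository README). The `TF_VAR_` prefix is Terraform's convention for supplying input variables through the environment, and whether the example repository relies on it is an assumption here.

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical contents; these are not the variable names from the repository.
sample = """
# Slack
TF_VAR_slack_bot_token="xoxb-example"
export TF_VAR_snowflake_account=ab12345
"""

# Merge the parsed values over the current environment.
env = {**os.environ, **parse_env(sample)}
print(parse_env(sample))
# The merged mapping could then be passed along, for example:
#   subprocess.run(["terraform", "plan"], env=env, check=True)
```

This parser deliberately handles only the simplest .env syntax. Secrets such as the bot token should stay out of source control regardless of how they are loaded.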
Note On Backend Use:
Learn more about Terraform backends and all the features of Terraform state management here.
Ultimately, the example repository exhibited in this post is best used as a part of an automated CI/CD pipeline. This pipeline could then automatically deploy approved configuration changes to your Slack to Snowflake infrastructure.
This is the point at which the advantages of infrastructure as code can be fully leveraged within your organization, allowing your teams to spend more time creating value instead of manually maintaining pipeline infrastructure.
Feedback on the example code included in this post can be provided in the form of a GitHub issue or even a pull request.
If your team is looking to use this code to automate your Fivetran pipelines or set up a similar implementation of this pattern and could use phData’s help and expertise to explore that use case, please do not hesitate to reach out!