July 1, 2020

Building Data Pipelines with AWS CloudFormation

By Nick Goble

Even on a modern cloud or hybrid platform, launching infrastructure manually through the console is an anti-pattern for production-ready infrastructure, because it removes the ability to test and provision in a repeatable way. Traditional infrastructure provisioning and manual deployment often delay the overall delivery pipeline, largely because of missing tests and longer feedback loops.

Continuous Deployment is one of the fundamental principles of DevOps, in which software features are deployed to various environments through an automated process. In this blog, we’ll briefly discuss the importance of infrastructure-as-code and then focus on how it meshes with Continuous Deployment.

Infrastructure-as-code

Whether you are building a machine learning pipeline, a cloud native data warehouse, an IoT platform, or a data lake, the solution often depends on a complex foundation of cloud services including networking, security, storage, and many other infrastructure components. Integrating these components into the deployment pipeline is a complex task, and it is difficult to repeat reliably when done manually.

Infrastructure-as-code (IaC) allows you to construct your infrastructure and application resources using development methodologies, making it easy to provision your infrastructure quickly and in a repeatable way without any manual actions.

Treating infrastructure changes as code brings the same benefits as the typical application build process: changes pushed to a code repository, versioning, and integration with the Continuous Integration (CI) pipeline. This concept becomes even more crucial in cloud native environments, where applications could be deployed into production hundreds of times a day.

Building a data pipeline with AWS CloudFormation

Continuous deployment is a fairly new concept that is still evolving, and it can be overwhelming for teams to apply IaC principles and bring the benefits of continuous delivery to cloud infrastructure. Getting your infrastructure into production involves a lot of moving parts within the developer workflow:

  • How the new solution integrates with other tools within the organization
  • How the new infrastructure code will be tested
  • Who approves the changes
  • How the infrastructure gets automatically provisioned


On top of these questions, the choice of a build tool can be critical too, and is often affected by factors like integration with the cloud platform, the organization’s DevOps maturity level, or in-house technical capabilities.

For instance, on the AWS cloud platform, CloudFormation is often an obvious and convenient choice: it provisions infrastructure from templates written in YAML or JSON. It integrates closely with related AWS services, and its built-in dependency and state management features make resources easy to manage, with rollback options in case of failures.
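To make the template idea concrete, here is a minimal sketch in Python that assembles a CloudFormation template as a plain dictionary and serializes it to JSON. The function name `build_bucket_template`, the logical ID `RawDataBucket`, and the description text are illustrative assumptions, not something taken from phData's tooling. The bucket is deliberately left unnamed so that CloudFormation generates the physical ID itself.

```python
import json


def build_bucket_template(logical_id: str = "RawDataBucket") -> dict:
    """Return a minimal CloudFormation template declaring one S3 bucket.

    The logical ID is a hypothetical name chosen for this example.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative landing bucket for a data pipeline",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Keep prior object versions so changes can be undone.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
        "Outputs": {
            # Ref on an S3 bucket resolves to the generated bucket name.
            "BucketName": {"Value": {"Ref": logical_id}},
        },
    }


template = build_bucket_template()
template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```

Because the template is ordinary data, it can be committed, reviewed, and versioned like any other source file before it is ever handed to CloudFormation.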

However, choosing the right tools solves just one piece of the puzzle. There often are other integration challenges specific to your use case and dependencies. Cloud services offer a lot of functionality out of the box, but you are also tied to the functionality provided by those services; you often don’t have the flexibility to fix gaps yourself.

Introducing phData's Cloud Foundation

So here’s how phData can help. Our continuous integration and delivery tool, Cloud Foundation, lets you build infrastructure seamlessly. It automates infrastructure builds, which reduces errors and removes the need for manual console intervention. All your cloud infrastructure automation is stored in a repository, which supports dynamic templates generated through Troposphere and Sceptre.

DevOps Workflow Using Infrastructure-as-Code with Cloud Foundation

The process for introducing new infrastructure resources looks like this:

  1. Developers and engineers commit the infrastructure as code to the source code repository. When the changes are ready to deploy, the developers create a pull request.
  2. The Cloud Foundation pull request action is triggered and posts a comment to the pull request, detailing the exact changes that the pull request will make when it is merged.
  3. The approver of the pull request reviews the comment posted by Cloud Foundation and approves the change.
  4. Finally, when the code is merged to the master branch, the pull request merge action is triggered. This deploys the infrastructure changes in the cloud and posts back a message with a summary.
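The kind of comment described in step 2 can be pictured with a small sketch. This is not Cloud Foundation's implementation, which this post does not show; it is a simplified local comparison of two template dictionaries, and the function name `summarize_changes` and the output format are assumptions made for illustration. A real tool would more likely rely on CloudFormation's own change reporting.

```python
def summarize_changes(current: dict, proposed: dict) -> str:
    """Describe resource-level differences between two CloudFormation templates.

    Both arguments are parsed templates (dicts with a "Resources" section).
    """
    old = current.get("Resources", {})
    new = proposed.get("Resources", {})
    lines = []
    # Walk every logical ID that appears in either template, in stable order.
    for name in sorted(set(old) | set(new)):
        if name not in old:
            lines.append(f"+ add {name} ({new[name]['Type']})")
        elif name not in new:
            lines.append(f"- remove {name} ({old[name]['Type']})")
        elif old[name] != new[name]:
            lines.append(f"~ modify {name} ({new[name]['Type']})")
    return "\n".join(lines) if lines else "No infrastructure changes."


# Hypothetical templates: what is deployed today vs. what the pull request proposes.
deployed = {"Resources": {"RawDataBucket": {"Type": "AWS::S3::Bucket"}}}
pull_request = {
    "Resources": {
        "RawDataBucket": {"Type": "AWS::S3::Bucket"},
        "IngestQueue": {"Type": "AWS::SQS::Queue"},
    }
}
print(summarize_changes(deployed, pull_request))
```

The point of the sketch is the review loop: the approver in step 3 sees a plain summary of what will change before anything is deployed.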


Cloud Foundation also allows you to create your cloud resources in a more flexible manner. AWS has a lot of functionality, but it takes time for all of it to become available within CloudFormation. For example, to configure S3 bucket notifications within CloudFormation, you would need to know the physical ID of your S3 bucket before it is created. You could manually assign a physical ID to your bucket, but this goes against CloudFormation best practices. Since Cloud Foundation is built on top of Sceptre, this dependency tree is managed for you automatically, which allows you to reference your bucket correctly.

Other issues, such as enabling Redshift enhanced VPC routing and Glue resource policies, are not as easily solved. For these, you can use Cloud Foundation to configure CloudFormation custom resources in tandem with a Lambda function that runs after your resources have been created. While this works, it adds complexity and fragility to your code.
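For readers who have not written one, a Lambda-backed custom resource roughly follows the shape below. This is a hedged sketch, not phData's code: the `apply_change` hook, the default physical ID `post-deploy-config`, and the injectable `send` parameter are assumptions introduced so the example stays self-contained. In practice, `apply_change` might be a function that calls the AWS SDK, for instance to switch on enhanced VPC routing for a Redshift cluster. What is fixed by CloudFormation is the contract: the function receives a request event and must PUT a JSON result to the pre-signed `ResponseURL`.

```python
import json
import urllib.request
from typing import Callable, Optional


def build_response(event: dict, status: str, physical_id: str,
                   reason: str = "", data: Optional[dict] = None) -> dict:
    """Build the JSON body CloudFormation expects back from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(url: str, body: dict) -> None:
    """PUT the result to the pre-signed URL supplied in the request event."""
    payload = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, method="PUT", headers={"Content-Type": ""}
    )
    urllib.request.urlopen(request)


def handler(event: dict, context: object,
            apply_change: Callable[[dict], None] = lambda props: None,
            send: Callable[[str, dict], None] = send_response) -> dict:
    """Lambda entry point; apply_change is a hypothetical hook for the real work."""
    # Create events carry no physical ID yet, so fall back to a placeholder.
    physical_id = event.get("PhysicalResourceId", "post-deploy-config")
    try:
        if event["RequestType"] in ("Create", "Update"):
            apply_change(event.get("ResourceProperties", {}))
        # Delete requests are acknowledged without doing anything.
        body = build_response(event, "SUCCESS", physical_id)
    except Exception as exc:
        # Always report failures; otherwise the stack waits until it times out.
        body = build_response(event, "FAILED", physical_id, reason=str(exc))
    send(event["ResponseURL"], body)
    return body
```

Even in this small form, the extra moving parts are visible: error handling, a response protocol, and a function that has to be deployed and maintained alongside the template.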

As phData continues to help customers migrate into cloud environments, we continue to identify these gaps and guide our customers through solutions to them. While custom tooling is currently the main way around these issues, AWS prioritizes fixes based on votes on the corresponding GitHub issues. If you’ve been impacted by any of these, or by any other CloudFormation issue, please go and vote on the CloudFormation Coverage Roadmap.

Verdict

There is no one-size-fits-all solution for infrastructure-as-code. Every business scenario is unique, and many factors come into play, such as an organization’s preference for cloud vendors, a multi-cloud vs. hybrid cloud setup, and open source vs. proprietary solutions, to name a few.

At phData, we try to help you make this decision stress-free. Our Cloud Foundation solution makes it easier to get the most out of AWS CloudFormation, and enables better data governance by limiting user access to the AWS console, so that all of your infrastructure is managed only through code. All of this is achieved with 100 percent AWS native tools, and can be further customized depending on your requirements.

Want a demo of phData Cloud Foundation? Reach out.

This post was written by Nick Goble and Satya Panthri.
