Case Study

CloudFormation at a Top U.S. Life Insurance Company

Executive Summary

A top-10 U.S. life insurance company needed to complete multiple large-scale data engineering projects that orchestrate workflows over large volumes of data. One of the biggest challenges was developing a repeatable pattern for managing infrastructure at the scale the company operates. With hundreds of engineers and many concurrent projects, it also needed a way to enforce governance and baseline security requirements. As part of this initiative, phData delivered an automated pipeline for continuous deployment of Airflow and its underlying infrastructure, Amazon ECS, to run the data engineering pipelines.

Workload

The Airflow CloudFormation templates are divided into two sets. The first set defines all of the resources that the Airflow master service requires: an S3 bucket and an SQS queue for managing admin workflow jobs, a shared KMS key for encryption throughout the system, IAM roles and policies that grant Airflow access to services such as KMS and Secrets Manager, and, of course, the ECS cluster that the Airflow master containers run on.

The ECS cluster runs a worker container, a web container, and a scheduler container, all defined within the CloudFormation templates. The Airflow master set consists of five templates: one each for IAM roles and policies, the ECS cluster and task definitions, KMS keys and policies, RDS, and EFS. This split lets additional stakeholders review IAM role and policy changes while the development team iterates quickly on the other services.

All of these templates use CloudFormation stack outputs and the Fn::ImportValue function. For example, the ECS template is the last stack to be deployed, because it depends on outputs from the other stacks, such as IAM role ARNs and the EFS service name.
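The ordering constraint follows directly from the exports and imports. The Python sketch below assumes templates are loaded as plain dictionaries in CloudFormation's JSON shape; the stack names (`airflow-iam`, `airflow-efs`, `airflow-ecs`) and export names are hypothetical. It reads each template's exported outputs and literal `Fn::ImportValue` references and derives a deployment order in which every exporting stack comes before the stacks that import from it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def exports_of(template):
    """Return the export names declared in a template's Outputs section."""
    return {
        output["Export"]["Name"]
        for output in template.get("Outputs", {}).values()
        if "Export" in output
    }


def imports_of(node):
    """Recursively collect literal Fn::ImportValue names anywhere in a template."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Fn::ImportValue" and isinstance(value, str):
                found.add(value)
            else:
                found |= imports_of(value)
    elif isinstance(node, list):
        for item in node:
            found |= imports_of(item)
    return found


def deployment_order(templates):
    """Order stacks so each one deploys after every stack it imports from."""
    exporter = {name: stack for stack, t in templates.items() for name in exports_of(t)}
    graph = {}
    for stack, template in templates.items():
        imported = imports_of(template)
        missing = imported - set(exporter)
        if missing:
            raise ValueError(f"{stack} imports unknown exports: {sorted(missing)}")
        graph[stack] = {exporter[name] for name in imported}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical, heavily trimmed templates for three of the master stacks.
iam = {"Resources": {"TaskRole": {"Type": "AWS::IAM::Role"}},
       "Outputs": {"TaskRoleArn": {"Value": {"Fn::GetAtt": ["TaskRole", "Arn"]},
                                   "Export": {"Name": "airflow-task-role-arn"}}}}
efs = {"Resources": {"FileSystem": {"Type": "AWS::EFS::FileSystem"}},
       "Outputs": {"FileSystemId": {"Value": {"Ref": "FileSystem"},
                                    "Export": {"Name": "airflow-efs-id"}}}}
ecs = {"Resources": {"TaskDefinition": {
    "Type": "AWS::ECS::TaskDefinition",
    "Properties": {
        "TaskRoleArn": {"Fn::ImportValue": "airflow-task-role-arn"},
        "Volumes": [{"Name": "airflow", "EFSVolumeConfiguration": {
            "FilesystemId": {"Fn::ImportValue": "airflow-efs-id"}}}],
    }}}}

order = deployment_order({"airflow-ecs": ecs, "airflow-iam": iam, "airflow-efs": efs})
print(order)
```

Because the order is computed from the templates themselves, adding a new export/import pair changes the deployment sequence without anyone maintaining a hand-written list.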

Each data engineering project also gets its own CloudFormation stack, defined by a single template that creates an ECS worker cluster with the worker task definition. The template also includes IAM roles for the cluster and task that grant the project access only to the services it needs. For example, a project focused on machine learning can be given specific SageMaker permissions, while an EMR initiative can be given limited EMR permissions. This limits the security scope of every project.
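Per-project scoping can be sketched as a function that builds an IAM policy document (the standard `Version`/`Statement` JSON shape) from a project type. The mapping from project type to actions is hypothetical and illustrative, not the customer's actual policy set; a real template would usually also narrow `Resource` to specific ARNs.

```python
import json

# Hypothetical allow-lists per project type. Action names are real IAM
# actions, but which ones a given project needs is an assumption here.
PROJECT_ACTIONS = {
    "ml": ["sagemaker:CreateTrainingJob", "sagemaker:DescribeTrainingJob"],
    "emr": ["elasticmapreduce:RunJobFlow", "elasticmapreduce:DescribeCluster",
            "elasticmapreduce:TerminateJobFlows"],
}


def project_policy(project_type, resource_arns=("*",)):
    """Build a least-privilege policy document for one project's task role."""
    try:
        actions = PROJECT_ACTIONS[project_type]
    except KeyError:
        raise ValueError(f"no permissions defined for project type {project_type!r}") from None
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(actions),
            "Resource": list(resource_arns),
        }],
    }


policy = project_policy("ml")
print(json.dumps(policy, indent=2))
```

Failing loudly on an unknown project type, rather than falling back to a broad default, keeps a new project from silently inheriting more access than it needs.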

All of the CloudFormation templates use AWS-specific parameter types where appropriate. For example, the ECS cluster must be deployed into subnets, so the subnet IDs parameter is of type List<AWS::EC2::Subnet::Id>, which is then passed to the Auto Scaling Group resource with the Ref function.
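A trimmed template fragment, written as a Python dict for consistency with the other sketches, shows the pattern. The parameter name `SubnetIds`, the resource name `WorkerAutoScalingGroup`, and the sizes are hypothetical; the launch template and other required Auto Scaling properties are omitted. The helper `unresolved_refs` is an illustrative pre-flight check, not part of any AWS tool, that flags `Ref` targets that are neither declared parameters, declared resources, nor `AWS::` pseudo parameters.

```python
import json

template = {
    "Parameters": {
        "SubnetIds": {
            # AWS-specific list type: CloudFormation validates that each value
            # is an existing subnet ID in the account and Region.
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Subnets for the ECS worker instances",
        },
    },
    "Resources": {
        "WorkerAutoScalingGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "1",
                "MaxSize": "4",
                # Ref on a list parameter yields the list of subnet IDs.
                "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                # Launch template and other required properties omitted.
            },
        },
    },
}


def unresolved_refs(node, declared):
    """Collect Ref targets that are not declared and not AWS:: pseudo parameters."""
    missing = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                if value not in declared and not value.startswith("AWS::"):
                    missing.add(value)
            else:
                missing |= unresolved_refs(value, declared)
    elif isinstance(node, list):
        for item in node:
            missing |= unresolved_refs(item, declared)
    return missing


declared = set(template["Parameters"]) | set(template["Resources"])
print(unresolved_refs(template["Resources"], declared))  # set()
print(json.dumps(template["Parameters"], indent=2))
```

The typed parameter pushes validation to before resource creation: an invalid subnet ID fails at parameter validation rather than partway through creating the Auto Scaling group.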

Deployment

The deployment process for the CloudFormation templates uses Jenkins as its primary service. All templates are stored in the on-premises GitHub Enterprise instance, which adds several layers of security before any code is deployed. Namely, no single person can merge to a master branch: every merge requires approval from another member of the Active Directory group associated with the repository.
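In the described setup, GitHub Enterprise enforces this rule itself. The short Python sketch below only models the rule's logic so the condition is precise; `PullRequest`, `can_merge`, and the group members are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    author: str
    approvers: set = field(default_factory=set)


def can_merge(pr, repo_group):
    """Two-person rule: at least one approver who is not the author
    and who belongs to the repository's Active Directory group."""
    return any(a != pr.author and a in repo_group for a in pr.approvers)


repo_group = {"alice", "bob", "carol"}  # hypothetical AD group members
self_approved = PullRequest(author="alice", approvers={"alice"})
peer_approved = PullRequest(author="alice", approvers={"bob"})
outsider_approved = PullRequest(author="alice", approvers={"mallory"})

# prints: False True False
print(can_merge(self_approved, repo_group),
      can_merge(peer_approved, repo_group),
      can_merge(outsider_approved, repo_group))
```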

When code is pushed to any branch, a Jenkins job runs cfn-lint to ensure each template is well-formed. It also runs cfn_nag to ensure the template meets standard security practices, such as not opening a security group to 0.0.0.0/0 and requiring IAM roles to include a permissions boundary. Standard naming conventions for certain resources are enforced as well.
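cfn_nag ships its own rule set, so the Python sketch below is not cfn_nag. It is a hypothetical, simplified check in the same spirit that flags the three kinds of issues mentioned above. It assumes templates are already loaded as dicts, and because the customer's actual naming conventions aren't described, it uses a hypothetical `<project>-<env>-<name>` pattern for IAM role names.

```python
import re

# Hypothetical convention, e.g. "airflow-dev-task".
NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|qa|prod)-[a-z0-9-]+$")


def _open_to_world(rule):
    """True if an ingress rule allows any IPv4 or IPv6 source."""
    return rule.get("CidrIp") == "0.0.0.0/0" or rule.get("CidrIpv6") == "::/0"


def find_violations(template):
    """Return human-readable findings for a CloudFormation template dict."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})

        # Ingress can be inline on the group or a standalone resource.
        if rtype == "AWS::EC2::SecurityGroup":
            rules = props.get("SecurityGroupIngress", [])
        elif rtype == "AWS::EC2::SecurityGroupIngress":
            rules = [props]
        else:
            rules = []
        if any(_open_to_world(rule) for rule in rules):
            findings.append(f"{logical_id}: ingress open to the whole internet")

        if rtype == "AWS::IAM::Role":
            if "PermissionsBoundary" not in props:
                findings.append(f"{logical_id}: IAM role has no permissions boundary")
            name = props.get("RoleName")
            if isinstance(name, str) and not NAME_PATTERN.match(name):
                findings.append(f"{logical_id}: role name {name!r} breaks naming convention")
    return findings


example = {
    "Resources": {
        "WebSG": {"Type": "AWS::EC2::SecurityGroup", "Properties": {
            "GroupDescription": "web",
            "SecurityGroupIngress": [{"IpProtocol": "tcp", "FromPort": 443,
                                      "ToPort": 443, "CidrIp": "0.0.0.0/0"}]}},
        "TaskRole": {"Type": "AWS::IAM::Role", "Properties": {
            "RoleName": "TaskRole", "AssumeRolePolicyDocument": {}}},
    }
}
findings = find_violations(example)
for finding in findings:
    print(finding)
```

In a pipeline, a check like this would run alongside, not instead of, `cfn-lint` and cfn_nag's `cfn_nag_scan` command, failing the build whenever `findings` is non-empty.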

Each repository contains a branch for every environment the application stack supports. In the case of our Airflow solution, there are dev, qa, and master branches, with master representing the production environment. Configuration files hold the CloudFormation input parameters for each environment, including a parameter that determines which AWS account the deployment targets. For example, the dev branch configures a different account alias than the production branch.
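A deploy step might turn an environment's configuration file into CloudFormation parameters as in the Python sketch below. The JSON layout (`AccountAlias` plus a `Parameters` map), the branch-to-environment mapping, and the alias values are all hypothetical. The output list of `{"ParameterKey": ..., "ParameterValue": ...}` entries is the shape the CloudFormation API (for example boto3's `create_change_set`) accepts, and the function refuses to continue if the configured alias does not match the account the pipeline is authenticated to.

```python
import json

# Hypothetical mapping; master represents production.
BRANCH_TO_ENV = {"dev": "dev", "qa": "qa", "master": "prod"}


def to_cfn_parameters(branch, config_text, current_account_alias):
    """Return CloudFormation parameters for the branch's environment,
    after checking that the config targets the account we are deploying to."""
    env = BRANCH_TO_ENV[branch]
    config = json.loads(config_text)
    target = config["AccountAlias"]
    if target != current_account_alias:
        raise RuntimeError(f"{env} config targets account {target!r}, "
                           f"but the pipeline is authenticated to {current_account_alias!r}")
    params = []
    for key, value in sorted(config["Parameters"].items()):
        # List-type parameters are passed as one comma-separated string.
        text = ",".join(value) if isinstance(value, list) else str(value)
        params.append({"ParameterKey": key, "ParameterValue": text})
    return params


prod_config = json.dumps({
    "AccountAlias": "example-prod",  # hypothetical alias
    "Parameters": {"SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"], "WorkerCount": 4},
})
params = to_cfn_parameters("master", prod_config, current_account_alias="example-prod")
print(params)
```

With boto3, the alias of the current account can be read from `boto3.client("iam").list_account_aliases()["AccountAliases"]`, which turns a misrouted deployment into an early, explicit failure.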

Results

The templates and deployment model worked extremely well for the solution. They enabled separation of duties from a security perspective and provided a repeatable deployment pattern that let engineers make continuous changes to the environment with little effort.

The solution also allowed for appropriate governance. For example, to deploy to production, the pipeline creates CloudFormation change sets, and Jenkins holds their execution until the appropriate parties approve them. Overall, the solution gave the customer both a repeatable process for deploying the Airflow stack and an understanding of how to apply this pattern to other deployments.
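The hold-for-approval step can be sketched in Python as a function that takes a CloudFormation client, assuming boto3's client interface (`create_change_set`, the `change_set_create_complete` waiter, `describe_change_set`, `execute_change_set`, `delete_change_set`). In Jenkins, the `approve` callback would correspond to a manual input step; here it is a plain function. `DryRunClient` is a hypothetical stand-in that records calls so the flow can be exercised without AWS credentials.

```python
from types import SimpleNamespace


def deploy_with_approval(cfn, stack_name, template_body, parameters, approve,
                         change_set_name="jenkins-deploy"):
    """Create a change set, wait for it, and execute it only if approved."""
    cfn.create_change_set(
        StackName=stack_name,
        ChangeSetName=change_set_name,
        TemplateBody=template_body,
        Parameters=parameters,
        ChangeSetType="UPDATE",  # assumes the stack already exists
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    cfn.get_waiter("change_set_create_complete").wait(
        StackName=stack_name, ChangeSetName=change_set_name)
    changes = cfn.describe_change_set(
        StackName=stack_name, ChangeSetName=change_set_name)["Changes"]
    if not approve(changes):
        # Rejected: discard the change set so it cannot be executed later.
        cfn.delete_change_set(StackName=stack_name, ChangeSetName=change_set_name)
        return False
    cfn.execute_change_set(StackName=stack_name, ChangeSetName=change_set_name)
    return True


class DryRunClient:
    """Hypothetical stand-in for a boto3 CloudFormation client."""

    def __init__(self, changes):
        self.calls = []
        self._changes = changes

    def create_change_set(self, **kwargs):
        self.calls.append("create_change_set")

    def get_waiter(self, name):
        return SimpleNamespace(wait=lambda **kwargs: self.calls.append(f"wait:{name}"))

    def describe_change_set(self, **kwargs):
        return {"Changes": self._changes}

    def delete_change_set(self, **kwargs):
        self.calls.append("delete_change_set")

    def execute_change_set(self, **kwargs):
        self.calls.append("execute_change_set")


approved_run = DryRunClient(changes=[{"Type": "Resource"}])
deploy_with_approval(approved_run, "airflow-ecs", "{}", [], approve=lambda changes: True)
rejected_run = DryRunClient(changes=[{"Type": "Resource"}])
deploy_with_approval(rejected_run, "airflow-ecs", "{}", [], approve=lambda changes: False)
print(approved_run.calls)
print(rejected_run.calls)
```

In a real pipeline, pass `boto3.client("cloudformation")` instead of the stub. Note that the `change_set_create_complete` waiter raises an error when a change set contains no changes (its status ends as `FAILED`), so a production version would need to handle that case.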

phData demonstrated the value of automation. On the security side, the templates established a consistent structure for security groups, IAM policies, and other security-related AWS services. Engineers could reuse existing templates and patterns for those services to follow best practices, and the linting process provided automated guardrails.

The customer no longer required a static EMR cluster. Using Airflow and CloudFormation, EMR clusters ran only when data processing was required, which drastically reduced instance costs.
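The case study doesn't describe the exact mechanism. One plausible pattern, sketched below in Python under that assumption, is an Airflow task that creates an EMR stack, runs the processing step, and always deletes the stack afterward. It assumes boto3's CloudFormation client interface (`create_stack`, `describe_stacks`, `delete_stack`, and the `stack_create_complete` / `stack_delete_complete` waiters). The stack name, the `ClusterId` output, and `StubClient` are hypothetical.

```python
from types import SimpleNamespace


def run_transient(cfn, stack_name, template_body, parameters, process):
    """Create a stack, run `process` on its outputs, and always delete the
    stack afterward, so the cluster only exists while data is processed."""
    cfn.create_stack(StackName=stack_name, TemplateBody=template_body,
                     Parameters=parameters, Capabilities=["CAPABILITY_NAMED_IAM"])
    try:
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        outputs = {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
        return process(outputs)
    finally:
        # Runs on success, on processing errors, and on failed creation.
        cfn.delete_stack(StackName=stack_name)
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)


class StubClient:
    """Hypothetical stand-in for a boto3 CloudFormation client."""

    def __init__(self):
        self.calls = []

    def create_stack(self, **kwargs):
        self.calls.append("create_stack")

    def get_waiter(self, name):
        return SimpleNamespace(wait=lambda **kwargs: self.calls.append(f"wait:{name}"))

    def describe_stacks(self, **kwargs):
        return {"Stacks": [{"Outputs": [{"OutputKey": "ClusterId",
                                         "OutputValue": "j-EXAMPLE"}]}]}

    def delete_stack(self, **kwargs):
        self.calls.append("delete_stack")


stub = StubClient()
result = run_transient(stub, "emr-nightly", "{}", [],
                       process=lambda outputs: outputs["ClusterId"])
print(result, stub.calls)
```

The `try`/`finally` is the important design choice: a failed job still tears the cluster down, so a processing error does not leave instances running. Airflow's Amazon provider package also offers `CloudFormationCreateStackOperator` and `CloudFormationDeleteStackOperator`, which could express the same create and delete steps as separate DAG tasks.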
