Choosing a workflow and scheduling solution for a cloud data project involves trade-offs in scaling, ecosystem support, error handling, and security. You want a technology that can handle the complexity of data management for your projects.
phData has spent a lot of time exploring and implementing multiple data workflow technologies including cloud-native data workflows within AWS. This post will cover two specific technologies, AWS Data Pipeline and Apache Airflow, and provide a solid foundation for choosing workflow solutions in the cloud.
AWS Data Workflow Options
AWS Data Pipeline is a native AWS service that provides the capability to transform and move data within the AWS ecosystem.
Apache Airflow is an open-source data workflow solution originally developed at Airbnb and now maintained as a project of the Apache Software Foundation. It provides the capability to develop complex programmatic workflows with many external dependencies.
Scheduling, Workflow & Orchestration
AWS Data Pipeline
Data Pipeline supports simple workflows for a select list of AWS services including S3, Redshift, DynamoDB and various SQL databases. Data Pipeline also supports scheduling work or “tasks” on EC2 instances. For scheduling, Data Pipeline supports time-based schedules, similar to Cron, or you can trigger a pipeline from an event, for example by putting an object into S3 and using a Lambda function to start the pipeline.
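The S3-plus-Lambda trigger described above could be sketched roughly as follows. This is a minimal illustration, not a production handler: it assumes the pipeline has already been defined so that it can be activated on demand, and the PIPELINE_ID environment variable and the make_handler helper are hypothetical names introduced here. The only AWS call used is the Data Pipeline ActivatePipeline operation via boto3.

```python
import os
import urllib.parse


def make_handler(datapipeline_client, pipeline_id):
    """Return a Lambda-style handler that activates one pipeline per S3 event."""

    def handler(event, context):
        # An S3 "ObjectCreated" notification carries one or more records;
        # object keys arrive URL-encoded, so decode them for logging.
        keys = [
            urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            for record in event.get("Records", [])
        ]
        if not keys:
            return {"activated": False, "keys": []}
        # ActivatePipeline starts a run of an already-defined pipeline.
        datapipeline_client.activate_pipeline(pipelineId=pipeline_id)
        return {"activated": True, "keys": keys}

    return handler


def lambda_handler(event, context):
    # boto3 is imported lazily so make_handler can be exercised locally
    # with a stub client and without AWS credentials.
    import boto3

    client = boto3.client("datapipeline")
    # PIPELINE_ID is a hypothetical environment variable set on the function.
    return make_handler(client, os.environ["PIPELINE_ID"])(event, context)
```

Passing the client in, rather than creating it inside the handler, keeps the event-parsing logic testable without touching AWS.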
Data Pipeline struggles with handling integrations that reside outside of the AWS ecosystem — for example, if you want to integrate data from Salesforce.com. You can trigger 3rd party integration steps from an EC2 script task, but it’s up to you to manage the integration points and error handling between Data Pipeline and the external services. Our experience has been that this results in extra effort and maintenance.
One last important item regarding workflow and Data Pipeline is that AWS Glue has similar capabilities. Glue is a serverless Spark execution framework meant specifically for ETL workflows. If you happen to be using Data Pipeline to execute ETL within EMR and don’t require additional services such as Hive or HBase, then Glue is a very good alternative. AWS manages the underlying compute resources for Glue, so capacity is scaled to meet the needs of your ETL workflow. With Data Pipeline and EMR, you are responsible for sizing the compute capacity appropriately.
Apache Airflow
Airflow is a workflow and orchestration tool that can orchestrate tasks that reside inside as well as outside of the AWS environment. Airflow does this by providing a large open-source library of plugins for 3rd party vendors such as Salesforce and AWS. This positions it as a tool that can help manage services such as AWS Data Pipeline or AWS Glue. Because Airflow runs on virtually any compute environment, it can also support cloud vendors other than AWS.
Another huge benefit of using Airflow is the approach to developing workflows. It uses Directed Acyclic Graphs, or DAGs for short, to define tasks and dependencies. DAGs are written in Python and can support very complex dependencies within a workflow. This allows an engineering team to develop workflows using a standard software engineering process. Developers can push code to Git, and these DAGs can be deployed to the Airflow cluster. You can even follow a versioning and release process.
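As a rough illustration of what a Python-defined DAG looks like, here is a minimal sketch assuming Airflow 2.4 or later in the 2.x line (the schedule argument and the airflow.operators.python import path differ in other versions). The DAG id, task names, and placeholder functions are hypothetical. The task logic is kept in plain functions so it can be unit tested like any other code, which is part of what makes the Git-based workflow practical.

```python
from datetime import datetime


def extract():
    # Placeholder for pulling rows from a source system.
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]


def transform(rows):
    # Placeholder transformation: total the amounts.
    return sum(row["amount"] for row in rows)


def transform_from_xcom(ti):
    # Airflow passes the task instance as "ti"; pull the upstream result.
    return transform(ti.xcom_pull(task_ids="extract"))


def build_dag():
    # Imported here so the functions above can be tested on a machine
    # without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="example_daily_totals",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(
            task_id="transform", python_callable=transform_from_xcom
        )
        # The dependency: transform runs only after extract succeeds.
        extract_task >> transform_task
    return dag


# In a real DAG file, assign the result at module level so the scheduler
# can discover it:  dag = build_dag()
```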
AWS Data Pipeline only supports DynamoDB, SQL (e.g., Postgres), Redshift, and S3 as data sources within the pipeline. Additional sources, say from 3rd party/SaaS vendors, would need to be loaded into one of these stores to be utilized in the pipeline.
Airflow has many connectors to support 3rd party technologies, including AWS services like EMR, DynamoDB, Redshift, and S3. The Airflow documentation maintains the full list.
Infrastructure as Code
With AWS Data Pipeline, you can define all of your infrastructure, including the pipeline itself, with CloudFormation. The infrastructure to support Airflow can also be defined using CloudFormation (already developed by phData as part of Cloud Foundation), while the workflows themselves are defined in Python. Both can follow a standard CI/CD process. The phData model for deploying Airflow uses Elastic Container Service (ECS) and executes DAGs on explicitly permissioned worker nodes.
Data Pipeline is built on the AWS platform and has deep integration with IAM. All authorization permissions are defined within IAM as policies for Data Pipeline. This means all services that are a part of a workflow can contain IAM roles as well. It also means that, at the service level, those workflow services can be given explicit permissions to the data they’re allowed to access. Permissions can be defined at the pipeline level as well.
Airflow supports LDAP integration, which includes Role-Based Access Control at the UI level. Airflow also supports DAG-level authorization. The benefit of this system is that it integrates very well within the existing technologies—Active Directory, for example—that are used at most large companies for managing user access.
It’s also important to note that, with the way phData deploys an Airflow cluster, each worker node that handles the execution of a DAG can have explicit permissions defined so that each DAG only has access to specific data sets and AWS services. Each worker is deployed with only the permissions necessary for that particular DAG, which allows for a better implementation of a least privilege security model. Additionally, this means that you can limit which developers have access to a given workflow, which lowers the risk of a developer, whether inadvertently or deliberately, doing something they shouldn’t and affecting data quality or stability.
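To make the least privilege idea concrete, the sketch below builds an IAM policy document that a single DAG's worker role might carry, granting read-only access to one S3 prefix and nothing else. It is an illustration under assumptions, not phData's actual deployment: the bucket name, prefix, and the read_only_policy helper are hypothetical.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy allowing reads under one S3 prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, narrowed to the prefix.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are granted only beneath the prefix.
                "Sid": "ReadOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix for a DAG that only reads sales data.
print(json.dumps(read_only_policy("example-data-lake", "sales/"), indent=2))
```

A DAG that also writes results would need a separate, equally narrow statement for its output location rather than a broader wildcard.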
Recommendation: Apache Airflow (plus Cloud Foundation from phData)
phData recommends Airflow as the workflow and orchestration service for data pipelines, given its support for pulling together many different external dependencies into your ingestion process, including StreamSets and ETL pipelines within AWS. Additionally, Airflow’s ability to start jobs, handle multiple DAGs in parallel and backfill DAGs provides capabilities not found in AWS Data Pipeline. If you need integration with an existing scheduling solution, you can use the Airflow API over HTTP(S) to trigger and monitor DAGs. However, it’s important to note that Airflow’s scheduling capabilities certainly hold their own and can be used without bringing another scheduler into the mix.
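For the HTTP(S) integration mentioned above, an external scheduler could trigger a DAG run roughly like this. The sketch assumes the Airflow 2.x stable REST API with basic authentication enabled on the webserver; the base URL, credentials, DAG id, and helper names are hypothetical, and other Airflow versions or auth backends would need different paths or headers.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build the POST request that asks Airflow to start a new DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def trigger_dag(base_url, dag_id, username, password, conf=None):
    """Send the request and return Airflow's JSON description of the run."""
    request = build_trigger_request(base_url, dag_id, username, password, conf)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

The JSON returned by the trigger call includes the run's identifier, which the calling scheduler can use to poll the run's state through the same API.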
You can also use AWS Data Pipeline or Glue for ETL processes like moving data from S3 to EMR to execute Spark jobs and loading data into Redshift, but these processes could also be managed as individual steps by Airflow for better visibility into all of the details of a data pipeline. Airflow could also orchestrate, at a higher level, Glue jobs or Data Pipeline executions.
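One way an Airflow task could drive a Glue job as an individual step is sketched below: start the job through boto3, poll until it finishes, and raise on failure so that Airflow marks the task as failed. This is an assumption-laden illustration rather than a recommended implementation; the run_glue_job helper and the job name are hypothetical, and the Amazon provider package for Airflow offers purpose-built Glue operators that may be a better fit in practice.

```python
import time


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            return run_id
        if state in {"FAILED", "STOPPED", "TIMEOUT", "ERROR"}:
            # Raising makes the surrounding Airflow task fail visibly.
            raise RuntimeError(f"Glue job {job_name} run {run_id} ended as {state}")
        sleep(poll_seconds)


def glue_task_callable():
    # Intended as a PythonOperator callable; boto3 is imported lazily so
    # run_glue_job can be tested with a stub client.
    import boto3

    # "example-etl-job" is a hypothetical Glue job name.
    return run_glue_job(boto3.client("glue"), "example-etl-job")
```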
phData has built a tool called Cloud Foundation to help you manage the infrastructure. It reduces the amount of time you spend writing the infrastructure-as-code to deploy Airflow in a repeatable way.
Ultimately, use the workflow tool that makes the most sense for your pipeline requirements. Practically speaking, we’ve found that Data Pipeline works well for “quick and dirty” pipelines built on AWS services. However, as pipelines become more complex, our experience has been that error handling, debugging, and support for 3rd party tools and frameworks become more important, and we typically recommend Airflow. For more information, reach out to us to speak to our technical experts.