At any large enterprise, there is typically a wide variety of applications in use. The phData approach to planning a Cloudera migration starts with an inventory of applications and tools, which provides a deep-dive analysis of the types of systems and processes that will be needed within AWS to support existing requirements. This can be a challenge: engineers move to other teams or companies, and systems get left behind. Determining everything that is currently running in an existing data center is extremely difficult, yet it is one of the most important pieces to understand.
Once an inventory has been taken of the existing environment, the next step is to determine which AWS-native or third-party technologies we can utilize within AWS to support the existing applications. This can also be a challenge because AWS native data and analytics services don’t cover everything that Cloudera currently offers. One example is the role-based access control provided by Apache Sentry. Another is Kudu’s support for analytics use cases: although there is some alignment between Kudu and Redshift, there are quite a few gaps when comparing the two.
phData is the leader in end-to-end services for machine learning and data analytics. The customer’s use case, migrating from an on-premises Cloudera cluster to AWS, is one that phData has solved for many times, so phData could bring valuable insights into the architectural and implementation challenges that come with a large data platform migration. The customer saw the vast experience that phData brings to developing solutions in the data technology space and chose phData to deliver a new data platform to fit their current and future needs.
The customer had many existing applications that utilize data within Cloudera today. The new solution needed to meet those requirements and provide enough added value to prove a business case within the company. The solution solved a number of different challenges, including user authentication, authorization, and other general security concerns; scheduling Spark jobs; and providing visibility into data on the platform.
EMR was chosen because of the customer’s familiarity with running Hadoop within the data center. They also had a requirement to execute Spark applications, which were primarily used for transforming data and extracting insights from it. Existing Spark applications running on-premises in Cloudera could easily be migrated to execute within the EMR environment.
Another consideration in picking EMR was the variety of applications running within the on-premises Hadoop cluster. Although it would be easy to simply execute the Spark applications within AWS Glue, existing applications relied on a number of additional services, including Sqoop, Pig, and Hive. The deciding factor for choosing EMR was that the cluster and underlying compute instances would be managed by AWS.
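To make this concrete, the request below is a minimal sketch of launching an EMR cluster with the services the migrated applications rely on, using boto3’s `run_job_flow`. The cluster name, release label, instance counts, and roles are illustrative assumptions, not the customer’s actual values.

```python
# Sketch: an EMR "run job flow" request covering Spark plus the supporting
# services (Hive, Pig, Sqoop) the existing applications depend on.
# All names and sizes here are hypothetical, for illustration only.

EMR_CLUSTER_CONFIG = {
    "Name": "migration-cluster",        # hypothetical cluster name
    "ReleaseLabel": "emr-5.30.0",       # assumed EMR release
    "Applications": [
        {"Name": "Spark"},
        {"Name": "Hive"},
        {"Name": "Pig"},
        {"Name": "Sqoop"},
    ],
    "Instances": {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.xlarge", "InstanceCount": 3},
        ],
        # Long-running cluster: keep it alive with no steps queued.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}


def launch_cluster():
    """Submit the request; requires AWS credentials, so it is not called here."""
    import boto3
    emr = boto3.client("emr")
    return emr.run_job_flow(**EMR_CLUSTER_CONFIG)["JobFlowId"]
```

Keeping the whole request in one dictionary makes it easy to review and version alongside the rest of the infrastructure code.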
The customer utilized many different methods for data ingestion and processing workflows within Cloudera. Some teams used automated methods, while others used manual methods like publishing CSV documents directly to HDFS. The approach taken for the migration was to automate as much as possible, so phData built a solid approach to deploying and utilizing Apache Airflow for data workflow orchestration.
The approach used for applications migrating from Cloudera to the new system was to use AWS Data Pipeline to manipulate and move the data within the AWS ecosystem. Then, we took advantage of Airflow’s orchestration capabilities to manage end-to-end workflows and the integration of third-party services such as StreamSets. This provided an easy solution for building simple ETL processes using Data Pipeline and EMR jobs, as well as managing complex scheduling requirements using Apache Airflow.
It’s important to note that we opted to use a long-running EMR cluster to manage the varying workflows required by existing applications and the requirements from the data science teams to run ad-hoc Spark jobs within EMR. However, since data was stored in S3 and not on local storage within the EMR cluster, scaling the cluster manually (using CloudFormation) was fairly straightforward. The cluster itself is made up of m4.xlarge instances, which meet the initial computing demands.
Managing Cloudera in the data center is very different from managing cloud-native solutions; in many cases, there are plenty of manual processes involved. Because of this, phData developed two distinct solutions to solve the management of EMR and supporting technologies in the cloud. phData believes that automation lets you focus on more strategic work for your business and can reduce operational costs. For this customer, phData utilized CloudFormation and Jenkins to automate the deployment of all of the data platform infrastructure. Engineers push CloudFormation changes to Git, and those changes get deployed using Jenkins, which updates, creates, or deletes the various CloudFormation stacks. The various components were split into multiple CloudFormation templates. This allowed for a separation of duties when some engineers were working on Redshift while others were working on EMR.
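The core of that Jenkins-driven deployment step can be sketched as follows: given the stacks already in CloudFormation, decide whether each template should be created or updated, then apply it with boto3. Stack names and the surrounding pipeline wiring are illustrative assumptions.

```python
# Sketch of the deploy step Jenkins would run per CloudFormation template.
# Stack names are hypothetical examples.

def plan_action(stack_name, existing_stacks):
    """Pure helper: 'create' for new stacks, 'update' for existing ones."""
    return "update" if stack_name in existing_stacks else "create"


def deploy_stack(stack_name, template_body):
    """Apply one template; needs AWS credentials, so it is not called here."""
    import boto3
    cfn = boto3.client("cloudformation")
    existing = {s["StackName"] for s in cfn.describe_stacks()["Stacks"]}
    if plan_action(stack_name, existing) == "create":
        cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
    else:
        cfn.update_stack(StackName=stack_name, TemplateBody=template_body)
```

Splitting the decision logic into a pure function keeps it easy to test outside of a Jenkins run.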
With Cloudera, the customer has a tightly coupled integration between Apache Sentry and Active Directory, which currently houses all developer and data scientist permissions, including which data sets the respective groups of users can access and which services and tools they can use. Given how they provide access to developers in AWS, IAM offered a deeply integrated solution for granting additional access to EMR. IAM roles were developed to provide specific access to EMR and S3, limiting the blast radius of security issues in the event that any account was compromised. These roles were directly integrated with specific users in Active Directory and utilized with AWS Single Sign-On.
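A minimal sketch of such a scoped policy document is shown below: EMR read actions plus access to a single S3 bucket. The bucket name and the exact action list are illustrative assumptions, not the customer’s actual policy.

```python
# Sketch of a least-privilege IAM policy document for one role: EMR
# read-only actions plus one team bucket. Names are hypothetical.

TEAM_BUCKET = "example-data-lake"  # hypothetical bucket name

EMR_S3_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EmrReadOnly",
            "Effect": "Allow",
            "Action": ["elasticmapreduce:Describe*", "elasticmapreduce:List*"],
            "Resource": "*",
        },
        {
            "Sid": "TeamBucketAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            # Restricting Resource to one bucket keeps the blast radius
            # small if a credential is ever compromised.
            "Resource": [
                f"arn:aws:s3:::{TEAM_BUCKET}",
                f"arn:aws:s3:::{TEAM_BUCKET}/*",
            ],
        },
    ],
}
```

One such policy per team-specific role keeps S3 access mapped cleanly onto the Active Directory group structure.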
Cost was a key consideration for the migration, but it wasn’t the primary factor because the new platform needed to meet the existing platform requirements.
Some analysis was done on utilizing reserved instances versus spot instances. For batch use cases, spot instances will be used. This was a fairly easy decision because most of the workloads are not time-sensitive and can execute overnight. Should spot costs move out of the defined price range, falling back to EC2 reserved instances is easy to fit into the pipeline.
Although there are not currently any streaming use cases, reserved instances would be used for them. It’s important to note that the customer is currently taking advantage of AWS Savings Plans to reduce EC2 costs, so reserved instances are not actually being utilized.
The customer needed a scalable data transfer process to support many different data sources from on-premises systems and many different methods for ingesting data into the current Cloudera cluster. Ingestion was done via StreamSets, Sqoop, and even manual processes. So, S3 and StreamSets were chosen to support the existing application requirements.
S3 can support both heavy bursts of data arriving in rapid succession and a continuous flow of data over time. StreamSets was chosen as a solution because the customer already has many other data pipelines utilizing StreamSets and wanted to continue with the processes already defined in that technology.
With regard to actually migrating data, S3DistCp was chosen to handle bulk data transfers from the on-premises HDFS into S3. The customer does not currently have any streaming data requirements, so these were not taken into account.
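One common way to run S3DistCp is as an EMR step via `command-runner.jar`; the sketch below builds such a step and submits it with boto3’s `add_job_flow_steps`. The source path, destination bucket, and cluster id are illustrative placeholders.

```python
# Sketch: an EMR step that runs s3-dist-cp to bulk-copy data from the
# cluster's HDFS into S3. Paths and names are hypothetical examples.

S3DISTCP_STEP = {
    "Name": "bulk-copy-hdfs-to-s3",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src", "hdfs:///data/warehouse",             # example source
            "--dest", "s3://example-data-lake/warehouse",  # example destination
        ],
    },
}


def submit_copy(cluster_id):
    """Attach the step to a running cluster; not called here (needs AWS)."""
    import boto3
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[S3DISTCP_STEP])
```

Because the step is just data, the same definition can be reused per table or per directory during the bulk migration.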
S3 was chosen as the data storage service because it can scale to meet the demand of any number of data workloads. Compared to managing storage on HDFS within EMR, S3 is certainly more cost-effective for longer-term storage. The company currently takes advantage of Impala SQL and Hive on its existing Cloudera cluster; for this use case, Redshift was chosen as the data query solution. phData has developed a solution called SQLMorph that automatically handles conversion of Impala to Redshift and other SQL dialects, allowing for the easy migration of applications utilizing Impala.
The metadata storage approach within Cloudera on-premises was to use Hive. The approach that was taken for the migration of schema and metadata information was to utilize Hive’s SHOW CREATE TABLE statements. The schema information was extracted using these statements and modified to support S3.
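The rewrite half of that step can be sketched as a small function: take the DDL emitted by Hive’s `SHOW CREATE TABLE` and point its `LOCATION` clause at an S3 prefix instead of HDFS. The bucket name, sample DDL, and regex are illustrative, assuming the common single-quoted `LOCATION` form.

```python
# Sketch of the schema-migration rewrite: modify SHOW CREATE TABLE output
# so the table LOCATION targets S3 instead of HDFS. The bucket and the
# sample statement are hypothetical.
import re


def rewrite_location(ddl, bucket):
    """Replace a LOCATION 'hdfs://host:port/path' clause with an S3 path."""
    return re.sub(
        r"LOCATION\s+'hdfs://[^/']*(/[^']*)'",
        rf"LOCATION 's3://{bucket}\1'",
        ddl,
    )


sample = (
    "CREATE EXTERNAL TABLE sales (id INT)\n"
    "LOCATION 'hdfs://namenode:8020/warehouse/sales'"
)
print(rewrite_location(sample, "example-data-lake"))
# LOCATION becomes 's3://example-data-lake/warehouse/sales'
```

In practice this would run over the DDL extracted for every table, with the output replayed against the new catalog.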
Glue Catalog was used as the storage mechanism. Glue Catalog is, however, missing some fundamental requirements within the data cataloging space, such as data lineage. These requirements will be addressed at a later date, as they were not urgent for the initial migration. As a stop-gap solution, CloudTrail logs satisfy the audit and specific lineage query requirements.