A Step-by-Step Guide for a Successful Migration
This document is intended to serve as a general roadmap for migrating existing Hadoop environments — including the Cloudera, Hortonworks, and MapR Hadoop distributions — to the Snowflake Data Cloud.
Each distribution contains an ecosystem of tools and technologies that requires careful analysis and expertise to determine the mapping that will ultimately best serve any given use case.
The Hadoop platform represents a broad suite of technology offerings, requiring architects and engineers to select the right tool for the job. As a result, businesses see limited ROI from their data due to poor query performance, difficult-to-use tools, and bulky execution engines in Hadoop.
Managing both on-premises and cloud-based Hadoop clusters requires a dedicated infrastructure administration team to handle upgrades, security patches, capacity planning, and more.
That means successfully using Hadoop requires deeply technical users and administrators, who are hard to find and expensive to retain.
That’s why Hadoop users are moving to Snowflake.
The Snowflake Data Cloud was designed with the cloud in mind, and allows its users to interface with the software without having to worry about the infrastructure it runs on or how to install it. Between the reduction in operational complexity, the pay-for-what-you-use pricing model, and the ability to isolate compute workloads, there are numerous ways to reduce the costs associated with performing analytical tasks. Snowflake offers many other benefits and capabilities as well.
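As a concrete illustration of workload isolation and pay-for-what-you-use pricing, the minimal sketch below (using the snowflake-connector-python package; account, credential, and warehouse names are placeholders, not recommendations) creates a dedicated warehouse that suspends itself when idle:

```python
# Minimal sketch: isolating an analytical workload on its own warehouse so its
# compute (and cost) stays separate from other teams. Requires the
# snowflake-connector-python package; all names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",   # e.g. xy12345.us-east-1
    user="your_user",
    password="your_password",
)
cur = conn.cursor()
try:
    # AUTO_SUSPEND stops billing after 60 idle seconds; AUTO_RESUME brings the
    # warehouse back transparently when the next query arrives.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS analytics_wh
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 60
          AUTO_RESUME = TRUE
          INITIALLY_SUSPENDED = TRUE
    """)
finally:
    cur.close()
    conn.close()
```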
Snowflake is built on public cloud infrastructure, and can be deployed to Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). When moving to Snowflake, there are some considerations regarding which cloud platform to use. Refer to the Snowflake documentation to assist with choosing the right platform for your organization. For the purposes of this migration plan, AWS technologies will be used when options are available.
phData is a Premier Service Partner and Snowflake’s Emerging Partner of the Year in 2020. If you’re looking to migrate, phData has the people, experience, and best practices to get it done right. We’ve completed nearly 1,000 data engineering and machine learning projects for our customers. So whether you are looking for architecture, strategy, tooling, automation recommendations, or execution, we’re here to help!
Snowflake is both a data lake and a cloud data warehouse for semi-structured and structured data. These are not entirely separate technology frameworks in direct competition with one another. A data lake generally contains the raw data required to build a data warehouse.
Data lakes have risen in popularity alongside big data technologies like Hadoop. Most Hadoop platforms start with a specific data warehousing use case in mind; however, in order to accomplish that use case, the Hadoop platform ends up becoming a data lake.
This is because data gets ingested into Hadoop from heterogeneous sources in many different formats, including structured, semi-structured, and unstructured data. Semi-structured and unstructured data often need to be reformatted in order to be consumed by a data warehouse.
Features
Data warehouses date back to the 1970s and became the norm in the late 1990s. Originally implemented on relational databases, they store tabular data that is easy to query and allows reports to be built on top of it.
Features
Best Practice – Snowflake is both a cloud data warehouse and a data lake, which is ideal for semi-structured and structured data. phData's recommendation for truly unstructured data (e.g., videos, pictures) is to store it in cloud object storage and then analyze it in Snowflake via an external read from the object storage layer. This allows you to make sense of the raw data and transform it into something a data warehouse can analyze and use.
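A minimal sketch of this pattern is shown below. It assumes an S3 bucket and a pre-configured storage integration; the stage, integration, and bucket names are placeholders.

```python
# Minimal sketch: exposing unstructured files in cloud object storage to
# Snowflake through an external stage. The storage integration (RAW_S3_INT),
# bucket, and stage names are placeholders.
STATEMENTS = [
    # Point a stage at the raw landing area and enable its directory table.
    """
    CREATE STAGE IF NOT EXISTS media_stage
      URL = 's3://example-bucket/media/'
      STORAGE_INTEGRATION = RAW_S3_INT
      DIRECTORY = (ENABLE = TRUE)
    """,
    # The directory table lists every staged object (path, size, last modified),
    # which downstream jobs can join against extracted metadata or model output.
    "SELECT relative_path, size, last_modified FROM DIRECTORY(@media_stage)",
]

def run_external_read(conn):
    """Execute the statements above using an open Snowflake connection."""
    cur = conn.cursor()
    try:
        for stmt in STATEMENTS:
            cur.execute(stmt)
        return cur.fetchall()
    finally:
        cur.close()
```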
Nine out of ten of phData’s customers use Hadoop for analytical workloads where the sources are mainly relational or tabular data, such as spreadsheet or delimited data. These analytical workloads are perfect targets to migrate to the Snowflake platform. Hive and Impala scripts can be easily migrated to Snowflake, and can be run either in Worksheets or via SnowSQL from the command line interface. phData also offers custom-built software to automate the translation of SQL dialects from Impala to Snowflake SQL.
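As a simple illustration of running a migrated query (this is not phData's translation tooling; the table, query, and warehouse names are hypothetical), a translated Hive or Impala statement can be executed against Snowflake with the Python connector:

```python
# Minimal sketch: executing a migrated Hive/Impala query in Snowflake. The
# query and object names are illustrative; engine-specific statements such as
# COMPUTE STATS, REFRESH, or Impala query hints are dropped during translation
# because Snowflake does not need them.
MIGRATED_QUERY = """
    SELECT region, SUM(sales_amount) AS total_sales
    FROM analytics.public.daily_sales   -- formerly a Hive/Impala table over HDFS
    WHERE sale_date >= DATEADD(day, -30, CURRENT_DATE)
    GROUP BY region
    ORDER BY total_sales DESC
"""

def run_migrated_query(conn):
    cur = conn.cursor()
    try:
        cur.execute("USE WAREHOUSE analytics_wh")
        cur.execute(MIGRATED_QUERY)
        return cur.fetchall()
    finally:
        cur.close()
```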
Snowflake offers various connectors between Snowflake and third-party tools and languages. One of these is a Spark Connector, which allows Spark applications to read from Snowflake into a DataFrame, or to write the contents of a DataFrame to a table within Snowflake.
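A minimal PySpark sketch of both directions is shown below; it assumes the spark-snowflake connector and Snowflake JDBC driver are available on the classpath, and all connection values and table names are placeholders.

```python
# Minimal sketch: reading from and writing to Snowflake with the Spark connector.
# Connection values and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-example").getOrCreate()

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"
sf_options = {
    "sfURL": "your_account.snowflakecomputing.com",
    "sfUser": "your_user",
    "sfPassword": "your_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ANALYTICS_WH",
}

# Read a Snowflake table into a Spark DataFrame.
df = (spark.read.format(SNOWFLAKE_SOURCE)
      .options(**sf_options)
      .option("dbtable", "DAILY_SALES")
      .load())

# Write the (possibly transformed) DataFrame back to a Snowflake table.
(df.write.format(SNOWFLAKE_SOURCE)
   .options(**sf_options)
   .option("dbtable", "DAILY_SALES_COPY")
   .mode("overwrite")
   .save())
```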
Please note that Apache Spark applications will need to be reviewed and fully understood before migrations can occur. If SparkSQL is involved, the SQL code will be able to run in Snowflake. phData engineers are trained in converting apps like these; however, appropriate A/B testing will still need to be conducted.
On the other hand, online use cases that require sub-second lookups by key might instead require another tool in the cloud ecosystem, depending on the customer's choice of vendor.
HBase and MapR DB, for example, are both distributed, column-oriented databases built on top of the Hadoop file system. HBase serves a similar purpose to MapR DB and has a very similar migration path. Use cases that are primarily OLTP-driven can be converted to streaming applications that use Azure CosmosDB, DynamoDB, or Google BigQuery for storage, while analytics use cases can be migrated to streaming applications that land data in Snowflake. Snowflake has announced a new search feature that allows quick lookups for given keys and will eventually be able to replace the aforementioned technologies. Until then, phData can assist with a migration from HBase/MapR DB to CosmosDB/DynamoDB/BigQuery.
| Hadoop Technology | Snowflake Native | Risk | Verification | Cloud Technology |
|---|---|---|---|---|
| HBase | Consult phData; investigate Snowflake Search | High | Will it meet SLAs? | DynamoDB, CosmosDB, Google BigQuery |
| Hive | Yes, with SQL Transformation | Low | | Snowflake |
| Impala | Yes, with SQL Transformation | Low | | Snowflake |
| Kafka | Yes | Medium | Kafka Connect with variant fields | Apache Kafka, Kinesis, Azure Event Hubs |
| Kudu + Impala | Yes, with SQL Transformation | Low | | Snowflake |
| Solr | Consult phData; investigate Snowflake Search | High | | ElasticSearch, Azure Search, Dataproc Solr |
| Spark | No, but a Snowflake Connector exists | Medium | Dependent on the workload | AWS EMR, AWS Glue, Azure DataFlow, GCP Dataproc |
| SparkSQL | Streams and Tasks with SQL Transformation | Low | | AWS EMR, AWS Glue, Azure DataFlow, GCP Dataproc |
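For the HBase row above, the sketch below illustrates the kind of sub-second, key-based lookup a DynamoDB-backed replacement would serve (using boto3; the table and attribute names are hypothetical):

```python
# Minimal sketch: a sub-second lookup by key against DynamoDB, the kind of
# online access pattern that HBase/MapR DB typically served. Table, region,
# and attribute names are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("device_latest_state")

def get_latest_state(device_id):
    # GetItem is a single-key read and typically returns in single-digit milliseconds.
    response = table.get_item(Key={"device_id": device_id})
    return response.get("Item")

print(get_latest_state("sensor-0001"))
```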
The goal of the migration from Hadoop distributions to Snowflake is to offer customers a straightforward path to becoming cloud-native and to save on licensing and hardware costs. Ultimately, migration plans and timelines can vary significantly, depending on the size of the organization and the complexity of the use cases.
But this section will outline a general, top-level strategy (applicable to most organizations) for planning and executing a migration over three distinct phases: Discovery, Implementation, and Validation.
The goal of this phase is to gather all the necessary context and background information about the current Hadoop environment, and to identify all the relevant dependencies — including tools and technologies, data sources, use cases, resources, integrations, and service level agreements. The outputs from these investigations will be critical for informing the final migration plan.
Creating a complete inventory of all tools and technologies used in the current Hadoop environment is critical to creating a successful migration plan. This inventory should include tools native to the Hadoop environment, as well as any relevant third-party vendor software.
From this inventory, phData will reverse engineer a current-state architecture detailing the organization’s existing Hadoop landscape. Then, working with the customer, we will identify whether each tool is still needed in the go-forward strategy, or whether it can be replaced by the capabilities and benefits built into Snowflake. From there, those tools that will be retained have to go through an evaluation process in order to fully understand how they work in a cloud-native Snowflake environment.
This technology inventory should be categorized as follows:
The actual data sources are what deliver results and provide value. That makes them just as important as the tools and technology inventory.
These sources may be external, such as databases, ERP and CRM systems, files, and streaming or event data, or they may be internal sources used for integrations feeding data out of Hadoop.
The discovery phase is an ideal time to identify any data sets that might be deprecated and don’t need to move to the cloud. You should identify each data source to ensure it has an owner or subject matter expert who can be available to answer questions and provide a detailed understanding of data sets.
phData will work with these data experts to complete the following list of questions:
General Data Source Questions
Relational Database Questions
MapR Streams or Kafka Questions
Delimited File Questions
API Integration Questions
The most important step of the discovery phase is to fully understand the applications running in your current Hadoop environment. This will better define the amount of effort it will take to migrate your environment to the cloud.
Because Hadoop comprises a variety of tools and services that can make up an application, there are some important distinctions to make about how the application is deployed. phData will work with you and each application owner to understand the following:
Internal stakeholders and existing end users tend to be concerned about change. Identifying the users of the current platform and providing them with the necessary training on Snowflake is crucial to accelerating platform adoption. Understanding the tools and processes in their daily workflows and making alternatives available will help ensure these users feel comfortable using Snowflake.
It’s important to inventory the applications accessing the Hadoop environment in order to ensure that the cloud-native Snowflake environment can serve data to them. In many cases, these applications have complex access patterns for authentication and authorization that will need to be evaluated. However, external applications that connect using JDBC or ODBC can use Snowflake's drivers and should work out of the box.
As mentioned before, there are also Connectors available to integrate with Python, Spark, Kafka, and more.
Finally, an inventory of current security and governance configurations must be conducted. Because of the different architectures that each Hadoop distribution puts forth, the scope could broaden to include more tools, such as Sentry, Ranger, and HDFS ACLs. In addition, the use of Kerberos, Active Directory, encryption-at-rest and encryption-in-transit must all be taken into account. And for MapR, filesystem permissions, ACLs and Access Control Expressions must likewise be assessed.
phData has developed an automation tool called Tram, which generates databases, schemas, roles and security grants in Snowflake. Tram is able to integrate with on-premises Active Directory as well as Azure AD; alternatively, it can also create users directly in Snowflake, and this latter method can be used within a git workflow.
The implementation phase focuses on executing the process of migrating business applications into Snowflake from the existing Hadoop environment. Based on the inputs gathered during the discovery phase, a list of data sources, applications, and tools will be selected and prioritized for migration.
From there, phData will staff teams of architects and data engineers (according to the needs of the project) to work with the customer throughout the migration effort. This phase will be the longest and most technical to execute.
Using the Data Source inventory, engineers will move data stored in HDFS to the cloud vendor’s storage layer (Blob Storage or S3). The same information architecture will be applied in the new cloud storage file system, and the resulting folder and file structure should be a one-to-one match with that of the HDFS file system.
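One common way to perform that copy while preserving the directory layout is Hadoop's own DistCp pointed at the cloud bucket through the s3a connector. The sketch below is illustrative only; the paths, bucket name, and credential setup are placeholders.

```python
# Minimal sketch: copying an HDFS directory tree to S3 with DistCp so the
# folder and file layout carries over one-to-one. Paths, bucket, and the
# credential mechanism (e.g., an instance profile) are placeholders.
import subprocess

SOURCE = "hdfs://namenode:8020/data/warehouse/sales"
TARGET = "s3a://example-bucket/data/warehouse/sales"

subprocess.run(
    [
        "hadoop", "distcp",
        "-update",   # only copy files that are missing or changed
        "-p",        # preserve file attributes where supported
        SOURCE,
        TARGET,
    ],
    check=True,
)
```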
Since use cases and implementations vary greatly from customer to customer, phData does not see any generalized quick-win migration tool as being feasible for the implementation phase. However, our engineers do build code and tooling to be used between migrations as patterns develop and new opportunities arise. We then provide it to the broader Snowflake community.
For example, we’ve developed a tool called SQLMorph to instantly translate Hadoop SQL to Snowflake SQL, which eliminates a usually time-consuming, error-prone, and highly manual process.
A significant challenge to the storage migration for both Cloudera and MapR implementations will be migrating role-based access controls to the new cloud storage system. The configuration of role-based access differs by cloud vendor: Azure Blob Storage uses Active Directory groups and users to grant access to blob containers, whereas AWS uses IAM policies to configure access to S3 buckets. Accordingly, a tool will need to be developed to migrate these policies for each cloud vendor.
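As an illustration of the AWS side, the sketch below creates a read-only IAM policy scoped to a single S3 prefix, roughly mirroring a read ACL on the matching HDFS directory. The bucket, prefix, and policy names are hypothetical.

```python
# Minimal sketch: an IAM policy granting read-only access to one S3 prefix,
# roughly equivalent to a read ACL on the matching HDFS directory.
# Bucket, prefix, and policy names are placeholders.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListSalesPrefix",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["data/warehouse/sales/*"]}},
        },
        {
            "Sid": "ReadSalesObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/data/warehouse/sales/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="sales-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```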
The final phase will be to validate the outcome of the migration from Hadoop to Snowflake. This step should be performed using traditional A/B testing. For a time, the customer will need to continue running the existing Hadoop implementation alongside the new cloud-native Snowflake offering. Meanwhile, validation scripts and processes will be developed to ensure the results delivered in Hadoop match those in Snowflake. Once all checks have been performed and an application or use case has been cleared, it can be shut down in Hadoop.
After this phase, you should be able to completely remove or repurpose your Hadoop infrastructure.
The diagram below outlines our validation process.
The first validation point is ensuring the data in your existing Hadoop environment matches the data in your new Snowflake environment. To do this, we will run a query in Hadoop, run the same query in Snowflake, and ensure the data is an exact match. We use a suite of tools to automate this process.
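The sketch below shows the shape of such a check. It is an illustration rather than phData's validation suite, and it assumes the impyla package for the Hadoop side and the Snowflake Python connector; hosts, credentials, and the query itself are placeholders.

```python
# Minimal sketch: run the same aggregate in Impala and Snowflake and compare.
# Hosts, credentials, and the query are placeholders; real checks also
# normalize numeric types (e.g., Decimal vs. float) before comparing.
from impala.dbapi import connect as impala_connect
import snowflake.connector

CHECK_QUERY = "SELECT COUNT(*), SUM(sales_amount) FROM daily_sales"

def fetch_one(cursor, query):
    cursor.execute(query)
    return cursor.fetchone()

impala_conn = impala_connect(host="impala-daemon.example.com", port=21050)
sf_conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    database="ANALYTICS", schema="PUBLIC", warehouse="ANALYTICS_WH",
)

try:
    hadoop_result = fetch_one(impala_conn.cursor(), CHECK_QUERY)
    snowflake_result = fetch_one(sf_conn.cursor(), CHECK_QUERY)
    assert hadoop_result == snowflake_result, (
        f"Mismatch: Hadoop={hadoop_result} Snowflake={snowflake_result}"
    )
    print("Validation passed:", hadoop_result)
finally:
    impala_conn.close()
    sf_conn.close()
```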
Once the data is validated, we move to our technical validation phase to ensure the Snowflake environment has the required integrations that your development teams expect. This includes checking that the right CI/CD, source control, and project management processes are in place and that all tools that need to consume data are pointed to the right location.
Finally, we’ll complete a business validation. This is meant for you to greenlight the work we’ve done and confirm that it meets both your expectations and solves the critical business needs we outlined at the beginning of the project.
A leading manufacturer of mining and earth-moving equipment sought to increase top line revenue through new products and services, including smart-connected equipment and post-purchase proactive maintenance services. To accomplish this, they needed to transform their existing sensor-based analytics platform into a more efficient, centralized IoT data solution.
The manufacturer knew they wanted to take advantage of the latest cloud-native technologies. But they needed help choosing those technologies, executing a successful migration from their existing Hadoop solution, and ensuring the new solution could handle the high volume of IoT data transmitted daily from their equipment sensors.
phData designed and built a new cloud-native solution for IoT, based on Snowflake, as well as Spark, Kafka, and Microsoft Azure, with automated infrastructure provisioning using infrastructure-as-code, CI/CD for automated deployment, and an architecture that supported dynamic scale and fault tolerance. phData then helped the manufacturer successfully migrate their application from Hadoop to Snowflake, validating the new platform's in-production viability.
The manufacturer transformed what started out as a small web application into a unified IoT data store, analytics, and visualization platform — designed and optimized by phData to maximize the value of Snowflake’s cloud-native architecture.
By the numbers:
Given the complexity of Hadoop migrations, consider seeking help from phData with expertise and years of hands-on experience migrating Big Data. phData specializes in data, ML, and long-term success with Snowflake, and is proud to have been named Snowflake’s 2020 Emerging Partner of the Year.
Behind every successful migration lies a trusted checklist to help organize and streamline the process. Kickstart your migration today by downloading our detailed Hadoop to Snowflake Migration Checklist.