April 26, 2024

How to Migrate Hive Tables From Hadoop Environment to Snowflake Using Spark Job

By Rajib Prasad

Seamless data transfer between different platforms is crucial for effective data management and analytics. One common scenario that we’ve helped many clients with involves migrating data from Hive tables in a Hadoop environment to the Snowflake Data Cloud.

In this blog, we’ll explore how to accomplish this task using the Snowflake-Spark connector.

Overview

By harnessing the power of the Snowflake-Spark connector, you’ll learn how to transfer your data efficiently while ensuring compatibility and reliability. Whether you’re a data engineer, analyst, or hobbyist, this blog will equip you with the knowledge and tools to confidently make this migration.

If you don’t have a Spark environment set up in your Cloudera environment, you can easily set up a Dataproc cluster on Google Cloud Platform (GCP) or an EMR cluster on AWS to follow along hands-on.

Migration Architecture

Before we dive into the technical details, let’s look at the high-level architecture of our data migration process:

We write a PySpark job and run it in the Spark environment to lift and shift the Hive tables directly from a Hadoop cluster to Snowflake, using the Spark-Snowflake and Snowflake JDBC connectors.

Our PySpark job reads the data from a particular Hive table in the Hive database into a DataFrame. It then connects to our Snowflake account using the supplied credentials and loads that DataFrame directly into a Snowflake table.
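
In code form, the job boils down to the condensed sketch below; the full, runnable version is built up step by step in the rest of this post, and all names here are placeholders (sfOptions stands for the Snowflake connection settings covered later).

    from pyspark.sql import SparkSession

    # Read the Hive table into a DataFrame (Hive support enabled on the session)
    spark = SparkSession.builder.appName("HiveToSnowflake").enableHiveSupport().getOrCreate()
    df = spark.sql("SELECT * FROM <database>.<table>")

    # Write the DataFrame into the target Snowflake table via the Spark-Snowflake connector;
    # sfOptions is the dictionary of Snowflake connection settings shown later in this post
    df.write.format("snowflake").options(**sfOptions).option("dbtable", "<table_name>").mode("append").save()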

How to Load Hive Table Data into Snowflake Tables

Let’s now walk through the step-by-step process of migrating Hive tables to Snowflake using Spark.

Step 1: Spark Environment Setup for GCP/AWS Cloud (optional)

If you don’t have a Spark environment set up in your Cloudera environment, you can easily set up a Dataproc cluster on Google Cloud Platform (GCP) or an EMR cluster on AWS using the following simple steps:

Spark Environment Setup on GCP Dataproc

a. Navigate to GCP Console:
Access the Google Cloud Console 

b. Create a Dataproc Cluster:

  • Click on Navigation Menu > Dataproc > Clusters.

  • Create a new project or select an existing one.

  • Click Create Cluster.

  • Configure cluster settings, including enabling Hive and other required components.

  • Click Create to initiate the Dataproc cluster creation.

c. Access Dataproc Cluster:
Once the cluster is created, you can access it through the GCP Console or by connecting to the cluster using SSH.

Spark Environment Setup on EMR Cluster

a. You can easily set up an EMR cluster in your AWS account using the following simple steps:

  • Sign in to AWS Management Console and navigate to the EMR service.

  • Click Create cluster and choose software (Hadoop, Hive, Spark, Sqoop) and configuration (instance types, node count).

  • Configure security (EC2 key pair).

  • Review settings and launch the cluster.

  • Access the cluster via SSH and submit jobs or queries.

  • Terminate the cluster when finished to avoid unnecessary costs.

b. Connect to the EMR cluster over SSH using the AWS CLI:

    aws emr ssh --cluster-id <cluster-id> --key-pair-file <file-name>

Note: If you receive this error: ssh: connect to host ec2-*******.ap-southeast-2.compute.amazonaws.com port 22: Connection timed out

Follow these steps to update the security group:

  • Go to the EC2 dashboard.

  • Select Security Groups.

  • Find ElasticMapReduce-master.

  • Edit inbound rules.

  • Add a rule for SSH (port 22) from your desired IP range (e.g., your own IP address; 0.0.0.0/0 opens access to everyone and is best reserved for short-lived test clusters).

  • Save changes.

c. Run hive in the EMR shell to test your environment:

    [hadoop@ip-172-*1-*1-***~]$ hive

Note: We will move forward with GCP Dataproc to explain our data migration process.

Step 2: Hive Table Creation and Data Load

Step 2.1: Define Hive Table

In Hive, create the source table. Use the following HiveQL command as a reference:

    CREATE TABLE <database>.<table>(
      -- Define columns based on your schema
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<separator>' STORED AS TEXTFILE;

The <separator> is the character that separates the fields in each row. For example, if your data is comma-separated (CSV), replace <separator> with a comma (,).

Step 2.2: Upload a CSV file

Upload the CSV file you want to load into your Hive table from your local machine to a GCS bucket.
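
If you would rather script the upload than use the console or gsutil, here is a minimal sketch using the google-cloud-storage client library; the bucket name, object path, and local file path are placeholders, and credentials are assumed to come from your environment (for example, the Dataproc service account).

    from google.cloud import storage

    # Upload a local CSV file to a GCS bucket (placeholder names throughout)
    client = storage.Client()
    bucket = client.bucket("<gcs_bucket_name>")
    blob = bucket.blob("data/<file_name>.csv")
    blob.upload_from_filename("/path/to/local/<file_name>.csv")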

Step 2.3: Populate Hive Table

Load data into the Hive table using an appropriate method, such as an INSERT statement or an external data load. For example:

    LOAD DATA INPATH '<gcs_bucket_file_path>' INTO TABLE <database>.<table>;

Step 3: Spark Environment Setup

Step 3.1: Install PySpark

PySpark is usually pre-installed on Dataproc clusters. To confirm that it is available, run pyspark or pyspark --version. If it’s not installed, you can install it with the following command:

    pip3 install pyspark

Step 3.2: Spark-Snowflake Connector Dependencies Setup

Because we are using Python with Spark, there is no need to download or set up the Snowflake connector JARs manually. Instead, include the required Snowflake dependencies with the --packages option when launching PySpark; Spark resolves them from Maven automatically. For example, you can start the PySpark shell like this:

    pyspark --packages net.snowflake:snowflake-jdbc:<snowflake_jdbc_version>,net.snowflake:spark-snowflake_<scala_version>:<spark_snowflake_connector_version>-spark_<spark_version>

Step 4: Snowflake Table Creation

Define Snowflake Table:

In Snowflake, create a SQL worksheet and run the following commands to create a database, schema, and table:

    CREATE DATABASE <database_name>;
    USE DATABASE <database_name>;
    CREATE SCHEMA <schema_name>;
    CREATE TABLE <database_name>.<schema_name>.<table_name>(
      -- Define columns based on your schema
    );

Note: The Snowflake table structure should match the structure of the source Hive table.
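
One way to double-check the source structure before writing the Snowflake DDL is to describe the Hive table from the PySpark shell. A minimal sketch, assuming you substitute your own database and table names:

    # Print the Hive table's columns and types so the Snowflake DDL can mirror them
    spark.sql("DESCRIBE <database>.<table>").show(truncate=False)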

Step 5: Lift and Shift Data from Hive to Snowflake

Step 5.1: Open the PySpark Shell

Run the command below to open a PySpark shell:

    pyspark --packages net.snowflake:snowflake-jdbc:<version>,net.snowflake:spark-snowflake_<scala_version>:<version>-spark_<spark_version>

Step 5.2: Initialize a Spark Session for Hive and Read the Hive Table Data

In your PySpark script, create a Spark session with Hive support and read the data from the Hive table.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("<AppName>").enableHiveSupport().getOrCreate()
    result = spark.sql("SELECT * FROM <database>.<table>")
    result.show()

Step 5.3: Configure the Snowflake Connection

Initialize the Snowflake connection parameters in the script, replacing the placeholders with your actual Snowflake account details.

    sfOptions = {
      "sfURL": "<snowflake_url>",
      "sfUser": "<snowflake_user_name>",
      "sfPassword": "<snowflake_password>",
      "sfRole": "<snowflake_role>",
      "sfDatabase": "<database>",
      "sfSchema": "<schema>",
      "sfWarehouse": "<warehouse>"
    }
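
As a side note, you may prefer not to hardcode the password in the script. Here is a minimal sketch of the same dictionary, assuming the password is supplied through an environment variable named SNOWFLAKE_PASSWORD (an assumed name, not something the connector requires):

    import os

    # Same options as above, but the password is read from an environment variable
    # (SNOWFLAKE_PASSWORD is an assumed name; set it however your environment manages secrets)
    sfOptions = {
      "sfURL": "<snowflake_url>",
      "sfUser": "<snowflake_user_name>",
      "sfPassword": os.environ["SNOWFLAKE_PASSWORD"],
      "sfRole": "<snowflake_role>",
      "sfDatabase": "<database>",
      "sfSchema": "<schema>",
      "sfWarehouse": "<warehouse>"
    }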

				
			

Step 5.4: Write Data to Snowflake

Use the Snowflake connector to write the data from the Hive table to the Snowflake table. Note that the first argument to option() is the literal key dbtable; only the value is your target table name.

    result.write.format("snowflake").options(**sfOptions).option("dbtable", "<table_name>").mode("append").save()

Step 5.5: Run a SELECT Query on the Snowflake Table to Check the Loaded Data

    SELECT * FROM <database_name>.<schema_name>.<table_name>;

If the query runs successfully and returns the loaded data, we are ready to package this logic into a Spark job.
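
If you also want to validate from the Spark side, the sketch below reads the Snowflake table back through the same connector and compares row counts with the Hive DataFrame (result and sfOptions are the objects defined in the previous steps; the table name is a placeholder).

    # Read the Snowflake table back through the connector
    sf_df = (spark.read.format("snowflake")
             .options(**sfOptions)
             .option("dbtable", "<table_name>")
             .load())

    # Compare row counts between the Hive source and the Snowflake target
    print("Hive rows:", result.count(), "| Snowflake rows:", sf_df.count())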

Step 6: Spark Job Creation & Execution

Step 6.1: Create a .py file with the code snippet below

				
					from pyspark.sql import SparkSession

# Create a Spark session with Hive support
spark = SparkSession.builder.appName("HiveExample").enableHiveSupport().getOrCreate()

# Use Hive SQL to query the table
result = spark.sql("SELECT * FROM myDatabase.payment_table")

# Show the result
result.show()

#Initialize the parameters to access the Snowflake Account.
sfOptions = {
  "sfURL": "https://************.snowflakecomputing.com",
  "sfUser": "********",
  "sfPassword" : "********",
  "sfRole": "<Role>",
  "sfDatabase": "<Snowflake_database_name>",
  "sfSchema": "<Snowflake_schema_name>",
  "sfWarehouse": "<Snowflake_warehouse_name>"
}


result.write.format("snowflake").options(**sfOptions).option("dbtable", "payment_table").mode("append").save()

spark.stop()


				
			

Step 6.2: Execute the Spark Job

To execute the .py file, run it with spark-submit, supplying the required packages and configurations:

    spark-submit --packages net.snowflake:snowflake-jdbc:<version>,net.snowflake:spark-snowflake_<scala_version>:<version>-spark_<spark_version> your_script_name.py

Replace your_script_name.py with the full path and actual name of your Python script containing the code above.

Run a SELECT query on Snowflake to verify the loaded table data.

Congratulations! You’ve now completed the initial setup for migrating your Hive table data to Snowflake using a Spark job.

Key Takeaways

Here are some key takeaways from the blog’s process for migrating Hive tables to Snowflake using Spark:

  1. Understanding the Architecture: Before diving into the technical details, it is important to understand the high-level architecture of the data migration process. This includes knowledge of the components involved and how they interact with each other.

  2. Spark Environment Setup: Depending on your environment (Cloudera, GCP, or AWS), a Spark environment setup is necessary. This ensures you have the infrastructure to run Spark jobs for the data migration.

  3. Create a Snowflake Table: Define the target Snowflake table in the Snowflake database. The table structure should match the Hive source table structure to ensure compatibility.

  4. Data Migration (Lift and Shift): Use Spark to read data from a Hive table and write it to a Snowflake table. Configure the Snowflake connection parameters in the script and use the Snowflake connector to perform the data transfer.

  5. Spark Job Creation & Execution: Create a Python script containing Spark code to read from Hive and write to Snowflake. Run the script with spark-submit along with the required packages and configurations. We are using Spark 3.3.2, Scala 2.12.18, Java 11.0.20.1, and Python 3.10, so we can use compatible Spark-Snowflake and Snowflake JDBC packages (net.snowflake:snowflake-jdbc:3.13.5, net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1). You can check the Maven repository pages for the Snowflake JDBC driver and the Spark-Snowflake connector to find the package versions supported for your environment.

  6. Testing and Validation: After executing the Spark job, run select queries against the Snowflake table to verify that the data was loaded successfully. This step ensures the integrity and accuracy of the migrated data.

Closing

By following these steps, you can seamlessly migrate data from Hive tables in Hadoop to Snowflake using Spark, enabling efficient data management and analytics in the cloud.

Is your organization going through a migration to Snowflake?

The experts at phData can help! Schedule a free Snowflake migration assessment today for answers, advice, and next steps to ensure your migration is a success.
